This blog post presents the latest paper list retrieved from Arxiv.org on 2024-09-10. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 10:30 each morning.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically at around 10:30 each day.

Contents

Overview (2024-09-10)

658 papers were updated today, including:

  • Natural language processing: 66 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 169 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 169 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 191 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Link: https://arxiv.org/abs/2409.05840
Authors: Run Luo, Haonan Zhang, Longze Chen, Ting-En Lin, Xiong Liu, Yuchuan Wu, Min Yang, Minzheng Wang, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, Yunshui Li, Xiaobo Xia, Fei Huang, Jingkuan Song, Yongbin Li
Keywords: Large Language Models, Multimodal Large Language, Large Language, Language Models, Multimodal Large
Subjects: Computation and Language (cs.CL)

Abstract: The development of Multimodal Large Language Models (MLLMs) has seen significant advancements. However, the quantity and quality of multimodal instruction data have emerged as significant bottlenecks in their progress. Manually creating multimodal instruction data is both time-consuming and inefficient, posing challenges in producing instructions of high complexity. Moreover, distilling instruction data from black-box commercial models (e.g., GPT-4o, GPT-4V) often results in simplistic instruction data, which constrains performance to that of these models. The challenge of curating diverse and complex instruction data remains substantial. We propose MMEvol, a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. This iterative approach breaks through data quality bottlenecks to generate a complex and diverse image-text instruction dataset, thereby empowering MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we utilize MMEvol to systematically broaden the diversity of instruction types, integrate reasoning steps to enhance cognitive capabilities, and extract detailed information from images to improve visual understanding and robustness. To comprehensively evaluate the effectiveness of our data, we train LLaVA-NeXT using the evolved data and conduct experiments across 13 vision-language tasks. Compared to the baseline trained with seed data, our approach achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of these tasks.
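The evolution loop described above can be sketched schematically. This is our illustration, not MMEvol's actual prompts or code: each round applies three placeholder "evolution" operations (standing in for LLM-driven rewrites along the perception, reasoning, and interaction axes) to every instruction in the pool.

```python
# Schematic of an iterative instruction-evolution loop. The three operations
# below are toy string transforms, not MMEvol's real LLM-driven rewrites.

def perception_evolve(inst):
    return inst + " Describe fine-grained visual details."

def reasoning_evolve(inst):
    return inst + " Explain your reasoning step by step."

def interaction_evolve(inst):
    return inst + " Ask a follow-up question."

def evolve_round(instructions):
    """One evolution round: every instruction spawns three evolved variants."""
    evolved = []
    for inst in instructions:
        for op in (perception_evolve, reasoning_evolve, interaction_evolve):
            evolved.append(op(inst))
    return evolved

seed = ["What is in the image?"]
data = seed
for _ in range(2):  # two evolution rounds
    data = evolve_round(data)
print(len(data))  # 9 candidates after two rounds
```

In the real framework each round would also filter low-quality candidates before the next iteration; the sketch only shows the branching growth of the pool.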

[NLP-1] Improving Pretraining Data Using Perplexity Correlations

Link: https://arxiv.org/abs/2409.05816
Authors: Tristan Thrush, Christopher Potts, Tatsunori Hashimoto
Keywords: high-performance language models, Quality pretraining data, Quality pretraining, language models, key to high-performance
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.
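The core selection idea can be illustrated with a toy sketch (our own, with made-up numbers, not the authors' code): correlate each domain's per-model loss with benchmark accuracy across a set of LLMs, then keep the most correlated domains for pretraining.

```python
# Illustrative sketch: rank pretraining domains by how strongly per-model
# loss on the domain correlates with benchmark accuracy. All numbers are
# invented for illustration.

def pearson(xs, ys):
    """Pearson correlation coefficient, stdlib-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-domain losses for 4 LLMs, and those models' benchmark accuracies.
losses = {
    "domain_a": [2.1, 1.9, 1.7, 1.5],
    "domain_b": [3.0, 3.1, 2.9, 3.0],
}
bench_acc = [0.42, 0.48, 0.55, 0.61]

# Lower loss should track higher accuracy, so correlate the *negated* loss.
corr = {d: pearson([-l for l in ls], bench_acc) for d, ls in losses.items()}
selected = [d for d in sorted(corr, key=corr.get, reverse=True) if corr[d] > 0.5]
print(selected)
```

In the paper the "models" are 90 LLMs from the Open LLM Leaderboard and the domains are tens of thousands of web domains; the sketch only shows the shape of the statistic.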

[NLP-2] Benchmarking Chinese Knowledge Rectification in Large Language Models

Link: https://arxiv.org/abs/2409.05806
Authors: Tianhe Lu, Jizhan Fang, Yunzhi Yao, Xin Xu, Ningyu Zhang, Huajun Chen
Keywords: Large Language Models, exhibit remarkable generative, remarkable generative capabilities, Language Models, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Ongoing work; code and dataset are available at this https URL

Abstract: While Large Language Models (LLMs) exhibit remarkable generative capabilities, they are not without flaws, particularly in the form of hallucinations. This issue is even more pronounced when LLMs are applied to specific languages and domains. For example, LLMs may generate nonsense information when handling Chinese ancient poetry, proverbs, or idioms, owing to the lack of specific knowledge. To this end, this paper introduces a benchmark for rectifying Chinese knowledge in LLMs via knowledge editing. Specifically, we introduce a new Chinese dataset, CKnowEdit, by collecting seven types of knowledge from various sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, thereby accounting for the unique polyphony, antithesis, and logical constructs inherent in the Chinese language. Through the analysis of this dataset, we uncover the challenges faced by current LLMs in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques on this dataset unveils the substantial scope for advancement in the rectification of Chinese knowledge. Code and dataset are available at this https URL.

[NLP-3] PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification

Link: https://arxiv.org/abs/2409.05799
Authors: Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
Keywords: Speaker verification, Speaker verification systems, Debiasing Attention Framework, Phoneme Debiasing Attention, phonetic dominance
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted to SLT

Abstract:Speaker verification systems are crucial for authenticating identity through voice. Traditionally, these systems focus on comparing feature vectors, overlooking the speech’s content. However, this paper challenges this by highlighting the importance of phonetic dominance, a measure of the frequency or duration of phonemes, as a crucial cue in speaker verification. A novel Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with existing attention frameworks to mitigate biases caused by phonetic dominance. PDAF adjusts the weighting for each phoneme and influences feature extraction, allowing for a more nuanced analysis of speech. This approach paves the way for more accurate and reliable identity authentication through voice. Furthermore, by employing various weighting strategies, we evaluate the influence of phonetic features on the efficacy of the speaker verification system.
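A toy sketch of the debiasing intuition (our illustration, not the paper's attention architecture): when pooling frame-level features into an utterance embedding, down-weight frames belonging to over-represented phonemes so that frequent phonemes do not dominate the representation.

```python
# Toy phoneme-debiased pooling: frames of over-represented phonemes are
# down-weighted before averaging. This is a stand-in for the learned
# attention weighting in PDAF, not the paper's actual mechanism.
from collections import Counter

def debiased_pool(frames, phonemes):
    """frames: list of feature vectors; phonemes: per-frame phoneme labels."""
    counts = Counter(phonemes)
    # Rarer phoneme -> larger weight, countering phonetic dominance.
    weights = [1.0 / counts[p] for p in phonemes]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * f[i] for w, f in zip(weights, frames)) / total
            for i in range(dim)]

frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
phonemes = ["AH", "AH", "S"]
print(debiased_pool(frames, phonemes))  # [0.5, 0.5]
```

With naive mean pooling the two "AH" frames would dominate; the inverse-frequency weights give the single "S" frame equal influence.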

[NLP-4] Evidence from fMRI Supports a Two-Phase Abstraction Process in Language Models NEURIPS

Link: https://arxiv.org/abs/2409.05771
Authors: Emily Cheng, Richard J. Antonello
Keywords: hidden states extracted, predict measured brain, measured brain response, Research has repeatedly, intermediate hidden states
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Equal contribution from both authors. Submitted to NeurIPS NeuroAI workshop 2024

Abstract:Research has repeatedly demonstrated that intermediate hidden states extracted from large language models are able to predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most capable for this unique and highly general transfer task? In this work, we show that evidence from language encoding models in fMRI supports the existence of a two-phase abstraction process within LLMs. We use manifold learning methods to show that this abstraction process naturally arises over the course of training a language model and that the first “composition” phase of this abstraction process is compressed into fewer layers as training continues. Finally, we demonstrate a strong correspondence between layerwise encoding performance and the intrinsic dimensionality of representations from LLMs. We give initial evidence that this correspondence primarily derives from the inherent compositionality of LLMs and not their next-word prediction properties.

[NLP-5] Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach

Link: https://arxiv.org/abs/2409.05732
Authors: Meng Zhou, Surajsinh Parmar, Anubhav Bhatti
Keywords: serve linguistically diverse, linguistically diverse populations, potential to serve, serve linguistically, Open-source
Subjects: Computation and Language (cs.CL)
Comments: Technical Report v1, work in progress

Abstract: Open-source, multilingual medical large language models (LLMs) have the potential to serve linguistically diverse populations across different regions. Adapting generic LLMs for healthcare often requires continual pretraining, but this approach is computationally expensive and sometimes impractical. Instruction fine-tuning on a specific task may not always guarantee optimal performance due to the lack of broader domain knowledge that the model needs to understand and reason effectively in diverse scenarios. To address these challenges, we introduce two multilingual instruction fine-tuning datasets, MMed-IFT and MMed-IFT-MC, containing over 200k high-quality medical samples in six languages. We propose a two-stage training paradigm: the first stage injects general medical knowledge using MMed-IFT, while the second stage fine-tunes task-specific multiple-choice questions with MMed-IFT-MC. Our method achieves competitive results on both English and multilingual benchmarks, striking a balance between computational efficiency and performance. We plan to make our dataset and model weights public at this https URL in the future.

[NLP-6] Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding

Link: https://arxiv.org/abs/2409.05721
Authors: Bram Willemsen, Gabriel Skantze
Keywords: referring expression generation, produce referring expressions, visually grounded dialogue, expression generation, referring expression
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at INLG 2024

Abstract:We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.

[NLP-7] RegNLP in Action: Facilitating Compliance Through Automated Information Retrieval and Answer Generation

Link: https://arxiv.org/abs/2409.05677
Authors: Tuba Gokhan, Kexin Wang, Iryna Gurevych, Ted Briscoe
Keywords: governmental regulatory bodies, Natural Language Processing, Regulatory Natural Language, Regulatory Information Retrieval, issued by governmental
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)

Abstract: Regulatory documents, issued by governmental regulatory bodies, establish rules, guidelines, and standards that organizations must adhere to for legal compliance. These documents, characterized by their length, complexity and frequent updates, are challenging to interpret, requiring significant allocation of time and expertise on the part of organizations to ensure ongoing compliance. Regulatory Natural Language Processing (RegNLP) is a multidisciplinary subfield aimed at simplifying access to and interpretation of regulatory rules and obligations. We define an Automated Question-Passage Generation task for RegNLP, create the ObliQA dataset containing 27,869 questions derived from the Abu Dhabi Global Markets (ADGM) financial regulation document collection, design a baseline Regulatory Information Retrieval and Answer Generation system, and evaluate it with RePASs, a novel evaluation metric that tests whether generated answers accurately capture all relevant obligations and avoid contradictions.

[NLP-8] Evaluation of real-time transcriptions using end-to-end ASR models

Link: https://arxiv.org/abs/2409.05674
Authors: Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
Keywords: Automatic Speech Recognition, Automatic Speech, Speech Recognition, greatly evolved, ASR
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 4 figures

Abstract: Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly evolved in the last few years. Traditional architectures based on pipelines have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new AI training methods, such as weak-supervised learning, have reduced the need for high-quality audio datasets for model training. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To achieve real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at any point, as dividing an utterance into two separate fragments will generate an incorrect transcription. Also, shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both the quality of the transcription and the end-to-end delay. The algorithms are fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same model, without audio fragmentation, to determine the effects of this division. The results show that VAD fragmentation provides the best quality with the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in WER for a reduction of 1.5-2 s in delay relative to VAD splitting.
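Two of the three splitting strategies can be sketched on a toy 1-D signal. The function names and the silence threshold below are our assumptions for illustration, not the paper's implementation:

```python
# Sketch of two audio-splitting strategies on a toy signal represented as a
# list of sample magnitudes. Real systems operate on frames of PCM audio.

def split_fixed(signal, chunk):
    """Fragmentation at fixed intervals: cut every `chunk` samples."""
    return [signal[i:i + chunk] for i in range(0, len(signal), chunk)]

def split_on_silence(signal, threshold):
    """Naive VAD-style splitting: cut wherever |sample| falls below threshold."""
    chunks, current = [], []
    for s in signal:
        if abs(s) < threshold:
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(s)
    if current:
        chunks.append(current)
    return chunks

sig = [0.9, 0.8, 0.0, 0.7, 0.6, 0.0, 0.5]
print(split_fixed(sig, 3))         # 3 fixed-length chunks
print(split_on_silence(sig, 0.1))  # 3 voiced segments
```

Note how fixed-interval splitting can cut through voiced regions (hurting transcription quality), while silence-based splitting waits for pauses (raising latency) — the trade-off the paper measures.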

[NLP-9] Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case

Link: https://arxiv.org/abs/2409.05653
Authors: Vagrant Gautam, Julius Steuer, Eileen Bingert, Ray Johns, Anne Lauscher, Dietrich Klakow
Keywords: coreference resolution, important goals, coreference, robustness in coreference, supervised coreference resolution
Subjects: Computation and Language (cs.CL)

Abstract:While measuring bias and robustness in coreference resolution are important goals, such measurements are only as good as the tools we use to measure them with. Winogender schemas (Rudinger et al., 2018) are an influential dataset proposed to evaluate gender bias in coreference resolution, but a closer look at the data reveals issues with the instances that compromise their use for reliable evaluation, including treating different grammatical cases of pronouns in the same way, violations of template constraints, and typographical errors. We identify these issues and fix them, contributing a new dataset: Winogender 2.0. Our changes affect performance with state-of-the-art supervised coreference resolution systems as well as all model sizes of the language model FLAN-T5, with F1 dropping on average 0.1 points. We also propose a new method to evaluate pronominal bias in coreference resolution that goes beyond the binary. With this method and our new dataset which is balanced for grammatical case, we empirically demonstrate that bias characteristics vary not just across pronoun sets, but also across surface forms of those sets.

[NLP-10] ExDDI: Explaining Drug-Drug Interaction Predictions with Natural Language

Link: https://arxiv.org/abs/2409.05592
Authors: Zhaoyue Sun, Jiazheng Li, Gabriele Pergola, Yulan He
Keywords: improving medication safety, unknown drug-drug interactions, Predicting unknown drug-drug, predicting DDI categories, drug-drug interactions
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures

Abstract:Predicting unknown drug-drug interactions (DDIs) is crucial for improving medication safety. Previous efforts in DDI prediction have typically focused on binary classification or predicting DDI categories, with the absence of explanatory insights that could enhance trust in these predictions. In this work, we propose to generate natural language explanations for DDI predictions, enabling the model to reveal the underlying pharmacodynamics and pharmacokinetics mechanisms simultaneously as making the prediction. To do this, we have collected DDI explanations from DDInter and DrugBank and developed various models for extensive experiments and analysis. Our models can provide accurate explanations for unknown DDIs between known drugs. This paper contributes new tools to the field of DDI prediction and lays a solid foundation for further research on generating explanations for DDI predictions.

[NLP-11] MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery

Link: https://arxiv.org/abs/2409.05591
Authors: Hongjin Qian, Peitian Zhang, Zheng Liu, Kelong Mao, Zhicheng Dou
Keywords: large language models, access external databases, language models, optimized context, access external
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Codes and models are in this https URL

Abstract: Retrieval-Augmented Generation (RAG) leverages retrieval tools to access external databases, thereby enhancing the generation quality of large language models (LLMs) through optimized context. However, the existing retrieval methods are constrained inherently, as they can only perform relevance matching between explicitly stated queries and well-formed knowledge, but are unable to handle tasks involving ambiguous information needs or unstructured knowledge. Consequently, existing RAG systems are primarily effective for straightforward question-answering tasks. In this work, we propose MemoRAG, a novel retrieval-augmented generation paradigm empowered by long-term memory. MemoRAG adopts a dual-system architecture. On the one hand, it employs a light but long-range LLM to form the global memory of the database. Once a task is presented, it generates draft answers, cluing the retrieval tools to locate useful information within the database. On the other hand, it leverages an expensive but expressive LLM, which generates the ultimate answer based on the retrieved information. Building on this general framework, we further optimize MemoRAG's performance by enhancing its cluing mechanism and memorization capacity. In our experiment, MemoRAG achieves superior performance across a variety of evaluation tasks, including both complex ones where conventional RAG fails and straightforward ones where RAG is commonly applied.
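The dual-system flow can be sketched structurally as follows. Every function body here is a stub standing in for a model or retriever call; this is our illustration of the control flow, not the released MemoRAG code:

```python
# Structural sketch of a memory-clued RAG pipeline: a light model drafts
# clues from global memory, clues steer retrieval, and an expressive model
# answers from the retrieved evidence. All bodies are toy placeholders.

def light_llm_draft(memory, query):
    # Stand-in for the light, long-range LLM drafting clue answers.
    return [clue for clue in memory if any(w in clue for w in query.split())]

def retrieve(database, clues):
    # Stand-in for a retriever steered by the drafted clues.
    return [doc for doc in database if any(c.split()[0] in doc for c in clues)]

def expressive_llm_answer(query, evidence):
    # Stand-in for the expensive, expressive LLM producing the final answer.
    return f"Answer to '{query}' based on {len(evidence)} passage(s)."

database = ["alpha protocol spec", "beta release notes"]
memory = ["alpha relates to protocols"]

clues = light_llm_draft(memory, "what is alpha")
evidence = retrieve(database, clues)
print(expressive_llm_answer("what is alpha", evidence))
```

The point of the structure is that retrieval is conditioned on the memory model's draft rather than on the raw query alone, which is how ambiguous information needs get grounded.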

[NLP-12] Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation

Link: https://arxiv.org/abs/2409.05583
Authors: Muraleekrishna Gopinathan, Martin Masek, Jumana Abu-Khalaf, David Suter
Keywords: execute human language, aims to develop, execute human, communicate in natural, develop robots
Subjects: Computation and Language (cs.CL)

Abstract:Embodied AI aims to develop robots that can \textitunderstand and execute human language instructions, as well as communicate in natural languages. On this front, we study the task of generating highly detailed navigational instructions for the embodied robots to follow. Although recent studies have demonstrated significant leaps in the generation of step-by-step instructions from sequences of images, the generated instructions lack variety in terms of their referral to objects and landmarks. Existing speaker models learn strategies to evade the evaluation metrics and obtain higher scores even for low-quality sentences. In this work, we propose SAS (Spatially-Aware Speaker), an instruction generator or \textitSpeaker model that utilises both structural and semantic knowledge of the environment to produce richer instructions. For training, we employ a reward learning method in an adversarial setting to avoid systematic bias introduced by language evaluation metrics. Empirically, our method outperforms existing instruction generation models, evaluated using standard metrics. Our code is available at \urlthis https URL.
摘要:嵌入式人工智能的目标是开发能够理解和执行人类语言指令并能用自然语言交流的机器人。在这方面,我们研究了生成高度详细的导航指令以供机器人遵循的任务。尽管最近的研究表明,从图像序列生成循序渐进的指令方面有了显著的飞跃,但生成的指令在涉及物体和地标方面缺乏多样性。现有的说话人模型学习策略来规避评价度量,即使对于低质量的句子也能获得更高的分数。在这项工作中,我们提出了SAS(空间感知说话人),这是一个指令生成器或文本说话人模型,它利用环境的结构和语义知识来产生更丰富的指令。对于训练,我们在对抗性环境下使用奖励学习方法,以避免语言评估指标引入的系统性偏差。经验上,我们的方法优于现有的指令生成模型,使用标准度量进行评估。我们的代码位于此HTTPS URL。

[NLP-13] SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning

Link: https://arxiv.org/abs/2409.05556
Authors: Alireza Ghafarollahi, Markus J. Buehler
Keywords: identifying complex patterns, advancing scientific understanding, uncovering previously unseen, previously unseen connections, vast scientific data
Subjects: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract: A key challenge in artificial intelligence is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. In this work, we present SciAgents, an approach that leverages three core concepts: (1) the use of large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi-agent systems with in-situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses traditional human-driven research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the intelligent system yields material discoveries, critiques and improves existing hypotheses, retrieves up-to-date data about existing research, and highlights their strengths and limitations. Our case studies demonstrate scalable capabilities to combine generative AI, ontological representations, and multi-agent modeling, harnessing a 'swarm of intelligence' similar to biological systems. This provides new avenues for materials discovery and accelerates the development of advanced materials by unlocking Nature's design principles.

[NLP-14] QiBERT – Classifying Online Conversations Messages with BERT as a Feature

Link: https://arxiv.org/abs/2409.05530
Authors: Bruno D. Ferreira-Saraiva, Zuil Pirola, João P. Matos-Carvalho, Manuel Marques-Pita
Keywords: Recent developments, usage in everyday, everyday life, life have caused, caused an explosion
Subjects: Computation and Language (cs.CL)

Abstract: Recent developments in online communication and their usage in everyday life have caused an explosion in the amount of a new genre of text data, short text. Thus, the need to classify this type of text based on its content has significant implications in many areas. Online debates are no exception, since these provide access to information about the opinions, positions and preferences of their users. This paper aims to use data obtained from online social conversations in Portuguese schools (short text) to observe behavioural trends and to see if students remain engaged in the discussion when stimulated. This project used state-of-the-art (SoA) Machine Learning (ML) algorithms and methods, through BERT-based models, to classify whether utterances are in or out of the debate subject. Using SBERT embeddings as a feature, with supervised learning, the proposed model achieved results above 0.95 average accuracy for classifying online messages. Such improvements can help social scientists better understand human communication, behaviour, discussion and persuasion.
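The embeddings-as-features pipeline can be illustrated minimally. Here a toy bag-of-words vectorizer stands in for SBERT embeddings and a nearest-centroid rule stands in for the trained supervised classifier; both substitutions are ours, chosen so the sketch stays self-contained:

```python
# Minimal on-topic/off-topic classifier over fixed sentence vectors: the same
# shape as "embed, then classify", with toy stand-ins for SBERT and the
# supervised model.
VOCAB = ["school", "teacher", "exam", "pizza", "weather"]

def embed(text):
    """Toy sentence embedding: bag-of-words counts over a tiny vocabulary."""
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(VOCAB))]

def classify(text, cents):
    """Assign the label whose class centroid is nearest in embedding space."""
    v = embed(text)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(cents, key=lambda label: dist(cents[label]))

on_topic = ["the teacher set an exam", "our school exam is hard"]
off_topic = ["pizza and nice weather", "weather ruined the pizza"]
cents = {"in": centroid([embed(t) for t in on_topic]),
         "out": centroid([embed(t) for t in off_topic])}
print(classify("exam at school tomorrow", cents))  # in
```

In the actual paper the embeddings come from SBERT and the classifier is trained with supervised learning; only the feature-then-classifier structure carries over.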

[NLP-15] Harmonic Reasoning in Large Language Models

Link: https://arxiv.org/abs/2409.05521
Authors: Anna Kruspe
Keywords: Large Language Models, Large Language, including creative tasks, Language Models, including creative
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Abstract:Large Language Models (LLMs) are becoming very popular and are used for many different purposes, including creative tasks in the arts. However, these models sometimes have trouble with specific reasoning tasks, especially those that involve logical thinking and counting. This paper looks at how well LLMs understand and reason when dealing with musical tasks like figuring out notes from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o to see how they handle these tasks. Our results show that while LLMs do well with note intervals, they struggle with more complicated tasks like recognizing chords and scales. This points out clear limits in current LLM abilities and shows where we need to make them better, which could help improve how they think and work in both artistic and other complex areas. We also provide an automatically generated benchmark data set for the described tasks.
摘要:大型语言模型(LLM)变得非常受欢迎,并被用于许多不同的目的,包括艺术中的创意任务。然而,这些模型有时在特定的推理任务上会遇到困难,尤其是那些涉及逻辑思维和计数的任务。本文探讨了LLM在处理音乐任务(例如根据音程推算音符以及识别和弦和音阶)时的理解和推理能力。我们测试了GPT-3.5和GPT-4o,以了解它们如何处理这些任务。我们的结果表明,虽然LLM在音程任务上表现出色,但在识别和弦和音阶等更复杂的任务上却遇到了困难。这指出了当前LLM能力的明显局限,并表明了需要改进的方向,这可能有助于改善它们在艺术和其他复杂领域的思维和工作方式。我们还为所描述的任务提供了自动生成的基准数据集。
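摘要中的音程与和弦问题有确定性的标准答案,这大概也是此类基准可以自动生成的原因。下面是一个标准乐理的最小音高类计算器(通用乐理知识,并非论文代码):

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_from_interval(root: str, semitones: int) -> str:
    """Pitch class reached by moving `semitones` up from `root` (octave-folded)."""
    return NOTE_NAMES[(NOTE_NAMES.index(root) + semitones) % 12]

def major_triad(root: str) -> list:
    """Root, major third (4 semitones), and perfect fifth (7 semitones)."""
    return [note_from_interval(root, s) for s in (0, 4, 7)]

print(note_from_interval("C", 7))  # -> G (a perfect fifth above C)
print(major_triad("A"))            # -> ['A', 'C#', 'E']
```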

[NLP-16] Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models
[NLP-16] 爱思唯尔竞技场:化学/生物/健康基础大型语言模型的人类评估

链接: https://arxiv.org/abs/2409.05486
作者: Camilo Thorne,Christian Druckenbrodt,Kinga Szarkowska,Deepika Goyal,Pranita Marajan,Vijay Somanath,Corey Harper,Mao Yan,Tony Scerri
关键词-EN: assessed with automated, quality and capabilities, fully assessed, benchmark evaluations, Abstract
关键词-ZH: 通过自动化、质量和能力进行评估、全面评估、基准评估、摘要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 tables, 6 figures

点击查看摘要

Abstract:The quality and capabilities of large language models cannot be currently fully assessed with automated, benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from natural language generation literature are required. One recent best-practice consists in using A/B-testing frameworks, which capture preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it a large but not massive (8.8B parameter) decoder-only foundational transformer trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets is compared to OpenAI’s GPT-3.5-turbo and Meta’s foundational 7B parameter Llama 2 model against multiple criteria. Results indicate – even if IRR scores were generally low – a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large, and were trained on very large datasets. But at the same time, they indicate that for less massive models training on smaller but well-curated training sets can potentially give rise to viable alternatives in the biomedical domain.
摘要:大型语言模型的质量和能力目前无法通过自动化的基准评估来完全评估。取而代之的是,需要从自然语言生成文献中扩展传统定性技术的人类评估。最近的一个最佳实践是使用A/B测试框架,该框架捕获人类评估者对特定模型的偏好。在这篇文章中,我们描述了在爱思唯尔进行的一项专注于生物医学领域(健康、生物、化学/药理学)的人类评估实验。实验中,将一个在相对较小(135B词元)但高度精选的爱思唯尔数据集集合上训练的、大型但并非超大规模(8.8B参数)的仅解码器基础Transformer,与OpenAI的GPT-3.5-turbo和Meta的基础7B参数Llama 2模型按多项标准进行了比较。结果表明–即使IRR分数普遍较低–评估者更偏好GPT-3.5-turbo,也即更偏好具备对话能力、规模巨大且在超大数据集上训练的模型。但与此同时,结果也表明,对于规模较小的模型,在较小但精心策划的训练集上进行训练可能会在生物医学领域产生可行的替代方案。

[NLP-17] Representational Analysis of Binding in Large Language Models
[NLP-17] 大型语言模型中绑定的表示分析

链接: https://arxiv.org/abs/2409.05448
作者: Qin Dai,Benjamin Heinzerling,Kentaro Inui
关键词-EN: Box, complex reasoning, essential for complex, Entity tracking, Entity
关键词-ZH: 盒子,复杂推理,复杂必不可少,实体跟踪,实体
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Entity tracking is essential for complex reasoning. To perform in-context entity tracking, language models (LMs) must bind an entity to its attribute (e.g., bind a container to its content) to recall the attribute for a given entity. For example, given a context mentioning "The coffee is in Box Z, the stone is in Box M, the map is in Box H", to infer "Box Z contains the coffee" later, LMs must bind "Box Z" to "coffee". To explain the binding behaviour of LMs, Feng and Steinhardt (2023) introduce a Binding ID mechanism and state that LMs use an abstract concept called Binding ID (BI) to internally mark entity-attribute pairs. However, they have not directly captured the BI determinant information from entity activations. In this work, we provide a novel view of the Binding ID mechanism by localizing the prototype of BI information. Specifically, we discover that there exists a low-rank subspace in the hidden state (or activation) of LMs, that primarily encodes the order of entity and attribute and which is used as the prototype of BI to causally determine the binding. To identify this subspace, we choose principal component analysis as our first attempt and it is empirically proven to be effective. Moreover, we also discover that when editing representations along directions in the subspace, LMs tend to bind a given entity to other attributes accordingly. For example, by patching activations along the BI encoding direction we can make the LM infer "Box Z contains the stone" and "Box Z contains the map".
摘要:实体跟踪是复杂推理的基础。要执行上下文中的实体跟踪,语言模型(LM)必须将实体绑定到其属性(例如,将容器绑定到其内容),以便回忆给定实体的属性。例如,给出一个上下文,其中提到"咖啡在盒子Z中,石头在盒子M中,地图在盒子H中",为了在之后推断出"盒子Z包含咖啡",LM必须将"盒子Z"绑定到"咖啡"。为了解释LM的绑定行为,Feng和Steinhardt(2023)引入了绑定ID机制,并指出LM使用一个称为绑定ID(BI)的抽象概念在内部标记实体-属性对。然而,他们并没有直接从实体激活中捕获BI决定信息。在这项工作中,我们通过定位BI信息的原型,提供了关于绑定ID机制的新视角。具体地说,我们发现在LM的隐藏状态(或激活)中存在一个低秩子空间,它主要编码实体和属性的顺序,并作为BI的原型来因果地决定绑定。为了识别这个子空间,我们首先尝试了主成分分析,并通过实验证明它是有效的。此外,我们还发现,当沿子空间中的方向编辑表示时,LM往往会相应地将给定实体绑定到其他属性。例如,通过沿着BI编码方向修补激活,我们可以使LM推断出"盒子Z包含石头"和"盒子Z包含地图"。
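论文用主成分分析定位编码"实体顺序"的低秩子空间,这一思路可以在合成数据上做个草图演示:我们在伪造的激活中植入一维"实体先后"信号,然后验证PCA的第一主成分能恢复该方向。数据规模和维度均为演示用的假设,并非真实的语言模型激活。

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200
# Synthetic "hidden states": a 1-D order signal (entity mentioned 1st vs 2nd)
# along a fixed direction, plus isotropic noise -- mimicking the claim that
# binding-order information lives in a low-rank subspace of the activations.
order = rng.choice([-1.0, 1.0], size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
H = 3.0 * np.outer(order, direction) + rng.normal(scale=0.3, size=(n, d))

# PCA via SVD of the centered data: the top principal component should
# recover the order-encoding direction, as the paper finds in real LMs.
Hc = H - H.mean(axis=0)
_, _, Vt = np.linalg.svd(Hc, full_matrices=False)
alignment = abs(float(Vt[0] @ direction))
print(f"alignment of PC1 with order direction: {alignment:.3f}")
```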

[NLP-18] STLM Engineering Report: Dropout
[NLP-18] STLM工程报告:辍学

链接: https://arxiv.org/abs/2409.05423
作者: Dylan Hillier,Leon Guertler,Bobby Cheng,Cheston Tan
关键词-EN: modern language models, work we explore, models, high quality datasets, modern language
关键词-ZH: 现代语言模型、我们探索的工作、模型、高质量数据集、现代语言
类目: Computation and Language (cs.CL)
备注: 6 pages, 3 figures, For code base see this https URL

点击查看摘要

Abstract:In this work we explore the relevance of dropout for modern language models, particularly in the context of models on the scale of 100M parameters. We explore its relevance firstly in the regime of improving the sample efficiency of models given small, high quality datasets, and secondly in the regime of improving the quality of its fit on larger datasets where models may underfit. We find that, concordant with conventional wisdom, dropout remains effective in the overfitting scenario, and that furthermore it may have some relevance for improving the fit of models even in the case of excess data, as suggested by previous research. In the process we find that the existing explanation for the mechanism behind this performance gain is not applicable in the case of language modelling.
摘要:在这项工作中,我们探索了dropout对现代语言模型的意义,特别是在1亿参数规模的模型的背景下。我们首先在利用小规模高质量数据集提高模型样本效率的场景中探索其作用,其次在提高模型在可能欠拟合的较大数据集上的拟合质量的场景中探索其作用。我们发现,与传统观点一致,dropout在过拟合场景中仍然有效,而且正如之前的研究所表明的那样,即使在数据充足的情况下,它也可能对改善模型的拟合有一定帮助。在此过程中,我们发现对这种性能提升背后机制的现有解释并不适用于语言建模的情况。
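作为参考,论文研究的dropout操作本身可以用几行代码写出;下面是教科书式的"反向dropout"(inverted dropout)公式,并非论文中的代码:

```python
import numpy as np

def dropout(x: np.ndarray, p: float, rng, train: bool = True) -> np.ndarray:
    """Inverted dropout: at train time, zero each unit with probability p and
    scale survivors by 1/(1-p), so the expected activation matches eval mode."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(10_000)
dropped = dropout(x, p=0.3, rng=rng)
print(dropped.mean())  # close to 1.0: the expectation is preserved
print(dropout(x, p=0.3, rng=rng, train=False) is x)  # eval mode is the identity
```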

[NLP-19] NLLB-E5: A Scalable Multilingual Retrieval Model
[NLP-19] NLLB-E5:可扩展的多语言检索模型

链接: https://arxiv.org/abs/2409.05401
作者: Arkadeep Acharya,Rudra Murthy,Vishwajeet Kumar,Jaydeep Sen
关键词-EN: effectively supporting multiple, remains a critical, significant progress, capable of effectively, effectively supporting
关键词-ZH: 有效支持多个,仍然是一个关键、重大的进步,能够有效、有效地支持
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite significant progress in multilingual information retrieval, the lack of models capable of effectively supporting multiple languages, particularly low-resource like Indic languages, remains a critical challenge. This paper presents NLLB-E5: A Scalable Multilingual Retrieval Model. NLLB-E5 leverages the in-built multilingual capabilities in the NLLB encoder for translation tasks. It proposes a distillation approach from multilingual retriever E5 to provide a zero-shot retrieval approach handling multiple languages, including all major Indic languages, without requiring multilingual training data. We evaluate the model on a comprehensive suite of existing benchmarks, including Hindi-BEIR, highlighting its robust performance across diverse languages and tasks. Our findings uncover task and domain-specific challenges, providing valuable insights into the retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and language-agnostic text retrieval model, advancing the field of multilingual information access and promoting digital inclusivity for millions of users globally.
摘要:尽管多语言信息检索已取得很大进展,但缺乏能够有效支持多种语言(特别是像印度诸语言这样的低资源语言)的模型,仍然是一个严峻的挑战。本文提出了一种可扩展的多语言检索模型NLLB-E5。NLLB-E5利用NLLB编码器中为翻译任务内置的多语言能力。它提出了一种从多语言检索器E5进行蒸馏的方法,提供了一种无需多语言训练数据即可处理多种语言(包括所有主要印度语言)的零样本检索方法。我们在包括Hindi-BEIR在内的一套全面的现有基准上评估了该模型,强调了它在不同语言和任务中的稳健表现。我们的发现揭示了任务和领域特定的挑战,为检索性能提供了有价值的见解,特别是对于低资源语言。NLLB-E5满足了对包容性、可扩展且与语言无关的文本检索模型的迫切需求,推动了多语言信息获取领域的发展,并促进了全球数百万用户的数字包容性。
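摘要没有给出蒸馏目标的具体形式,但把学生检索器对齐到教师嵌入空间的一种常见做法,是在L2归一化后的嵌入上计算MSE损失。下面的草图使用虚构的向量,应视为对这一通用技术的假设性示意,而非论文的实际损失函数:

```python
import numpy as np

def distillation_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """MSE between L2-normalised sentence embeddings: pulls a multilingual
    student (e.g. an NLLB encoder) toward a teacher retriever (e.g. E5).
    One common formulation; the paper's exact objective may differ."""
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))

teacher = np.array([[1.0, 0.0], [0.0, 2.0]])
print(distillation_loss(teacher, teacher))  # 0.0: identical embeddings
print(distillation_loss(np.array([[0.0, 1.0]]),
                        np.array([[1.0, 0.0]])) > 0)  # True: mismatch penalised
```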

[NLP-20] Towards Building a Robust Knowledge Intensive Question Answering Model with Large Language Models NLPCC-2024
[NLP-20] 使用大型语言模型构建稳健的知识密集型问答模型

链接: https://arxiv.org/abs/2409.05385
作者: Hong Xingyun Hong,Shao Yan Shao,Wang Zhilin Wang,Duan Manni Duan,Jin Xiongnan
关键词-EN: question answering, utilize external information, greatly enhanced, enhanced the intelligence, intelligence and fluency
关键词-ZH: 问答,利用外部信息,大大增强,增强了智力、智力和流畅性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by NLPCC-2024

点击查看摘要

Abstract:The development of LLMs has greatly enhanced the intelligence and fluency of question answering, while the emergence of retrieval enhancement has enabled models to better utilize external information. However, the presence of noise and errors in retrieved information poses challenges to the robustness of LLMs. In this work, to evaluate the model’s performance under multiple interferences, we first construct a dataset based on machine reading comprehension datasets simulating various scenarios, including critical information absence, noise, and conflicts. To address the issue of model accuracy decline caused by noisy external information, we propose a data augmentation-based fine-tuning method to enhance LLM’s robustness against noise. Additionally, contrastive learning approach is utilized to preserve the model’s discrimination capability of external information. We have conducted experiments on both existing LLMs and our approach, the results are evaluated by GPT-4, which indicates that our proposed methods improve model robustness while strengthening the model’s discrimination capability.
摘要:LLM的发展极大地提高了问答的智能性和流畅性,而检索增强的出现使模型能够更好地利用外部信息。然而,检索到的信息中存在的噪声和错误对LLM的鲁棒性提出了挑战。在这项工作中,为了评估模型在多种干扰下的性能,我们首先基于机器阅读理解数据集构建了一个数据集,模拟了包括关键信息缺失、噪声和冲突在内的各种场景。针对含噪外部信息导致模型精度下降的问题,我们提出了一种基于数据增强的微调方法,以增强LLM对噪声的鲁棒性。此外,我们采用对比学习方法来保持模型对外部信息的辨别能力。我们在现有LLM和我们的方法上进行了实验,结果由GPT-4评估,表明我们提出的方法在增强模型辨别能力的同时提高了模型的鲁棒性。
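按照上述抗噪微调的思路,一条训练样本可以通过在黄金证据段落周围混入干扰段落来构造。下面的辅助函数是对这种数据增强的示意性猜测(论文还覆盖了关键信息缺失和冲突等场景,此处省略),并非论文的实际构造方法:

```python
import random

def augment_with_noise(question: str, gold_passage: str, noise_pool: list,
                       k: int = 2, seed: int = 0) -> dict:
    """Build one fine-tuning example: mix the gold passage with k distractor
    passages and shuffle, so the model must answer despite noisy retrieval.
    Illustrative only -- not the paper's actual construction."""
    rng = random.Random(seed)
    passages = [gold_passage] + rng.sample(noise_pool, k)
    rng.shuffle(passages)
    return {"question": question, "context": passages}

ex = augment_with_noise(
    "Who wrote Hamlet?",
    "Hamlet is a tragedy written by William Shakespeare.",
    ["The Eiffel Tower is in Paris.", "Water boils at 100 C.",
     "The Nile is a river in Africa."])
print(len(ex["context"]))  # 3 passages: 1 gold + 2 noise
print(any("Shakespeare" in p for p in ex["context"]))  # gold evidence survives
```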

[NLP-21] Application Specific Compression of Deep Learning Models KDD
[NLP-21] 深度学习模型的特定应用压缩

链接: https://arxiv.org/abs/2409.05368
作者: Rohit Raj Rai,Angana Borah,Amit Awekar
关键词-EN: Large Deep Learning, Deep Learning model, Deep Learning, current Deep Learning, Learning model compression
关键词-ZH: 大型深度学习、深度学习模型、深度学习、当前深度学习、学习模型压缩
类目: Computation and Language (cs.CL)
备注: Accepted in the Proceedings of the 8th Joint International Conference on Data Science Management of Data (12th ACM IKDD CODS and 30th COMAD) for the Short Research Paper track, 5 pages

点击查看摘要

Abstract:Large Deep Learning models are compressed and deployed for specific applications. However, current Deep Learning model compression methods do not utilize the information about the target application. As a result, the compressed models are application agnostic. Our goal is to customize the model compression process to create a compressed model that will perform better for the target application. Our method, Application Specific Compression (ASC), identifies and prunes components of the large Deep Learning model that are redundant specifically for the given target application. The intuition of our work is to prune the parts of the network that do not contribute significantly to updating the data representation for the given application. We have experimented with the BERT family of models for three applications: Extractive QA, Natural Language Inference, and Paraphrase Identification. We observe that customized compressed models created using ASC method perform better than existing model compression methods and off-the-shelf compressed models.
摘要:大型深度学习模型针对特定应用进行了压缩和部署。然而,目前的深度学习模型压缩方法没有利用关于目标应用的信息。因此,压缩模型与应用程序无关。我们的目标是定制模型压缩过程,以创建更适合目标应用程序的压缩模型。我们的方法,特定于应用程序的压缩(ASC),识别并修剪大型深度学习模型中特定于给定目标应用程序的冗余组件。我们工作的直觉是修剪网络中对更新给定应用程序的数据表示没有重大贡献的部分。我们已经试验了BERT系列模型的三个应用:提取问答、自然语言推理和释义识别。我们观察到,使用ASC方法创建的定制压缩模型比现有的模型压缩方法和现成的压缩模型执行得更好。
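"修剪对给定应用的表示更新贡献小的组件"这一思想,可以在单个线性层上做个草图:这里把"组件"缩小为权重矩阵的一行,而ASC评分和修剪的是整个模型组件,所以下面只是该思想的类比,并非论文方法本身:

```python
import numpy as np

def prune_by_contribution(W: np.ndarray, X: np.ndarray, keep_ratio: float = 0.5):
    """Toy analogue of Application Specific Compression: score each output
    unit of a linear layer by its mean activation magnitude on application
    data X, then zero out the weakest units."""
    contrib = np.abs(X @ W.T).mean(axis=0)          # per-unit contribution on X
    k = max(1, int(round(len(contrib) * keep_ratio)))
    keep = np.argsort(contrib)[-k:]                 # indices of strongest units
    W_pruned = np.zeros_like(W)
    W_pruned[keep] = W[keep]
    return W_pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))      # hypothetical layer: 8 units, 4 inputs
X = rng.normal(size=(32, 4))     # hypothetical application data
Wp = prune_by_contribution(W, X, keep_ratio=0.5)
print(int((np.abs(Wp).sum(axis=1) > 0).sum()))  # 4 of 8 units kept
```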

[NLP-22] Diagnostic Reasoning in Natural Language: Computational Model and Application
[NLP-22] 自然语言诊断推理:计算模型与应用

链接: https://arxiv.org/abs/2409.05367
作者: Nils Dycke,Matej Zečević,Ilia Kuznetsov,Beatrix Suess,Kristian Kersting,Iryna Gurevych
关键词-EN: key component, component of expert, expert work, Diagnostic reasoning, diagnostic abductive reasoning
关键词-ZH: 关键组件,专家组件,专家工作,诊断推理,诊断回溯推理
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diagnostic reasoning is a key component of expert work in many domains. It is a hard, time-consuming activity that requires expertise, and AI research has investigated the ways automated systems can support this process. Yet, due to the complexity of natural language, the applications of AI for diagnostic reasoning to language-related tasks are lacking. To close this gap, we investigate diagnostic abductive reasoning (DAR) in the context of language-grounded tasks (NL-DAR). We propose a novel modeling framework for NL-DAR based on Pearl’s structural causal models and instantiate it in a comprehensive study of scientific paper assessment in the biomedical domain. We use the resulting dataset to investigate the human decision-making process in NL-DAR and determine the potential of LLMs to support structured decision-making over text. Our framework, open resources and tools lay the groundwork for the empirical study of collaborative diagnostic reasoning in the age of LLMs, in the scholarly domain and beyond.
摘要:诊断推理是许多领域专家工作的重要组成部分。这是一项困难、耗时且需要专业知识的活动,人工智能研究已经探索了自动化系统如何支持这一过程。然而,由于自然语言的复杂性,将人工智能诊断推理应用于语言相关任务的研究还很少。为了缩小这一差距,我们在基于语言的任务背景下研究了诊断溯因推理(DAR),即NL-DAR。基于Pearl的结构因果模型,我们提出了一种新的NL-DAR建模框架,并将其实例化于生物医学领域科技论文评审的综合研究中。我们使用所得数据集来研究NL-DAR中的人类决策过程,并确定LLM支持基于文本的结构化决策的潜力。我们的框架、开放资源和工具为LLM时代学术领域及其他领域的协作诊断推理实证研究奠定了基础。

[NLP-23] IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
[NLP-23] IndicVoices-R:解锁大规模多语言多说话者语音语料库,用于扩展印度TTS

链接: https://arxiv.org/abs/2409.05356
作者: Ashwin Sankar,Srija Anand,Praveen Srinivasa Varadhan,Sherry Thomas,Mehak Singal,Shridhar Kumar,Deovrat Mehendale,Aditi Krishana,Giri Raju,Mitesh Khapra
关键词-EN: highly natural-sounding output, Recent advancements, produce highly natural-sounding, Indian, Indian languages
关键词-ZH: 听起来非常自然的输出,最近的进步,产生了听起来非常自然的印度语,印度语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data containing diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.
摘要:文本到语音(TTS)合成的最新进展表明,用大量的网络数据训练的大规模模型可以产生听起来非常自然的输出。然而,由于LibriVox或YouTube等平台上缺乏高质量的手动字幕数据,印度语言的此类数据很少。为了弥补这一差距,我们增强了现有的包含在低质量环境中收集的自然对话的大规模ASR数据集,以生成高质量的TTS训练数据。我们的流程利用了针对英语训练并应用于印度语言的去噪和语音增强模型的跨语言泛化能力。由此产生了IndicVoices-R(IV-R),这是从ASR数据集派生的最大的多语言印度TTS数据集,包含来自22种印度语言的10,496名说话者的1704小时高质量语音。IV-R与LJSpeech、LibriTTS和IndicTTS等黄金标准TTS数据集的质量相当。我们还引入了IV-R基准,这是第一个评估TTS模型对印度语音的零样本、少样本和多样本说话人泛化能力的基准,确保了年龄、性别和风格的多样性。我们证明,与仅在IndicTTS数据集上进行微调相比,在高质量IndicTTS和我们的IV-R数据集的组合数据集上微调英语预训练模型可以获得更好的零样本说话人泛化。此外,我们的评估显示,在先前数据集上训练的TTS模型对印度语音的零样本泛化有限,我们通过在包含跨语系多样化说话者的数据上微调模型来改进这一点。我们将所有数据和代码开源,发布了覆盖全部22种印度官方语言的第一个TTS模型。

[NLP-24] Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment Hate Speech and Anxiety Analysis
[NLP-24] Instagram上的Mpox叙述:Mpox上Instagram帖子的标签多语言数据集,用于情绪仇恨言论和焦虑分析

链接: https://arxiv.org/abs/2409.05292
作者: Nirmalya Thakur
关键词-EN: Public Health Emergency, Public Health, Health Emergency, Emergency of International, International Concern
关键词-ZH: 突发公共卫生事件,公共卫生,突发卫生事件,国际紧急事件,国际关注
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. No prior work related to social media mining has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper aims to address this research gap and makes two scientific contributions to this field. First, it presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. The dataset, available at this https URL, contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were performed. This process included classifying each post into (i) one of the sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral, (ii) hate or not hate, and (iii) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset. Second, this paper presents the results of performing sentiment analysis, hate speech analysis, and anxiety or stress analysis. The variation of the sentiment classes - fear, surprise, joy, sadness, anger, disgust, and neutral were observed to be 27.95%, 2.57%, 8.69%, 5.94%, 2.69%, 1.53%, and 50.64%, respectively. In terms of hate speech detection, 95.75% of the posts did not contain hate and the remaining 4.25% of the posts contained hate. Finally, 72.05% of the posts did not indicate any anxiety/stress, and the remaining 27.95% of the posts represented some form of anxiety/stress.
摘要:目前全球正在经历mpox(猴痘)疫情,该疫情已被世界卫生组织宣布为国际关注的突发公共卫生事件。此前没有任何与社交媒体挖掘相关的工作致力于构建关于mpox疫情的Instagram帖子数据集。本文的工作旨在填补这一研究空白,并对这一领域做出了两项科学贡献。首先,它提供了一个多语言数据集,其中包括2022年7月23日至2024年9月5日期间发布的关于mpox的60,127条Instagram帖子。该数据集可从文中的HTTPS链接获得,其中包含52种语言的关于mpox的Instagram帖子。对于这些帖子中的每一个,帖子ID、帖子描述、发布日期、语言和帖子的翻译版本(使用Google Translate API翻译成英语)在数据集中作为单独的属性呈现。在构建该数据集之后,进行了情绪分析、仇恨言论检测以及焦虑或压力检测。这一过程包括将每个帖子分类为(i)一种情绪类别,即恐惧、惊讶、喜悦、悲伤、愤怒、厌恶或中性,(ii)仇恨或非仇恨,以及(iii)检测到焦虑/压力或未检测到焦虑/压力。这些结果在数据集中显示为单独的属性。其次,本文给出了情绪分析、仇恨言论分析和焦虑或压力分析的结果。恐惧、惊讶、喜悦、悲伤、愤怒、厌恶和中性等情绪类别的占比分别为27.95%、2.57%、8.69%、5.94%、2.69%、1.53%和50.64%。在仇恨言论检测方面,95.75%的帖子不包含仇恨,其余4.25%的帖子包含仇恨。最后,72.05%的帖子没有表现出任何焦虑/压力,其余27.95%的帖子表现出某种形式的焦虑/压力。
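文中报告的各情绪类别百分比本质上是简单的标签占比统计;下面这个辅助函数(假设性示例,并非论文代码)可对任意带标签样本复现这种汇总:

```python
from collections import Counter

def label_distribution(labels) -> dict:
    """Percentage share of each label, rounded to two decimals -- the same
    kind of summary the paper reports for its sentiment classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lab: round(100.0 * n / total, 2) for lab, n in counts.items()}

dist = label_distribution(["fear", "fear", "neutral", "joy"])
print(dist)  # {'fear': 50.0, 'neutral': 25.0, 'joy': 25.0}
```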

[NLP-25] Seek and Solve Reasoning for Table Question Answering
[NLP-25] 面向表格问答的寻找与求解推理

链接: https://arxiv.org/abs/2409.05286
作者: Ruya Jiang,Chun Wang,Weihong Deng
关键词-EN: Table-based Question Answering, involves answering questions, answering questions based, Large Language Models, involves answering
关键词-ZH: 基于表格的问题解答,涉及回答问题,基于问题的回答,大型语言模型,涉及回答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Table-based Question Answering (TQA) involves answering questions based on tabular data. The complexity of table structures and question logic makes this task difficult even for Large Language Models (LLMs). This paper improves TQA performance by leveraging LLMs’ reasoning capabilities. Inspired by how humans solve TQA tasks, we propose a Seek-and-Solve pipeline that instructs the LLM to first seek relevant information and then answer questions. The two stages are integrated at the reasoning level, and their Chain of Thought (CoT) paths are integrated into a coherent Seek-and-Solve CoT (SS-CoT). Furthermore, we present a compact single-stage TQA-solving prompt distilled from the pipeline. Experiments demonstrate that under In-Context Learning settings, using samples with SS-CoT paths as demonstrations, the TQA-solving prompt can effectively guide the LLM to solve complex TQA tasks, resulting in improved performance and reliability. Our results highlight the importance of properly eliciting LLMs’ reasoning capabilities in solving complex TQA tasks.
摘要:基于表格的问答(TQA)是指根据表格数据回答问题。表结构和问题逻辑的复杂性使这项任务即使对于大型语言模型(LLM)也很困难。本文利用LLM的推理能力来提高TQA的性能。受人类解决TQA任务方式的启发,我们提出了一种寻找与求解(Seek-and-Solve)流水线,指示LLM首先寻找相关信息,然后回答问题。这两个阶段在推理层面被集成在一起,它们的思维链(CoT)路径被整合为一条连贯的寻找与求解CoT(SS-CoT)。此外,我们还提出了从该流水线中提炼出的紧凑的单阶段TQA求解提示。实验表明,在上下文学习(In-Context Learning)设置下,以带有SS-CoT路径的样本作为演示,TQA求解提示可以有效地指导LLM求解复杂的TQA任务,从而提高性能和可靠性。我们的结果强调了在解决复杂TQA任务时恰当激发LLM推理能力的重要性。
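两阶段的寻找与求解思想可以草绘为一对提示:第一段让模型定位证据,第二段让模型基于证据作答。下面的措辞为示意性杜撰,并非论文的真实提示模板:

```python
def seek_and_solve_prompts(table_md: str, question: str):
    """Build a Seek prompt and a Solve prompt factory, in the spirit of the
    paper's two-stage pipeline. Wording is illustrative only."""
    seek = (f"Table:\n{table_md}\n"
            f"Question: {question}\n"
            "Step 1 (Seek): list the rows and columns needed to answer.")

    def solve(evidence: str) -> str:
        # Stage 2 conditions on the evidence found in stage 1.
        return (f"Table:\n{table_md}\n"
                f"Question: {question}\n"
                f"Evidence found: {evidence}\n"
                "Step 2 (Solve): answer the question using only this evidence.")

    return seek, solve

seek, solve = seek_and_solve_prompts(
    "| city | population |\n| Oslo | 0.7M |", "Which city is listed?")
print("Which city is listed?" in seek)          # question appears in stage 1
print("Evidence found: row Oslo" in solve("row Oslo"))  # evidence fed to stage 2
```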

[NLP-26] On the Relationship between Truth and Political Bias in Language Models
[NLP-26] 论语言模型中真理与政治偏见的关系

链接: https://arxiv.org/abs/2409.05283
作者: Suyash Fulay,William Brannon,Shrestha Mohanty,Cassandra Overney,Elinor Poole-Dayan,Deb Roy,Jad Kabbara
关键词-EN: Language model alignment, helpful and harmless, truthful and unbiased, research often attempts, attempts to ensure
关键词-ZH: 语言模型对齐,有帮助且无害,真实且公正,研究经常尝试,试图确保
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language model alignment research often attempts to ensure that models are not only helpful and harmless, but also truthful and unbiased. However, optimizing these objectives simultaneously can obscure how improving one aspect might impact the others. In this work, we focus on analyzing the relationship between two concepts essential in both language model alignment and political science: truthfulness and political bias. We train reward models on various popular truthfulness datasets and subsequently evaluate their political bias. Our findings reveal that optimizing reward models for truthfulness on these datasets tends to result in a left-leaning political bias. We also find that existing open-source reward models (i.e. those trained on standard human preference datasets) already show a similar bias and that the bias is larger for larger models. These results raise important questions about both the datasets used to represent truthfulness and what language models capture about the relationship between truth and politics.
摘要:语言模型对齐研究往往试图确保模型不仅有益且无害,而且真实且无偏见。然而,同时优化这些目标可能会掩盖改进一个方面对其他方面的影响。在这项工作中,我们重点分析语言模型对齐和政治学中两个重要概念之间的关系:真实性(truthfulness)与政治偏见(political bias)。我们在各种流行的真实性数据集上训练奖励模型,并随后评估它们的政治偏见。我们的发现表明,在这些数据集上优化奖励模型的真实性往往会导致左倾的政治偏见。我们还发现,现有的开源奖励模型(即那些在标准人类偏好数据集上训练的模型)已经显示出类似的偏见,并且模型越大,偏见越大。这些结果就用于表示真实性的数据集以及语言模型所捕捉到的真相与政治之间的关系提出了重要问题。

[NLP-27] RexUniNLU: Recursive Method with Explicit Schema Instructor for Universal NLU
[NLP-27] RexUniNLU:用于通用NLU的带显式模式指导的递归方法

链接: https://arxiv.org/abs/2409.05275
作者: Chengyuan Liu,Shihang Wang,Fubang Zhao,Kun Kuang,Yangyang Kang,Weiming Lu,Changlong Sun,Fei Wu
关键词-EN: Text Classification, universal information extraction, CLS, Information Extraction, fundamental pillars
关键词-ZH: 文本分类、通用信息提取、CLS、信息提取、基本支柱
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2304.14770

点击查看摘要

Abstract:Information Extraction (IE) and Text Classification (CLS) serve as the fundamental pillars of NLU, with both disciplines relying on analyzing input sequences to categorize outputs into pre-established schemas. However, there is no existing encoder-based model that can unify IE and CLS tasks from this perspective. To fully explore the foundation shared within NLU tasks, we have proposed a Recursive Method with Explicit Schema Instructor for Universal NLU. Specifically, we firstly redefine the true universal information extraction (UIE) with a formal formulation that covers almost all extraction schemas, including quadruples and quintuples which remain unsolved for previous UIE models. Then, we expand the formulation to all CLS and multi-modal NLU tasks. Based on that, we introduce RexUniNLU, an universal NLU solution that employs explicit schema constraints for IE and CLS, which encompasses all IE and CLS tasks and prevent incorrect connections between schema and input sequence. To avoid interference between different schemas, we reset the position ids and attention mask matrices. Extensive experiments are conducted on IE, CLS in both English and Chinese, and multi-modality, revealing the effectiveness and superiority. Our codes are publicly released.
摘要:信息抽取(IE)和文本分类(CLS)是自然语言理解的基本支柱,这两个学科都依赖于对输入序列的分析来将输出归类为预先建立的模式。然而,目前还没有一个基于编码器的模型可以从这个角度统一IE和CLS任务。为了充分挖掘自然语言理解任务之间共享的基础,我们提出了一种具有显式模式指导的通用自然语言理解递归方法。具体地说,我们首先用一个形式化的形式重新定义了真正的通用信息抽取(UIE),它涵盖了几乎所有的抽取模式,包括以前的UIE模型中尚未解决的四元组和五元组。然后,我们将该公式扩展到所有CLS任务和多模式NLU任务。在此基础上,我们提出了RexUniNLU,这是一个通用的NLU解决方案,它对IE和CLS采用了显式的模式约束,它涵盖了所有IE和CLS任务,并防止了模式和输入序列之间的错误连接。为了避免不同模式之间的干扰,我们重新设置了位置ID和注意掩码矩阵。对IE、中英文CLS和多情态进行了广泛的实验,表明了该方法的有效性和优越性。我们的代码是公开发布的。

[NLP-28] UPCS: Unbiased Persona Construction for Dialogue Generation
[NLP-28] UPCS:面向对话生成的无偏角色构建

链接: https://arxiv.org/abs/2409.05257
作者: Kuiyun Chen,Yanbin Wei
关键词-EN: enhance personalized interactions, utilize persona profiles, personalized interactions, dialogue and storytelling, enhance personalized
关键词-ZH: 增强个性化互动,利用角色配置文件、个性化互动、对话和讲故事,增强个性化
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Narrative systems, such as dialogue and storytelling systems, often utilize persona profiles to enhance personalized interactions. Existing persona profiles frequently exhibit biases, posing risks to system integrity and fairness. To address this, we introduce the UPCS framework, which categorizes character descriptions into eight dimensions, including bias mitigation strategies. Experimental results demonstrate UPCS’s superiority in accuracy, diversity, bias elimination, and user satisfaction, marking a significant advancement in persona construction for reliable narrative systems.
摘要:对话和讲故事系统等叙事系统通常利用角色配置文件来增强个性化交互。现有的角色配置文件经常表现出偏见,对系统完整性和公平性构成风险。为了解决这个问题,我们引入了UPCS框架,该框架将角色描述分为八个维度,包括偏见缓解策略。实验结果证明了UPCS在准确性、多样性、偏见消除和用户满意度方面的优势,标志着可靠叙事系统角色构建方面的重大进步。

[NLP-29] Socially Responsible Data for Large Multilingual Language Models
[NLP-29] 大型多语言语言模型的社会责任数据

链接: https://arxiv.org/abs/2409.05247
作者: Andrew Smart,Ben Hutchinson,Lameck Mbangula Amugongo,Suzanne Dikker,Alex Zito,Amber Ebinama,Zara Wudiri,Ding Wang,Erin van Liemt,João Sedoc,Seyi Olojo,Stanley Uwakwe,Edem Wornyo,Sonja Schmer-Galunder,Jamila Smith-Loud
关键词-EN: largely English text, Large Language Models, English text, largely English, Large Language
关键词-ZH: 主要是英语文本,大型语言模型,英语文本,主要是英语,大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving for models to accommodate languages of communities outside of the Global North, which include many languages that have been historically underrepresented in digital realms. These languages have been coined as “low resource languages” or “long-tail languages”, and LLMs performance on these languages is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data from languages spoken by previously colonized people, indigenous people, and non-Western languages raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship, and our own work, and outlines several relevant social, cultural, and ethical considerations and potential ways to mitigate them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.
摘要:在过去的三年中,大型语言模型的规模和表观能力迅速增加,但它们的训练数据大多是英文文本。人们对多语言LLM的兴趣与日俱增,各种努力正在努力寻找模式,以适应全球北方以外社区的语言,其中包括许多在数字领域历史上被低估的语言。这些语言被称为“低资源语言”或“长尾语言”,而LLMS在这些语言上的性能通常很差。虽然将LLMS的使用扩大到更多的语言可能会带来许多潜在的好处,如协助跨社区交流和保存语言,但必须非常小心地确保关于这些语言的数据收集不是摘录的,也不是重复过去的剥削做法。从以前被殖民的人、土著人民和非西方语言使用的语言收集数据会引发许多复杂的社会政治和伦理问题,例如,关于同意、文化安全和数据主权。此外,语言的复杂性和文化的细微差别往往在LLMS中被忽略。这份立场文件建立在最近的学术研究和我们自己的工作基础上,并概述了几个相关的社会、文化和伦理考虑因素,以及通过定性研究、社区合作和参与式设计方法来缓解这些问题的潜在方法。我们提供了12条建议,以供在收集全球北部以外未被充分代表的语言社区的语言数据时考虑。

[NLP-30] Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation
[NLP-30] 在微调多语言神经机器翻译中探索内在语言特定子空间

链接: https://arxiv.org/abs/2409.05224
作者: Zhe Cao,Zhi Qu,Hidetaka Kamigaito,Taro Watanabe
关键词-EN: Multilingual neural machine, neural machine translation, machine translation models, translation models support, Multilingual neural
关键词-ZH: 多语言神经机器,神经机器翻译,机器翻译模型,翻译模型支持,多语言神经
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multilingual neural machine translation models support fine-tuning hundreds of languages simultaneously. However, fine-tuning on full parameters solely is inefficient potentially leading to negative interactions among languages. In this work, we demonstrate that the fine-tuning for a language occurs in its intrinsic language-specific subspace with a tiny fraction of entire parameters. Thus, we propose language-specific LoRA to isolate intrinsic language-specific subspaces. Furthermore, we propose architecture learning techniques and introduce a gradual pruning schedule during fine-tuning to exhaustively explore the optimal setting and the minimal intrinsic subspaces for each language, resulting in a lightweight yet effective fine-tuning procedure. The experimental results on a 12-language subset and a 30-language subset of FLORES-101 show that our methods not only outperform full-parameter fine-tuning up to 2.25 spBLEU scores but also reduce trainable parameters to 0.4% for high and medium-resource languages and 1.6% for low-resource ones.
摘要:多语言神经机器翻译模型支持同时对数百种语言进行微调。然而,仅在全部参数上进行微调效率低下,还可能导致语言之间的负面干扰。在这项工作中,我们证明了对某一语言的微调发生在其固有的语言特定子空间中,仅涉及全部参数中很小的一部分。因此,我们提出语言特定的LoRA来隔离固有的语言特定子空间。此外,我们提出了架构学习技术,并在微调过程中引入渐进式剪枝计划,以充分探索每种语言的最优设置和最小固有子空间,从而得到一个轻量而有效的微调流程。在FLORES-101的12种语言子集和30种语言子集上的实验结果表明,我们的方法不仅最高超出全参数微调2.25个spBLEU分数,而且将可训练参数减少到高、中资源语言的0.4%和低资源语言的1.6%。
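The language-specific subspace idea above can be made concrete with a small sketch: a shared weight matrix stays frozen while each language gets its own low-rank (A, B) adapter pair. All names below (dimensions, language codes, initialization scales) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class LanguageSpecificLoRA:
    """Toy sketch: one frozen base weight, one low-rank adapter per language."""

    def __init__(self, d_model, rank, languages, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_model, d_model))  # frozen base weight
        # B is zero-initialized, so each adapter starts as a no-op.
        self.A = {l: rng.standard_normal((rank, d_model)) * 0.01 for l in languages}
        self.B = {l: np.zeros((d_model, rank)) for l in languages}

    def forward(self, x, lang):
        # Only the tiny (A, B) pair for `lang` would be trained; this is the
        # intrinsic language-specific subspace.
        delta = self.B[lang] @ self.A[lang]
        return x @ (self.W + delta).T

layer = LanguageSpecificLoRA(d_model=8, rank=2, languages=["de", "sw"])
x = np.ones((1, 8))
y = layer.forward(x, "de")
```

With rank 2 against a dense 8×8 weight, each language trains only 2×(2×8) = 32 adapter parameters instead of 64 base ones, mirroring the tiny trainable fraction reported above.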

[NLP-31] Interactive Machine Teaching by Labeling Rules and Instances TACL2024
[NLP-31] 通过标注规则和实例进行交互式机器教学

链接: https://arxiv.org/abs/2409.05199
作者: Giannis Karamanolakis,Daniel Hsu,Luis Gravano
关键词-EN: aims to reduce, reduce the cost, rules, supervised learning aims, Weakly supervised
关键词-ZH: 旨在减少、降低成本、规则、监督学习目标、弱监督
类目: Computation and Language (cs.CL)
备注: Accepted to TACL 2024

点击查看摘要

Abstract:Weakly supervised learning aims to reduce the cost of labeling data by using expert-designed labeling rules. However, existing methods require experts to design effective rules in a single shot, which is difficult in the absence of proper guidance and tooling. Therefore, it is still an open question whether experts should spend their limited time writing rules or instead providing instance labels via active learning. In this paper, we investigate how to exploit an expert’s limited time to create effective supervision. First, to develop practical guidelines for rule creation, we conduct an exploratory analysis of diverse collections of existing expert-designed rules and find that rule precision is more important than coverage across datasets. Second, we compare rule creation to individual instance labeling via active learning and demonstrate the importance of both across 6 datasets. Third, we propose an interactive learning framework, INTERVAL, that achieves efficiency by automatically extracting candidate rules based on rich patterns (e.g., by prompting a language model), and effectiveness by soliciting expert feedback on both candidate rules and individual instances. Across 6 datasets, INTERVAL outperforms state-of-the-art weakly supervised approaches by 7% in F1. Furthermore, it requires as few as 10 queries for expert feedback to reach F1 values that existing active learning methods cannot match even with 100 queries.
摘要:弱监督学习旨在通过使用专家设计的标注规则来降低数据标注的成本。然而,现有方法要求专家一次性设计出有效规则,在缺乏适当指导和工具的情况下,这是困难的。因此,专家应当把有限的时间用于编写规则,还是通过主动学习提供实例标签,仍是一个悬而未决的问题。在本文中,我们研究了如何利用专家有限的时间来构建有效的监督。首先,为了制定实用的规则创建指南,我们对现有专家设计规则的多个不同集合进行了探索性分析,发现在各数据集上规则的精确率比覆盖率更重要。其次,我们将规则创建与通过主动学习进行的单个实例标注进行比较,并在6个数据集上证明了两者的重要性。第三,我们提出了一个交互式学习框架INTERVAL,它通过基于丰富模式(例如提示语言模型)自动提取候选规则来实现效率,并通过征求专家对候选规则和单个实例的反馈来实现有效性。在6个数据集上,INTERVAL的F1比最先进的弱监督方法高出7%。此外,它仅需10次专家反馈查询即可达到现有主动学习方法即使使用100次查询也无法企及的F1值。
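The rule precision-versus-coverage trade-off above can be illustrated with a minimal sketch; the keyword rule and the toy labeled examples below are invented for illustration.

```python
def rule_stats(rule, texts, gold, target_label):
    """Precision and coverage of one labeling rule over a labeled sample."""
    fired = [i for i, t in enumerate(texts) if rule(t)]
    coverage = len(fired) / len(texts)  # fraction of instances the rule labels
    if not fired:
        return 0.0, coverage
    precision = sum(gold[i] == target_label for i in fired) / len(fired)
    return precision, coverage

texts = ["great movie", "terrible plot", "great acting", "boring but great ending"]
gold = ["pos", "neg", "pos", "neg"]

# A high-coverage but imprecise rule: label anything containing "great" positive.
precision, coverage = rule_stats(lambda t: "great" in t, texts, gold, "pos")
```

On this toy sample the rule covers 3 of 4 instances but is right on only 2 of the 3 it fires on, the kind of high-coverage/low-precision rule the analysis above argues against.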

[NLP-32] Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?
[NLP-32] 多跳推理中看似合理的干扰因素:大型语言模型是细心的读者吗?

链接: https://arxiv.org/abs/2409.05197
作者: Neeladri Bhuiya,Viktor Schlegel,Stefan Winkler
关键词-EN: Large Language Models, possessing scientific knowledge, Large Language, multi-hop reasoning, ranging from reading
关键词-ZH: 大型语言模型,拥有科学知识,大型语言,多跳推理,从阅读
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures

点击查看摘要

Abstract:State-of-the-art Large Language Models (LLMs) are accredited with an increasing number of different capabilities, ranging from reading comprehension, over advanced mathematical and reasoning skills to possessing scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given the concerns with the presence of simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we set out to investigate, whether LLMs are prone to exploiting such simplifying cues. We find evidence that they indeed circumvent the requirement to perform multi-hop reasoning, but they do so in more subtle ways than what was reported about their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark, by generating seemingly plausible multi-hop reasoning chains, which ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs, and find that their performance to perform multi-hop reasoning is affected, as indicated by up to 45% relative decrease in F1 score when presented with such seemingly plausible alternatives. We conduct a deeper analysis and find evidence that while LLMs tend to ignore misleading lexical cues, misleading reasoning paths indeed present a significant challenge.
摘要:最先进的大型语言模型(LLM)被认为具备越来越多的不同能力,从阅读理解、高级数学与推理技能到掌握科学知识。本文关注它们的多跳推理能力:识别并整合来自多个文本来源的信息的能力。鉴于现有多跳推理基准中存在可让模型绕过推理要求的简化线索,我们着手研究LLM是否倾向于利用这类简化线索。我们发现证据表明,它们确实绕过了执行多跳推理的要求,但方式比此前报道的经过微调的预训练语言模型(PLM)更为微妙。受此发现启发,我们提出了一个具有挑战性的多跳推理基准:生成看似合理、却最终导向错误答案的多跳推理链。我们评估了多个开源和专有的最先进LLM,发现在面对这类看似合理的替代推理链时,它们执行多跳推理的性能受到影响,F1分数最高相对下降45%。我们进一步分析发现,虽然LLM往往能忽略误导性的词汇线索,但误导性的推理路径确实构成重大挑战。

[NLP-33] OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
[NLP-33] OneGen:针对LLM的高效一遍统一生成和检索

链接: https://arxiv.org/abs/2409.05152
作者: Jintian Zhang,Cheng Peng,Mengshu Sun,Xiang Chen,Lei Liang,Zhiqiang Zhang,Jun Zhou,Huajun Chen,Ningyu Zhang
关键词-EN: Large Language Models, Language Models, Large Language, advancements in Large, directly handling retrieval
关键词-ZH: 大型语言模型,语言模型,大型语言,大型进步,直接处理检索
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Work in progress; code is available at this https URL

点击查看摘要

Abstract:Despite the recent advancements in Large Language Models (LLMs), which have significantly enhanced the generative capabilities for various NLP tasks, LLMs still face limitations in directly handling retrieval tasks. However, many practical applications demand the seamless integration of both retrieval and generation. This paper introduces a novel and efficient One-pass Generation and retrieval framework (OneGen), designed to improve LLMs’ performance on tasks that require both generation and retrieval. The proposed framework bridges the traditionally separate training approaches for generation and retrieval by incorporating retrieval tokens generated autoregressively. This enables a single LLM to handle both tasks simultaneously in a unified forward pass. We conduct experiments on two distinct types of composite tasks, RAG and Entity Linking, to validate the pluggability, effectiveness, and efficiency of OneGen in training and inference. Furthermore, our results show that integrating generation and retrieval within the same context preserves the generative capabilities of LLMs while improving retrieval performance. To the best of our knowledge, OneGen is the first to enable LLMs to conduct vector retrieval during the generation.
摘要:尽管大型语言模型(LLM)的最新进展显著增强了各类NLP任务的生成能力,但LLM在直接处理检索任务方面仍然面临限制。然而,许多实际应用需要检索和生成的无缝集成。本文介绍了一种新颖高效的一遍生成和检索框架(OneGen),旨在提高LLM在同时需要生成和检索的任务上的性能。该框架通过引入自回归生成的检索标记,将传统上彼此分离的生成训练与检索训练连接起来,使单个LLM能够在统一的前向传递中同时处理这两类任务。我们在RAG和实体链接两种不同类型的复合任务上进行了实验,以验证OneGen在训练和推理方面的可插拔性、有效性和效率。此外,我们的结果表明,在相同上下文中整合生成和检索,既保持了LLM的生成能力,又提高了检索性能。据我们所知,OneGen是首个使LLM能够在生成过程中进行向量检索的方法。

[NLP-34] Better Spanish Emotion Recognition In-the-wild: Bringing Attention to Deep Spectrum Voice Analysis
[NLP-34] 更好的野外西班牙情感识别:关注深频谱语音分析

链接: https://arxiv.org/abs/2409.05148
作者: Elena Ortega-Beltrán,Josep Cabacas-Maso,Ismael Benito-Altamirano,Carles Ventura
关键词-EN: Socially Assistive Robots, Socially Assistive, Assistive Robots, key development factor, user emotional state
关键词-ZH: 社交辅助机器人,社交辅助,辅助机器人,关键发展因素,用户情绪状态
类目: ound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Within the context of creating new Socially Assistive Robots, emotion recognition has become a key development factor, as it allows the robot to adapt to the user’s emotional state in the wild. In this work, we focused on the analysis of two voice recording Spanish datasets: ELRA-S0329 and EmoMatchSpanishDB. Specifically, we centered our work in the paralanguage, e.g., the vocal characteristics that go along with the message and clarify the meaning. We proposed the use of the DeepSpectrum method, which consists of extracting a visual representation of the audio tracks and feeding them to a pretrained CNN model. For the classification task, DeepSpectrum is often paired with a Support Vector Classifier (DS-SVC) or a Fully-Connected deep-learning classifier (DS-FC). We compared the results of the DS-SVC and DS-FC architectures with the state-of-the-art (SOTA) for ELRA-S0329 and EmoMatchSpanishDB. Moreover, we proposed our own classifier based upon Attention Mechanisms, namely DS-AM. We trained all models against both datasets, and we found that our DS-AM model outperforms the SOTA models for the datasets and the SOTA DeepSpectrum architectures. Finally, we trained our DS-AM model in one dataset and tested it in the other, to simulate real-world conditions on how biased is the model to the dataset.
摘要:在打造新型社交辅助机器人的背景下,情绪识别已成为一个关键的开发因素,因为它使机器人能够适应用户在真实环境中的情绪状态。在这项工作中,我们重点分析了两个西班牙语语音数据集:ELRA-S0329和EmoMatchSpanishDB。具体而言,我们的工作聚焦于副语言,即伴随话语内容、起澄清含义作用的声音特征。我们提出使用DeepSpectrum方法,即提取音频轨道的视觉表示并将其输入预训练的CNN模型。在分类任务中,DeepSpectrum通常与支持向量分类器(DS-SVC)或全连接深度学习分类器(DS-FC)配合使用。我们将DS-SVC和DS-FC架构的结果与ELRA-S0329和EmoMatchSpanishDB上的最新技术(SOTA)进行了比较。此外,我们还提出了基于注意力机制的分类器DS-AM。我们在两个数据集上训练了所有模型,发现我们的DS-AM模型优于各数据集上的SOTA模型和SOTA DeepSpectrum架构。最后,我们在一个数据集上训练DS-AM模型并在另一个数据集上测试,以模拟真实条件下模型对数据集的偏倚程度。
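The DeepSpectrum front end described above (waveform → spectrogram image → pretrained CNN) can be sketched as follows; the frame/hop sizes and the toy 440 Hz tone are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def spectrogram(signal, frame=256, hop=128):
    """Log-magnitude STFT: the 2-D 'image' a pretrained CNN could consume."""
    window = np.hanning(frame)
    frames = np.stack([signal[i:i + frame] * window
                       for i in range(0, len(signal) - frame + 1, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
    return np.log1p(mags)                       # log compression, roughly dB-like

sr = 8000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)              # one second of a 440 Hz tone
spec = spectrogram(wave)                        # shape: (time_frames, freq_bins)
```

The pure tone lights up a single frequency band across all time frames; real speech would produce the richer patterns that the downstream DS-SVC/DS-FC/DS-AM classifiers discriminate on.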

[NLP-35] READoc: A Unified Benchmark for Realistic Document Structured Extraction
[NLP-35] READoc:真实文档结构化提取的统一基准

链接: https://arxiv.org/abs/2409.05137
作者: Zichao Li,Aizier Abulaiti,Yaojie Lu,Xuanang Chen,Jia Zheng,Hongyu Lin,Xianpei Han,Le Sun
关键词-EN: Document Structured Extraction, extract structured content, Structured Extraction, extract structured, structured content
关键词-ZH: 文档结构化提取,提取结构化内容,结构化提取,提取结构化,结构化内容
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 2,233 diverse and real-world documents from arXiv and GitHub. In addition, we develop a DSE Evaluation S^3uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general VLMs, we identify the gap between current work and the unified, realistic DSE objective for the first time. We aspire that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.
摘要:文档结构化抽取(DSE)旨在从原始文档中抽取结构化内容。尽管出现了许多DSE系统,但对其统一评估仍然不足,严重阻碍了该领域的进步。这一问题在很大程度上归因于现有基准范式的碎片化和局部化特点。为了解决这些限制并对DSE系统进行全面评估,我们引入了一个名为READoc的新基准,它将DSE定义为将非结构化PDF转换为语义丰富的Markdown这一现实任务。READoc数据集来自arXiv和GitHub上的2,233篇多样化的真实文档。此外,我们还开发了一个包含标准化、切分和评分模块的DSE评估套件S^3uite,用于对最先进的DSE方法进行统一评估。通过评估一系列流水线工具、专家视觉模型和通用VLM,我们首次确定了当前工作与统一、现实的DSE目标之间的差距。我们希望READoc能够推动未来的DSE研究,促进更全面、更实用的解决方案。

[NLP-36] MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework
[NLP-36] MHS-STMA:通过基于可扩展转换器的多层注意力框架进行多模式仇恨语音检测

链接: https://arxiv.org/abs/2409.05136
作者: Anusha Chhabra,Dinesh Kumar Vishwakarma
关键词-EN: Social media, people lives, significant impact, impact on people, Social
关键词-ZH: 社交媒体,人们的生活,重大影响,对人们的影响,社交
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Social media has a significant impact on people’s lives. Hate speech on social media has emerged as one of society’s most serious issues recently. Text and pictures are two forms of multimodal data distributed within articles. Unimodal analysis has been the primary emphasis of earlier approaches. Additionally, when doing multimodal analysis, researchers neglect to preserve the distinctive qualities associated with each modality. The present article suggests a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA) to address these shortcomings. This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses various attention processes and uniquely handles multimodal data. Several studies employing multiple assessment criteria on three hate speech datasets: Hateful memes, MultiOff, and MMHS150K, validate the suggested architecture’s efficacy. The outcomes demonstrate that on all three datasets, the suggested strategy performs better than the baseline approaches.
摘要:社交媒体对人们的生活产生了重大影响。社交媒体上的仇恨言论近来已成为社会最严重的问题之一。文本和图片是文章中分布的两种多模态数据形式。早期方法主要侧重于单模态分析。此外,在进行多模态分析时,研究人员往往忽略了保留各模态特有的性质。本文提出了一种可扩展的多模态仇恨内容检测架构,称为基于Transformer的可扩展多级注意力(STMA),以解决这些不足。该架构由三个主要部分组成:基于注意力的组合深度学习机制、视觉注意力机制编码器和字幕注意力机制编码器。为了识别仇恨内容,每个组件使用不同的注意力过程,并以独有的方式处理多模态数据。在三个仇恨言论数据集(Hateful Memes、MultiOff和MMHS150K)上采用多种评估标准的若干实验验证了所提架构的有效性。结果表明,在所有三个数据集上,所提策略均优于基线方法。
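As a rough illustration of the attention building block behind the modules above, here is a minimal single-head sketch; the shapes are illustrative, and this is not the STMA architecture itself.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, 4)) for n in (2, 3, 3))
out, w = attention(Q, K, V)
```

STMA applies mechanisms of this shape at multiple levels, with separate encoders attending over the visual and caption modalities.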

[NLP-37] Hate Content Detection via Novel Pre-Processing Sequencing and Ensemble Methods
[NLP-37] 通过新型预处理测序和Ensemble方法检测仇恨内容

链接: https://arxiv.org/abs/2409.05134
作者: Anusha Chhabra,Dinesh Kumar Vishwakarma
关键词-EN: Social media, hate speech, identifying hate speech, significant increase, increase in incidents
关键词-ZH: 社交媒体、仇恨言论、识别仇恨言论、显着增加、事件增加
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Social media, particularly Twitter, has seen a significant increase in incidents like trolling and hate speech. Thus, identifying hate speech is the need of the hour. This paper introduces a computational framework to curb the hate content on the web. Specifically, this study presents an exhaustive study of pre-processing approaches by studying the impact of changing the sequence of text pre-processing operations for the identification of hate content. The best-performing pre-processing sequence, when implemented with popular classification approaches like Support Vector Machine, Random Forest, Decision Tree, Logistic Regression and K-Neighbor provides a considerable boost in performance. Additionally, the best pre-processing sequence is used in conjunction with different ensemble methods, such as bagging, boosting and stacking to improve the performance further. Three publicly available benchmark datasets (WZ-LS, DT, and FOUNTA), were used to evaluate the proposed approach for hate speech identification. The proposed approach achieves a maximum accuracy of 95.14% highlighting the effectiveness of the unique pre-processing approach along with an ensemble classifier.
摘要:社交媒体(尤其是Twitter)上的恶意挑衅(trolling)和仇恨言论等事件显著增加。因此,识别仇恨言论是当务之急。本文介绍了一种遏制网络仇恨内容的计算框架。具体而言,本研究通过考察改变文本预处理操作顺序对仇恨内容识别的影响,对预处理方法进行了详尽研究。当与支持向量机、随机森林、决策树、逻辑回归和K近邻等流行分类方法结合使用时,性能最佳的预处理序列可带来可观的性能提升。此外,最佳预处理序列还与装袋(bagging)、提升(boosting)和堆叠(stacking)等不同集成方法相结合,以进一步提高性能。我们使用三个公开的基准数据集(WZ-LS、DT和FOUNTA)来评估所提出的仇恨言论识别方法。该方法的最高准确率达到95.14%,突出了这一独特预处理方法与集成分类器相结合的有效性。
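The claim that pre-processing *order* matters can be demonstrated with a toy sketch: the two pipelines below differ only in when punctuation is stripped, yet yield different token streams for the same input. The steps and the example tweet are invented for illustration.

```python
import re
import string

def lowercase(text):
    return text.lower()

def strip_urls(text):
    return re.sub(r"https?://\S+", "", text)

def strip_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def run_pipeline(text, steps):
    for step in steps:
        text = step(text)
    return text.split()

tweet = "Check http://spam.example NOW!!!"
tokens_a = run_pipeline(tweet, [strip_urls, strip_punct, lowercase])  # URL removed first
tokens_b = run_pipeline(tweet, [strip_punct, strip_urls, lowercase])  # punctuation first
```

Stripping punctuation first destroys the `://` the URL regex depends on, so the second ordering leaks URL debris into the token stream — exactly the kind of sequencing effect the study above measures.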

[NLP-38] WaterSeeker: Efficient Detection of Watermarked Segments in Large Documents
[NLP-38] WaterSeeker:有效检测大型文档中的水印片段

链接: https://arxiv.org/abs/2409.05112
作者: Leyi Pan,Aiwei Liu,Yijian Lu,Zitian Gao,Yichen Di,Lijie Wen,Irwin King,Philip S. Yu
关键词-EN: Watermarking algorithms, large language models, detecting LLM-generated text, attained high accuracy, language models
关键词-ZH: 水印算法、大型语言模型、检测LLM生成的文本、达到高准确度、语言模型
类目: Computation and Language (cs.CL)
备注: 18 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Watermarking algorithms for large language models (LLMs) have attained high accuracy in detecting LLM-generated text. However, existing methods primarily focus on distinguishing fully watermarked text from non-watermarked text, overlooking real-world scenarios where LLMs generate only small sections within large documents. In this scenario, balancing time complexity and detection performance poses significant challenges. This paper presents WaterSeeker, a novel approach to efficiently detect and locate watermarked segments amid extensive natural text. It first applies an efficient anomaly extraction method to preliminarily locate suspicious watermarked regions. Following this, it conducts a local traversal and performs full-text detection for more precise verification. Theoretical analysis and experimental results demonstrate that WaterSeeker achieves a superior balance between detection accuracy and computational efficiency. Moreover, WaterSeeker’s localization ability supports the development of interpretable AI detection systems. This work pioneers a new direction in watermarked segment detection, facilitating more reliable AI-generated content identification.
摘要:针对大语言模型(LLM)的水印算法在检测LLM生成文本方面已达到较高的准确率。然而,现有方法主要关注区分完全带水印的文本与无水印文本,而忽略了LLM仅在大文档中生成小段内容的真实场景。在这种场景下,平衡时间复杂度与检测性能是一大挑战。本文提出WaterSeeker,一种在大量自然文本中高效检测和定位水印片段的新方法。它首先采用高效的异常提取方法初步定位可疑的水印区域,随后进行局部遍历并执行全文检测,以实现更精确的验证。理论分析和实验结果表明,WaterSeeker在检测精度和计算效率之间取得了更优的平衡。此外,WaterSeeker的定位能力有助于开发可解释的AI检测系统。这项工作开创了水印片段检测的新方向,有助于实现更可靠的AI生成内容识别。
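The two-stage detect-then-refine idea above can be sketched as follows; the per-token scores, window size, and threshold are invented for illustration and are not WaterSeeker's actual statistics.

```python
def locate_watermark(scores, window=5, threshold=0.7):
    """Coarse sliding-window scan, then local refinement of segment edges."""
    # Stage 1: coarse scan — flag windows whose mean score looks watermarked.
    suspects = [i for i in range(len(scores) - window + 1)
                if sum(scores[i:i + window]) / window > threshold]
    if not suspects:
        return None
    start, end = suspects[0], suspects[-1] + window
    # Stage 2: local refinement — trim low-scoring tokens at the edges.
    while start < end and scores[start] <= threshold:
        start += 1
    while end > start and scores[end - 1] <= threshold:
        end -= 1
    return start, end

scores = [0.1, 0.2, 0.1, 0.9, 0.95, 0.9, 0.85, 0.9, 0.2, 0.1]
segment = locate_watermark(scores)  # watermarked tokens sit at indices 3..7
```

The coarse pass costs one mean per window, so most of the document is cleared cheaply; only flagged regions pay for the finer boundary search, which is the efficiency argument made above.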

[NLP-39] EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction
[NLP-39] EdaCSC:两种用于中文拼写纠正的简单数据增强方法

链接: https://arxiv.org/abs/2409.05105
作者: Lei Sheng,Shuai-Shuai Xu
关键词-EN: Chinese Spelling Correction, correct spelling errors, Spelling Correction, Chinese sentences caused, Chinese Spelling
关键词-ZH: 中文拼写纠正,纠正拼写错误,拼写纠正,引起的中文句子,中文拼写
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 2 figures

点击查看摘要

Abstract:Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. While current CSC models integrate pinyin or glyph features and have shown significant progress,they still face challenges when dealing with sentences containing multiple typos and are susceptible to overcorrection in real-world scenarios. In contrast to existing model-centric approaches, we propose two data augmentation methods to address these limitations. Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos. Subsequently, we employ different training processes to select the optimal model. Experimental evaluations on the SIGHAN benchmarks demonstrate the superiority of our approach over most existing models, achieving state-of-the-art performance on the SIGHAN15 test set.
摘要:中文拼写纠错(CSC)旨在检测并纠正中文句子中由语音或字形相似性引起的拼写错误。虽然当前的CSC模型融合了拼音或字形特征并取得了显著进展,但在处理包含多个错别字的句子时仍面临挑战,并且在真实场景中容易出现过度纠正。与现有以模型为中心的方法不同,我们提出两种数据增强方法来解决这些局限。首先,我们通过将长句拆分为较短的句子,或减少含多个错别字句子中的错别字数量,来扩充数据集。随后,我们采用不同的训练流程来选择最优模型。在SIGHAN基准上的实验评估证明了我们的方法优于大多数现有模型,在SIGHAN15测试集上实现了最先进的性能。
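The first augmentation (splitting long sentences so each sample carries fewer typos) can be sketched as follows; the punctuation-based splitting rule and length limit are illustrative assumptions, not the paper's exact procedure.

```python
import re

def split_long_sentence(sentence, max_len=10):
    """Split a long sentence into clauses at Chinese punctuation marks."""
    if len(sentence) <= max_len:
        return [sentence]
    # Split on clause punctuation, keeping each delimiter attached.
    return re.findall(r"[^,。!?;]+[,。!?;]?", sentence)

sent = "今天天汽很好,我们去公圆散步,然后吃午饭。"  # contains the typos 汽/圆
pieces = split_long_sentence(sent)
```

Each resulting clause now contains at most one typo, giving the corrector shorter, easier training samples while preserving the original text when the pieces are re-joined.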

[NLP-40] LLM-based Abstraction and Concretization for GUI Test Migration
[NLP-40] 基于LLM的图形用户界面测试迁移的抽象和具体化

链接: https://arxiv.org/abs/2409.05028
作者: Yakun Zhang,Chen Liu,Xiaofei Xie,Yun Lin,Jin Song Dong,Dan Hao,Lu Zhang
关键词-EN: GUI test, GUI test case, GUI test migration, test cases, test
关键词-ZH: 图形界面测试,图形界面测试,图形界面测试迁移,测试用例,测试
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:GUI test migration aims to produce test cases with events and assertions to test specific functionalities of a target app. Existing migration approaches typically focus on the widget-mapping paradigm that maps widgets from source apps to target apps. However, since different apps may implement the same functionality in different ways, direct mapping may result in incomplete or buggy test cases, thus significantly impacting the effectiveness of testing target functionality and the practical applicability. In this paper, we propose a new migration paradigm (i.e., abstraction-concretization paradigm) that first abstracts the test logic for the target functionality and then utilizes this logic to generate the concrete GUI test case. Furthermore, we introduce MACdroid, the first approach that migrates GUI test cases based on this paradigm. Specifically, we propose an abstraction technique that utilizes source test cases from source apps targeting the same functionality to extract a general test logic for that functionality. Then, we propose a concretization technique that utilizes the general test logic to guide an LLM in generating the corresponding GUI test case (including events and assertions) for the target app. We evaluate MACdroid on two widely-used datasets (including 31 apps, 34 functionalities, and 123 test cases). On the FrUITeR dataset, the test cases generated by MACdroid successfully test 64% of the target functionalities, improving the baselines by 191%. On the Lin dataset, MACdroid successfully tests 75% of the target functionalities, outperforming the baselines by 42%. These results underscore the effectiveness of MACdroid in GUI test migration.
摘要:GUI测试迁移旨在生成带有事件和断言的测试用例,以测试目标应用的特定功能。现有迁移方法通常采用微件映射范式,即将微件从源应用映射到目标应用。然而,由于不同应用可能以不同方式实现相同功能,直接映射可能导致测试用例不完整或存在缺陷,从而严重影响目标功能测试的有效性和实际适用性。本文提出一种新的迁移范式(即抽象-具体化范式):先抽象出目标功能的测试逻辑,再利用该逻辑生成具体的GUI测试用例。此外,我们提出了MACdroid,这是首个基于该范式迁移GUI测试用例的方法。具体而言,我们提出一种抽象技术,利用源应用中针对相同功能的源测试用例,提取该功能的通用测试逻辑;随后提出一种具体化技术,利用通用测试逻辑指导LLM为目标应用生成相应的GUI测试用例(包括事件和断言)。我们在两个广泛使用的数据集(包括31个应用、34项功能和123个测试用例)上评估了MACdroid。在FrUITeR数据集上,MACdroid生成的测试用例成功测试了64%的目标功能,比基线提高了191%。在Lin数据集上,MACdroid成功测试了75%的目标功能,比基线高出42%。这些结果突显了MACdroid在GUI测试迁移中的有效性。

[NLP-41] Vision-fused Attack: Advancing Aggressive and Stealthy Adversarial Text against Neural Machine Translation IJCAI2024
[NLP-41] 视觉融合攻击:推进攻击性和隐形对抗文本对抗神经机器翻译

链接: https://arxiv.org/abs/2409.05021
作者: Yanni Xue,Haojie Hao,Jiakai Wang,Qiang Sheng,Renshuai Tao,Yu Liang,Pu Feng,Xianglong Liu
关键词-EN: neural machine translation, models achieve success, enhancing NMT models, machine translation, daily lives
关键词-ZH: 神经机器翻译、模型取得成功、增强NMT模型、机器翻译、日常生活
类目: Computation and Language (cs.CL)
备注: IJCAI 2024

点击查看摘要

Abstract:While neural machine translation (NMT) models achieve success in our daily lives, they show vulnerability to adversarial attacks. Despite being harmful, these attacks also offer benefits for interpreting and enhancing NMT models, thus drawing increased research attention. However, existing studies on adversarial attacks are insufficient in both attacking ability and human imperceptibility due to their sole focus on the scope of language. This paper proposes a novel vision-fused attack (VFA) framework to acquire powerful adversarial text, i.e., more aggressive and stealthy. Regarding the attacking ability, we design the vision-merged solution space enhancement strategy to enlarge the limited semantic solution space, which enables us to search for adversarial candidates with higher attacking ability. For human imperceptibility, we propose the perception-retained adversarial text selection strategy to align the human text-reading mechanism. Thus, the finally selected adversarial text could be more deceptive. Extensive experiments on various models, including large language models (LLMs) like LLaMA and GPT-3.5, strongly support that VFA outperforms the comparisons by large margins (up to 81%/14% improvements on ASR/SSIM).
摘要:神经机器翻译(NMT)模型在日常生活中取得成功的同时,也暴露出易受对抗攻击的问题。这些攻击尽管有害,但也为解释和增强NMT模型提供了助益,因而吸引了越来越多的研究关注。然而,现有对抗攻击研究仅关注语言层面,在攻击能力和人类不可感知性方面均有不足。本文提出一种新颖的视觉融合攻击(VFA)框架,以获得更具攻击性和隐蔽性的强力对抗文本。在攻击能力方面,我们设计了视觉融合的解空间增强策略来扩大有限的语义解空间,从而能够搜索到攻击能力更强的对抗候选。在人类不可感知性方面,我们提出保留感知的对抗文本选择策略,以契合人类的文本阅读机制,使最终选出的对抗文本更具欺骗性。在多种模型(包括LLaMA和GPT-3.5等大语言模型)上的大量实验有力地表明,VFA大幅优于对比方法(在ASR/SSIM上最高提升81%/14%)。

[NLP-42] Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector ICASSP2025
[NLP-42] 迈向中文视频中的居高临下语言:多模态数据集与检测器

链接: https://arxiv.org/abs/2409.05005
作者: Hongbo Wang,Junyu Lu,Yan Han,Liang Yang,Hongfei Lin
关键词-EN: Patronizing and Condescending, Condescending Language, targeting vulnerable groups, toxic speech targeting, threatening both online
关键词-ZH: 光顾和居高临下,居高临下的语言,针对弱势群体,有毒言论,在网上威胁两者
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Under review in ICASSP 2025

点击查看摘要

Abstract:Patronizing and Condescending Language (PCL) is a form of discriminatory toxic speech targeting vulnerable groups, threatening both online and offline safety. While toxic speech research has mainly focused on overt toxicity, such as hate speech, microaggressions in the form of PCL remain underexplored. Additionally, dominant groups’ discriminatory facial expressions and attitudes toward vulnerable communities can be more impactful than verbal cues, yet these frame features are often overlooked. In this paper, we introduce the PCLMM dataset, the first Chinese multimodal dataset for PCL, consisting of 715 annotated videos from Bilibili, with high-quality PCL facial frame spans. We also propose the MultiPCL detector, featuring a facial expression detection module for PCL recognition, demonstrating the effectiveness of modality complementarity in this challenging task. Our work makes an important contribution to advancing microaggression detection within the domain of toxic speech.
摘要:居高临下语言(PCL)是一种针对弱势群体的歧视性有毒言论,威胁着线上和线下安全。现有有毒言论研究主要集中于仇恨言论等显性毒性,PCL形式的微冒犯仍未得到充分探索。此外,主导群体针对弱势群体的歧视性面部表情和态度可能比言语线索更具影响力,但这些画面特征往往被忽视。本文介绍了PCLMM数据集,这是首个面向PCL的中文多模态数据集,由来自Bilibili的715个带注释视频组成,并带有高质量的PCL面部画面区间标注。我们还提出了MultiPCL检测器,其包含用于PCL识别的面部表情检测模块,展示了模态互补在这一挑战性任务中的有效性。我们的工作为推进有毒言论领域的微冒犯检测做出了重要贡献。

[NLP-43] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
[NLP-43] InstInfer:存储中注意力卸载,实现经济高效的长上下文LLM推理

链接: https://arxiv.org/abs/2409.04992
作者: Xiurui Pan,Endian Li,Qiao Li,Shengwen Liang,Yizhou Shan,Ke Zhou,Yingwei Luo,Xiaolin Wang,Jie Zhang
关键词-EN: Large Language Models, Large Language, widespread of Large, Language Models, milestone in generative
关键词-ZH: 大型语言模型,大型语言,大型语言的广泛,语言模型,生成式的里程碑
类目: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The widespread adoption of Large Language Models (LLMs) marks a significant milestone in generative AI. Nevertheless, the increasing context length and batch size in offline LLM inference escalate the memory requirement of the key-value (KV) cache, which imposes a huge burden on the GPU VRAM, especially for resource-constrained scenarios (e.g., edge computing and personal devices). Several cost-effective solutions leverage host memory or SSDs to reduce storage costs for offline inference scenarios and improve the throughput. Nevertheless, they suffer from significant performance penalties imposed by intensive KV cache accesses due to limited PCIe bandwidth. To address these issues, we propose InstInfer, a novel LLM inference system that offloads the most performance-critical computation (i.e., attention in decoding phase) and data (i.e., KV cache) parts to Computational Storage Drives (CSDs), which minimize the enormous KV transfer overheads. InstInfer designs a dedicated flash-aware in-storage attention engine with KV cache management mechanisms to exploit the high internal bandwidths of CSDs instead of being limited by the PCIe bandwidth. The optimized P2P transmission between GPU and CSDs further reduces data migration overheads. Experimental results demonstrate that for a 13B model using an NVIDIA A6000 GPU, InstInfer improves throughput for long-sequence inference by up to 11.1×, compared to existing SSD-based solutions such as FlexGen.
摘要:大型语言模型(LLM)的广泛使用标志着生成式人工智能的一个重要里程碑。然而,离线LLM推理中不断增加的上下文长度和批处理大小增加了对键值(KV)缓存的存储需求,这给GPU VRAM带来了巨大的负担,特别是对于资源受限的场景(例如,边缘计算和个人设备)。几种经济高效的解决方案利用主机内存或SSD来降低离线推理场景的存储成本并提高吞吐量。然而,由于PCIe带宽有限,它们会受到密集KV缓存访问带来的显著性能损失。为了解决这些问题,我们提出了一种新颖的LLM推理系统InstInfer,它将最关键的计算(即解码阶段的注意力)和数据(即KV缓存)部分卸载到计算存储驱动器(CSD),从而将巨大的KV传输开销降至最低。InstInfer设计了具有KV缓存管理机制的专用闪存感知存储内注意力引擎,以利用CSD的高内部带宽,而不是受PCIe带宽的限制。GPU和CSD之间优化的P2P传输进一步减少了数据迁移开销。实验结果表明,对于使用NVIDIA A6000 GPU的13B模型,与现有的基于固态硬盘的解决方案(如FlexGen)相比,InstInfer将长序列推理的吞吐量最高提升了11.1倍。
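A back-of-envelope calculation shows why the KV cache dominates memory at long context lengths; the model shape below is an assumed 13B-class configuration (40 layers, 40 heads, head dim 128, fp16), not InstInfer's measured setup.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, dtype_bytes=2):
    """One key and one value vector per layer, head, and token position."""
    return 2 * layers * heads * head_dim * seq_len * batch * dtype_bytes

gib = kv_cache_bytes(layers=40, heads=40, head_dim=128,
                     seq_len=4096, batch=8) / 2**30
# A 4K-context, batch-8 workload already needs tens of GiB for the cache
# alone, which is why offloading it toward storage becomes attractive.
```

Doubling either the context length or the batch size doubles the cache linearly, so the pressure on GPU VRAM grows much faster than the (fixed) model weights.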

[NLP-44] Evaluation of Google Translate for Mandarin Chinese translation using sentiment and semantic analysis
[NLP-44] 使用情感和语义分析评估Google Translate的中文普通话翻译

链接: https://arxiv.org/abs/2409.04964
作者: Xuechun Wang,Rodney Beard,Rohitash Chandra
关键词-EN: significant global impact, making communication easier, Google Translate, global impact, large language models
关键词-ZH: 显着的全球影响,使沟通更容易,谷歌翻译,全球影响,大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine translation using large language models (LLMs) is having a significant global impact, making communication easier. Mandarin Chinese is the official language used for communication by the government, education institutes, and media in China. In this study, we provide an automated assessment of machine translation models with human experts using sentiment and semantic analysis. In order to demonstrate our framework, we select the classic early twentieth-century novel ‘The True Story of Ah Q’ with selected Mandarin Chinese to English translations. We also use Google Translate to generate the given text into English and then conduct a chapter-wise sentiment analysis and semantic analysis to compare the extracted sentiments across the different translations. We utilise LLMs for semantic and sentiment analysis. Our results indicate that the precision of Google Translate differs both in terms of semantic and sentiment analysis when compared to human expert translations. We find that Google Translate is unable to translate some of the specific words or phrases in Chinese, such as Chinese traditional allusions. The mistranslations are due to its lack of contextual significance and historical knowledge of China. Thus, this framework brought us some new insights about machine translation for Mandarin Chinese. The future work can explore other languages or types of texts with this framework.
摘要:使用大语言模型(LLM)的机器翻译正在产生重大的全球影响,使交流更加便捷。普通话是中国政府、教育机构和媒体进行交流的官方语言。在这项研究中,我们利用情感分析和语义分析,对机器翻译模型与人类专家译文进行自动化评估。为了演示我们的框架,我们选取了二十世纪早期经典小说《阿Q正传》及其若干汉译英译本。我们还使用Google翻译将给定文本译成英文,然后进行逐章的情感分析和语义分析,以比较不同译本中提取的情感。我们使用LLM进行语义和情感分析。结果表明,与人类专家译文相比,Google翻译在语义和情感分析两方面的精确度都存在差异。我们发现Google翻译无法翻译中文中的某些特定词语,例如中国传统典故。这些误译源于其缺乏上下文语境和中国历史知识。因此,该框架为普通话机器翻译带来了一些新的见解。未来工作可以在此框架下探索其他语言或文本类型。
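The chapter-wise sentiment comparison can be sketched with a tiny hand-made polarity lexicon standing in for an LLM; the lexicon entries and the two toy "translations" below are invented for illustration.

```python
# Toy polarity lexicon: +1 positive, -1 negative (illustrative entries only).
LEXICON = {"victory": 1, "good": 1, "proud": 1,
           "beaten": -1, "miserable": -1, "poor": -1}

def polarity(text):
    """Average polarity of the lexicon words found in a chapter's text."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

human = "ah q claimed a spiritual victory though he was beaten"
machine = "ah q said he lost but felt miserable after being beaten"

scores = (polarity(human), polarity(machine))
```

A divergence between the two scores for the same chapter is the kind of signal the framework above uses to flag where a machine translation drifts in tone from the human reference.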

[NLP-45] Maximizing Relation Extraction Potential: A Data-Centric Study to Unveil Challenges and Opportunities
[NLP-45] 最大限度地发挥关系提取潜力:一项以数据为中心的研究揭示挑战和机遇

链接: https://arxiv.org/abs/2409.04934
作者: Anushka Swarup,Avanti Bhandarkar,Olivia P. Dizon-Paradis,Ronald Wilson,Damon L. Woodard
关键词-EN: Natural Language Processing, Processing task aiming, Language Processing task, Processing task, Natural Language
关键词-ZH: 自然语言处理,处理任务目标,语言处理任务,处理任务,自然语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Relation extraction is a Natural Language Processing task aiming to extract relationships from textual data. It is a critical step for information extraction. Due to its wide-scale applicability, research in relation extraction has rapidly scaled to using highly advanced neural networks. Despite their computational superiority, modern relation extractors fail to handle complicated extraction scenarios. However, a comprehensive performance analysis of the state-of-the-art relation extractors that compile these challenges has been missing from the literature, and this paper aims to bridge this gap. The goal has been to investigate the possible data-centric characteristics that impede neural relation extraction. Based on extensive experiments conducted using 15 state-of-the-art relation extraction algorithms ranging from recurrent architectures to large language models and seven large-scale datasets, this research suggests that modern relation extractors are not robust to complex data and relation characteristics. It emphasizes pivotal issues, such as contextual ambiguity, correlating relations, long-tail data, and fine-grained relation distributions. In addition, it sets a marker for future directions to alleviate these issues, thereby proving to be a critical resource for novice and advanced researchers. Efficient handling of the challenges described can have significant implications for the field of information extraction, which is a critical part of popular systems such as search engines and chatbots. Data and relevant code can be found at this https URL.
摘要:关系抽取是一项旨在从文本数据中抽取关系的自然语言处理任务。这是信息抽取的关键一步。由于其广泛的适用性,关系提取的研究已经迅速扩展到使用高度先进的神经网络。尽管现代关系抽取器在计算上具有优势,但它不能处理复杂的抽取场景。然而,文献中没有对汇编这些挑战的最先进的关系抽取器进行全面的性能分析,本文的目的是弥合这一差距。我们的目标是调查可能阻碍神经关系提取的以数据为中心的特征。基于15种最先进的关系提取算法(从递归体系结构到大型语言模型)和7个大规模数据集进行的广泛实验,该研究表明,现代关系抽取器对复杂数据和关系特征的健壮性不强。它强调关键问题,如上下文歧义、关联关系、长尾数据和细粒度关系分布。此外,它为缓解这些问题的未来方向设定了一个标志,从而被证明是新手和高级研究人员的关键资源。有效地处理所描述的挑战可能会对信息提取领域产生重大影响,信息提取是搜索引擎和聊天机器人等流行系统的关键部分。数据和相关代码可在此HTTPS URL中找到。

[NLP-46] Just ASR + LLM? A Study on Speech Large Language Models Ability to Identify and Understand Speaker in Spoken Dialogue
[NLP-46] 只是ASR + LLM?语音大语言模型识别和理解口语对话中说话人的能力研究

链接: https://arxiv.org/abs/2409.04927
作者: Junkai Wu,Xulin Fan,Bo-Ru Lu,Xilin Jiang,Nima Mesgarani,Mark Hasegawa-Johnson,Mari Ostendorf
关键词-EN: speech language models, recent years, English listening test, observed a rapid, rapid advancement
关键词-ZH: 言语语言模型,近年来,英语听力测试,观察到了迅速、迅速的进步
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans’ listening and reasoning abilities. Remarkably, SpeechLLMs have demonstrated impressive spoken dialogue question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao’s questions, we find the correct answers to many questions can be inferred from the conversation context alone without identifying the speaker asked in the question. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM in both Gaokao and our proposed “What Do You Like?” dataset shows a significantly higher accuracy in these context-based questions than in identity-critical questions, which can only be answered correctly with correct speaker identification. Our results and analysis suggest that when solving SQA, the current SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound. We propose that our definitions and automated classification of context-based and identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA tasks.
摘要:近年来,我们观察到语音语言模型(SpeechLLMs)的快速发展,其听力和推理能力正在赶上人类。值得注意的是,SpeechLLMs在高考英语听力测试(中国大学入学考试的英语听力部分)等基准测试中的口语对话问答(SQA)表现令人印象深刻,这类测试似乎需要同时理解对话中说话者的口语内容和声音特征。然而,在仔细检查高考的问题后,我们发现许多问题的正确答案可以仅从对话上下文中推断出来,而不需要确定问题中的说话人。我们在高考和我们提出的“What Do You Like?”数据集上对最先进的Qwen-Audio和WavLLM模型的评估显示,这些基于上下文的问题的准确率明显高于身份关键问题,后者只有在正确识别说话人的情况下才能正确回答。我们的结果和分析表明,在解决SQA问题时,当前的SpeechLLMs表现出从音频中对说话人的有限感知,其行为类似于LLM在没有声音的情况下从对话转录进行推理。我们建议,我们对基于上下文的问题和身份关键问题的定义和自动分类可以为SQA任务中的SpeechLLMs提供更准确的评估框架。

[NLP-47] Achieving Peak Performance for Large Language Models : A Systematic Review
[NLP-47] 实现大型语言模型的峰值性能:系统性回顾

链接: https://arxiv.org/abs/2409.04833
作者: Zhyar Rzgar K Rostam,Sándor Szénási,Gábor Kertész
关键词-EN: achieved remarkable success, natural language processing, achieved remarkable, remarkable success, success in natural
关键词-ZH: 取得了显着的成功,自然语言处理,取得了显着的、显着的成功,在自然领域取得了成功
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages, 7 figures, 8 tables. Journal Article: IEEE Access

点击查看摘要

Abstract:In recent years, large language models (LLMs) have achieved remarkable success in natural language processing (NLP). LLMs require an extreme amount of parameters to attain high performance. As models grow into the trillion-parameter range, computational and memory costs increase significantly. This makes it difficult for many researchers to access the resources needed to train or apply these models. Optimizing LLM performance involves two main approaches: fine-tuning pre-trained models for specific tasks to achieve state-of-the-art performance, and reducing costs or improving training time while maintaining similar performance. This paper presents a systematic literature review (SLR) following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. We reviewed 65 publications out of 983 from 2017 to December 2023, retrieved from 5 databases. The study presents methods to optimize and accelerate LLMs while achieving cutting-edge results without sacrificing accuracy. We begin with an overview of the development of language modeling, followed by a detailed explanation of commonly used frameworks and libraries, and a taxonomy for improving and speeding up LLMs based on three classes: LLM training, LLM inference, and system serving. We then delve into recent optimization and acceleration strategies such as training optimization, hardware optimization, scalability and reliability, accompanied by the taxonomy and categorization of these strategies. Finally, we provide an in-depth comparison of each class and strategy, with two case studies on optimizing model training and enhancing inference efficiency. These case studies showcase practical approaches to address LLM resource limitations while maintaining performance.
摘要:近年来,大语言模型在自然语言处理中取得了显著的成功。LLM需要极大量的参数才能获得高性能。随着模型增长到万亿参数范围,计算和内存成本显著增加。这使得许多研究人员很难获得培训或应用这些模型所需的资源。优化LLM性能涉及两个主要方法:针对特定任务微调预先训练的模型以获得最先进的性能,以及在保持类似性能的同时降低成本或缩短培训时间。本文按照系统性综述和荟萃分析的首选报告项目(PRISMA)声明,进行了系统性文献综述(SLR)。我们回顾了2017年至2023年12月从5个数据库检索到的983篇出版物中的65篇。这项研究提出了优化和加速LLMS的方法,同时在不牺牲精度的情况下获得尖端结果。我们首先概述语言建模的发展,然后详细解释常用的框架和库,以及基于三个类别改进和加速LLM的分类:LLM培训、LLM推理和系统服务。然后,我们深入研究了最新的优化和加速策略,如训练优化、硬件优化、可扩展性和可靠性,并对这些策略进行了分类和归类。最后,我们对每个类别和策略进行了深入的比较,并给出了优化模型训练和提高推理效率的两个案例。这些案例研究展示了在保持性能的同时解决LLM资源限制的实用方法。

[NLP-48] MILE: A Mutation Testing Framework of In-Context Learning Systems
[NLP-48] MILE:上下文内学习系统的突变测试框架

链接: https://arxiv.org/abs/2409.04831
作者: Zeming Wei,Yihao Zhang,Meng Sun
关键词-EN: achieved notable success, large language models, achieved notable, notable success, applications of large
关键词-ZH: 取得了显着的成功,大型语言模型,取得了显着的成功,大型应用
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. Such mysterious ability of LLMs has attracted great research interests in understanding, formatting, and improving the in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity against the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at this https URL.
摘要:情境学习(ICL)在大型语言模型的应用中取得了显著的成功。通过只添加几个演示新任务的输入输出对,LLM可以在不修改模型参数的情况下在推理过程中有效地学习任务。LLMS这种神秘的能力在理解、格式化和改进情景演示方面吸引了极大的研究兴趣,但仍然存在黑箱机制和对样本选择的敏感性等缺陷。在这项工作中,受机器学习(ML)系统采用测试技术的基础启发,我们提出了一个突变测试框架,旨在表征ICL系统测试数据的质量和有效性。首先,我们提出了几种专门用于ICL演示的变异算子,以及相应的ICL测试集的变异分数。通过全面的实验,我们展示了我们的框架在评估ICL测试集的可靠性和质量方面的有效性。我们的代码可以在这个HTTPS URL上找到。

[NLP-49] Exploring Straightforward Conversational Red-Teaming
[NLP-49] 探索直接的对话式红色团队

链接: https://arxiv.org/abs/2409.04822
作者: George Kour,Naama Zwerdling,Marcel Zalmanovici,Ateret Anaby-Tavor,Ora Nova Fandina,Eitan Farchi
关键词-EN: Large language models, business dialogue systems, Large language, ethical risks, business dialogue
关键词-ZH: 大型语言模型、商业对话系统、大型语言、道德风险、商业对话
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in business dialogue systems but they pose security and ethical risks. Multi-turn conversations, where context influences the model’s behavior, can be exploited to produce undesired responses. In this paper, we examine the effectiveness of utilizing off-the-shelf LLMs in straightforward red-teaming approaches, where an attacker LLM aims to elicit undesired output from a target LLM, comparing both single-turn and conversational red-teaming tactics. Our experiments offer insights into various usage strategies that significantly affect their performance as red teamers. They suggest that off-the-shelf models can act as effective red teamers and even adjust their attack strategy based on past attempts, although their effectiveness decreases with greater alignment.
摘要:大型语言模型(LLM)越来越多地用于商业对话系统,但它们也带来了安全和道德风险。上下文影响模型行为的多轮对话可以被利用来产生不希望的响应。在本文中,我们研究了在简单的红色团队方法中利用现成的LLM的有效性,其中攻击者LLM旨在从目标LLM中获取不希望的输出,并比较了单回合和对话式红色团队策略。我们的实验深入了解了各种使用策略,这些策略显着影响他们作为红色团队成员的表现。他们认为,现成的模型可以充当有效的红色团队成员,甚至根据过去的尝试调整他们的攻击策略,尽管它们的有效性随着一致性的提高而降低。

[NLP-50] Phrase-Level Adversarial Training for Mitigating Bias in Neural Network-based Automatic Essay Scoring
[NLP-50] 用于减轻基于神经网络的论文自动评分中偏见的短语级对抗训练

链接: https://arxiv.org/abs/2409.04795
作者: Haddad Philip,Tsegaye Misikir Tashu
关键词-EN: Automatic Essay Scoring, Automatic Essay, educational purposes, candidates for educational, AES
关键词-ZH: 自动论文评分、自动论文、教育目的、教育候选人、AES
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automatic Essay Scoring (AES) is widely used to evaluate candidates for educational purposes. However, due to the lack of representative data, most existing AES systems are not robust, and their scoring predictions are biased towards the most represented data samples. In this study, we propose a model-agnostic phrase-level method to generate an adversarial essay set to address the biases and robustness of AES models. Specifically, we construct an attack test set comprising samples from the original test set and adversarially generated samples using our proposed method. To evaluate the effectiveness of the attack strategy and data augmentation, we conducted a comprehensive analysis utilizing various neural network scoring models. Experimental results show that the proposed approach significantly improves AES model performance in the presence of adversarial examples and scenarios without such attacks.
摘要:自动论文评分(AES)广泛用于出于教育目的评估候选人。然而,由于缺乏代表性数据,大多数现有的AES系统并不稳健,并且它们的评分预测偏向最具代表性的数据样本。在这项研究中,我们提出了一种模型不可知的短语级方法来生成对抗性论文集,以解决AES模型的偏见和稳健性。具体来说,我们构建了一个攻击测试集,其中包括来自原始测试集的样本和使用我们提出的方法反向生成的样本。为了评估攻击策略和数据增强的有效性,我们利用各种神经网络评分模型进行了全面分析。实验结果表明,在没有此类攻击的情况下,所提出的方法显着提高了AES模型的性能。

[NLP-51] Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models
[NLP-51] 选择性自我排练:一种改进大型语言模型中概括的微调方法

链接: https://arxiv.org/abs/2409.04787
作者: Sonam Gupta,Yatin Nandwani,Asaf Yehudai,Mayank Mishra,Gaurav Pandey,Dinesh Raghu,Sachindra Joshi
关键词-EN: Fine-tuning Large Language, Large Language Models, Large Language, Fine-tuning Large, Language Models
关键词-ZH: 微调大型语言,大型语言模型,大型语言,微调大型,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) on specific datasets is a common practice to improve performance on target tasks. However, this performance gain often leads to overfitting, where the model becomes too specialized in either the task or the characteristics of the training data, resulting in a loss of generalization. This paper introduces Selective Self-Rehearsal (SSR), a fine-tuning approach that achieves performance comparable to the standard supervised fine-tuning (SFT) while improving generalization. SSR leverages the fact that there can be multiple valid responses to a query. By utilizing the model’s correct responses, SSR reduces model specialization during the fine-tuning stage. SSR first identifies the correct model responses from the training set by deploying an appropriate LLM as a judge. Then, it fine-tunes the model using the correct model responses and the gold response for the remaining samples. The effectiveness of SSR is demonstrated through experiments on the task of identifying unanswerable queries across various datasets. The results show that standard SFT can lead to an average performance drop of up to 16.7% on multiple benchmarks, such as MMLU and TruthfulQA. In contrast, SSR results in close to 2% drop on average, indicating better generalization capabilities compared to standard SFT.
摘要:对特定数据集的大型语言模型(LLM)进行微调是提高目标任务性能的常见做法。然而,这种性能收益经常导致过度拟合,其中模型在任务或训练数据的特征上变得过于专门化,导致失去泛化。本文介绍了选择性自排练(SSR),这是一种微调方法,在提高泛化能力的同时获得了与标准监督微调(SFT)相当的性能。SSR利用了对一个查询可以有多个有效响应这一事实。通过利用模型的正确响应,SSR在微调阶段减少了模型专门化。SSR首先通过部署适当的LLM作为判断,从训练集中识别正确的模型响应。然后,它使用正确的模型响应和剩余样本的GOLD响应来微调模型。通过在不同数据集中识别无法回答的查询的实验,证明了SSR的有效性。结果表明,标准SFT在MMLU和TruthfulQA等多个基准测试上的平均性能降幅高达16.7%。相比之下,SSR的结果平均下降了近2%,这表明与标准SFT相比,SSR具有更好的泛化能力。
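The data-construction step the abstract describes (keep the model's own response as the fine-tuning target when a judge deems it correct, otherwise fall back to the gold response) can be sketched as follows. This is an illustrative sketch, not the paper's implementation; `judge` stands in for the judging LLM:

```python
def select_ssr_targets(examples, judge):
    # Selective Self-Rehearsal data construction: for each prompt, keep
    # the model's own response as the target when the judge deems it
    # correct; otherwise fall back to the gold response.
    # `judge` is any callable (e.g. an LLM scorer) returning True for
    # correct answers.
    data = []
    for prompt, model_resp, gold_resp in examples:
        target = model_resp if judge(prompt, model_resp) else gold_resp
        data.append((prompt, target))
    return data
```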

[NLP-52] LoCa: Logit Calibration for Knowledge Distillation ECAI2024
[NLP-52] LoCa:用于知识蒸馏的Logit校准

链接: https://arxiv.org/abs/2409.04778
作者: Runming Yang,Taiqiang Wu,Yujiu Yang
关键词-EN: aiming to train, plays an important, important role, model compression, teacher model
关键词-ZH: 旨在培训,发挥着重要非常重要的作用,模型压缩,教师模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by ECAI 2024

点击查看摘要

Abstract:Knowledge Distillation (KD), aiming to train a better student model by mimicking the teacher model, plays an important role in model compression. One typical way is to align the output logits. However, we find a common issue named mis-instruction, that the student would be misled when the predictions based on teacher logits do not follow the labels. Meanwhile, there is other useful dark knowledge in the logits such as the class discriminability, which is vital for distillation. In this paper, we propose a simple yet effective Logit Calibration (LoCa) method, which calibrates the logits from the teacher model based on the ground-truth labels. The key insight is to correct the prediction (to address the mis-instruction issue) and maintain useful dark knowledge simultaneously. Our proposed LoCa does not require any additional parameters. Empirical results on image classification and text generation tasks demonstrate that LoCa can effectively improve the performance of baselines.
摘要:知识蒸馏(KD)旨在通过模仿教师模型来训练更好的学生模型,在模型压缩中发挥着重要作用。一种典型的方法是对齐输出逻辑。然而,我们发现了一个常见的问题,即错误指导,即当基于教师日志的预测不遵循标签时,学生就会被误导。与此同时,逻辑中还有其他有用的暗知识,例如类可辨别性,这对于蒸馏至关重要。在本文中,我们提出了一种简单而有效的Logit校准(LoCa)方法,该方法基于地面真相标签从教师模型中校准Logit。关键的见解是纠正预测(以解决错误指导问题)并同时维护有用的暗知识。我们提出的LoCa不需要任何额外的参数。图像分类和文本生成任务的经验结果表明,LoCa可以有效提高基线的性能。
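The stated goal (correct a mis-instructed prediction while preserving the dark knowledge in the logits) can be illustrated with a minimal sketch. The swap rule below is an assumption made for illustration only; the paper's actual calibration formula may differ:

```python
import numpy as np

def calibrate_logits(teacher_logits, gold):
    # If the teacher's top-1 class disagrees with the gold label (the
    # "mis-instruction" case), swap the two logit values: the gold class
    # becomes the argmax while the multiset of logits -- and hence the
    # relative class discriminability ("dark knowledge") -- is preserved.
    z = np.asarray(teacher_logits, dtype=float).copy()
    pred = int(z.argmax())
    if pred != gold:
        z[pred], z[gold] = z[gold], z[pred]
    return z
```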

[NLP-53] Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models
[NLP-53] 解开结:语言模型长上下文预训练的有效数据增强策略

链接: https://arxiv.org/abs/2409.04774
作者: Junfeng Tian,Da Zheng,Yang Cheng,Rui Wang,Colin Zhang,Debing Zhang
关键词-EN: Large language models, Large language, prioritized expanding, Large, context window
关键词-ZH: 大型语言模型、大型语言、优先扩展、大型、上下文窗口
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have prioritized expanding the context window from which models can incorporate more information. However, training models to handle long contexts presents significant challenges. These include the scarcity of high-quality natural long-context data, the potential for performance degradation on short-context tasks, and the reduced training efficiency associated with attention mechanisms. In this paper, we introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase, designed to efficiently enable LLMs to gain long-context capabilities without the need to modify the existing data mixture. In particular, we chunk the documents, shuffle the chunks, and create a complex and knotted structure of long texts; LLMs are then trained to untie these knots and identify relevant segments within seemingly chaotic token sequences. This approach greatly improves the model’s performance by accurately attending to relevant information in long context, and the training efficiency is also largely increased. We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length, significantly outperforming other long context strategies. The trained models will be open-sourced for further research.
摘要:大型语言模型(LLM)优先考虑扩展上下文窗口,在该窗口中模型可以包含更多信息。然而,处理长背景的培训模式带来了巨大的挑战。这些问题包括缺乏高质量的自然长背景数据,短背景任务的性能可能下降,以及与注意力机制相关的训练效率降低。在本文中,我们介绍了一种在继续预训练阶段采用的新的数据增强策略–Untie the Knots(UtK),该策略旨在使LLMS在不需要修改现有数据混合的情况下有效地获得长上下文能力。特别是,我们将文档分块,洗牌,并创建一个复杂而打结的长文本结构;然后,LLM被训练来解开这些结,并在看似混乱的令牌序列中识别相关片段。该方法通过在长上下文中准确地处理相关信息,极大地提高了模型的性能,也大大提高了训练效率。我们在具有7B和72B参数的模型上进行了广泛的实验,在200亿个令牌上进行了训练,结果表明,在128K上下文长度下,UtK在RULER基准上的准确率分别达到了75%和84.5%,显著优于其他长上下文策略。经过训练的模型将开放源代码以供进一步研究。
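The chunk-and-shuffle idea behind UtK can be sketched in a few lines. The chunk size, the (doc_id, position) tagging, and the reconstruction target below are illustrative assumptions, not the paper's exact data format:

```python
import random

def knot_documents(docs, chunk_size=4, seed=0):
    # Toy chunk-and-shuffle augmentation: split each document into
    # fixed-size chunks tagged with (doc_id, position), then shuffle all
    # chunks into one "knotted" pool. A model trained on such data must
    # learn to recover the original segments from the chaotic order.
    rng = random.Random(seed)
    chunks = []
    for doc_id, doc in enumerate(docs):
        words = doc.split()
        for pos, i in enumerate(range(0, len(words), chunk_size)):
            chunks.append((doc_id, pos, " ".join(words[i:i + chunk_size])))
    rng.shuffle(chunks)
    return chunks

def untie(chunks):
    # Recover each original document by sorting its chunks back into order.
    docs = {}
    for doc_id, pos, text in chunks:
        docs.setdefault(doc_id, []).append((pos, text))
    return [" ".join(t for _, t in sorted(parts))
            for _, parts in sorted(docs.items())]
```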

[NLP-54] Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
[NLP-54] 后期分块:使用长上下文嵌入模型的上下文块嵌入

链接: https://arxiv.org/abs/2409.04701
作者: Michael Günther,Isabelle Mohr,Bo Wang,Han Xiao
关键词-EN: cases require retrieving, require retrieving smaller, retrieving smaller portions, shorter text segments, dense vector-based retrieval
关键词-ZH: 案例需要检索、需要检索更小、检索更小的部分、更短的文本片段、密集的基于向量的检索
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 4 pages, early draft

点击查看摘要

Abstract:Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be “over-compressed” in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in suboptimal representations. In this paper, we introduce a novel method called “late chunking,” which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks without the need for additional training. Moreover, our method is generic enough to be applied to any long-context embedding model.
摘要:许多用例需要检索较小的文本部分,而密集的基于向量的检索系统通常在文本片段较短时表现得更好,因为语义在嵌入中不太可能被“过度压缩”。因此,从业者经常将文本文档分成较小的块并分别编码。然而,以这种方式创建的块嵌入可能会丢失周围块的上下文信息,从而导致次优的表示。在本文中,我们引入了一种名为“后期分块”的新颖方法,该方法利用长上下文嵌入模型首先嵌入长文本的所有标记,在Transformer模型之后和均值池之前应用分块。由此产生的块嵌入捕获了完整的上下文信息,从而在各种检索任务中获得卓越的结果,而无需额外的训练。此外,我们的方法足够通用,可以应用于任何长上下文嵌入模型。
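The core mechanism (encode the whole text first, then mean-pool each chunk's span of contextualized token embeddings) can be sketched as follows. This is a toy illustration, not the authors' code: `toy_encode` is a stand-in for a real long-context embedding model, and the whitespace tokenizer is an assumption:

```python
import hashlib
import numpy as np

def toy_encode(text, dim=8):
    # Stand-in for a long-context transformer: deterministic per-token
    # vectors mixed with a running (causal) average, so each token's
    # embedding depends on everything that precedes it.
    tokens = text.split()
    vecs = np.stack([np.frombuffer(
        hashlib.sha256(t.encode()).digest()[:dim], dtype=np.uint8
    ).astype(float) for t in tokens])
    ctx = np.cumsum(vecs, axis=0) / np.arange(1, len(tokens) + 1)[:, None]
    return 0.5 * vecs + 0.5 * ctx            # (num_tokens, dim)

def naive_chunk_embeddings(chunks):
    # Traditional pipeline: encode each chunk in isolation, then mean-pool.
    return [toy_encode(c).mean(axis=0) for c in chunks]

def late_chunk_embeddings(chunks):
    # Late chunking: encode the full text once, then mean-pool each
    # chunk's span of already-contextualized token embeddings.
    full = toy_encode(" ".join(chunks))
    out, start = [], 0
    for c in chunks:
        n = len(c.split())
        out.append(full[start:start + n].mean(axis=0))
        start += n
    return out
```

With the causal toy encoder, only the first chunk is identical under both schemes; every later chunk's late embedding carries context from the text before it.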

[NLP-55] QueryBuilder: Human-in-the-Loop Query Development for Information Retrieval
[NLP-55] QueryBuilder:信息检索的人在环查询开发

链接: https://arxiv.org/abs/2409.04667
作者: Hemanth Kandula,Damianos Karakos,Haoling Qiu,Benjamin Rozonoyer,Ian Soboroff,Lee Tarlin,Bonan Min
关键词-EN: define finer-grained queries, finer-grained queries covering, cross-lingual information retrieval, Information Retrieval, important aspects
关键词-ZH: 定义细粒度查询、细粒度查询覆盖、跨语言信息检索、信息检索、重要方面
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Frequently, users of an Information Retrieval (IR) system start with an overarching information need (a.k.a., an analytic task) and proceed to define finer-grained queries covering various important aspects (i.e., sub-topics) of that analytic task. We present a novel, interactive system called QueryBuilder, which allows a novice, English-speaking user to create queries with a small amount of effort, through efficient exploration of an English development corpus in order to rapidly develop cross-lingual information retrieval queries corresponding to the user’s information needs. QueryBuilder performs near real-time retrieval of documents based on user-entered search terms; the user looks through the retrieved documents and marks sentences as relevant to the information needed. The marked sentences are used by the system as additional information in query formation and refinement: query terms (and, optionally, event features, which capture event ‘triggers’ (indicator terms) and agent/patient roles) are appropriately weighted, and a neural-based system, which better captures textual meaning, retrieves other relevant content. The process of retrieval and marking is repeated as many times as desired, giving rise to increasingly refined queries in each iteration. The final product is a fine-grained query used in Cross-Lingual Information Retrieval (CLIR). Our experiments using analytic tasks and requests from the IARPA BETTER IR datasets show that with a small amount of effort (at most 10 minutes per sub-topic), novice users can form useful fine-grained queries, including in languages they don’t understand. QueryBuilder also provides beneficial capabilities to the traditional corpus exploration and query formation process. A demonstration video is released at this https URL
摘要:通常,信息检索(IR)系统的用户从总体信息需求(也称为分析任务)开始,并接着定义覆盖该分析任务的各个重要方面(即,子主题)的更细粒度的查询。我们提出了一种新型的交互式系统QueryBuilder,它允许讲英语的新手用户通过高效探索英语开发语料库,以少量精力创建与其信息需求对应的跨语言信息检索查询。QueryBuilder根据用户输入的搜索词执行近乎实时的文档检索;用户查看检索到的文档,并将句子标记为与所需信息相关。被标记的句子被系统用作查询形成和精化中的附加信息:查询项(以及可选的,捕获事件触发器(指示项)和代理/患者角色的事件特征)被适当加权,并且更好地捕获文本含义的基于神经的系统检索其他相关内容。检索和标记的过程被重复尽可能多的次数,在每次迭代中产生越来越精细的查询。最终的产品是用于跨语言信息检索(CLIR)的细粒度查询。我们使用IARPA Better IR数据集的分析任务和请求进行的实验表明,只需少量努力(每个子主题最多10分钟),新手用户就可以形成有用的细粒度查询,包括他们不理解的语言。QueryBuilder还为传统的语料库探索和查询形成过程提供了有益的功能。演示视频在此HTTPS URL上发布

[NLP-56] Sparse Rewards Can Self-Train Dialogue Agents
[NLP-56] 稀疏的奖励可以自我训练对话代理人

链接: https://arxiv.org/abs/2409.04617
作者: Barrett Martin Lattimer,Varun Gangal,Ryan McDonald,Yi Yang
关键词-EN: Large Language Model, Large Language, multi-turn dialogue tasks, Recent advancements, Language Model
关键词-ZH: 大型语言模型,大型语言,多轮对话任务,最新进展,语言模型
类目: Computation and Language (cs.CL)
备注: Minor but nontrivial changes likely

点击查看摘要

Abstract:Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub.
摘要:最近在最先进的(SOTA)大语言模型(LLM)代理方面的进展,特别是在多轮对话任务中,主要是由监督微调和高质量的人类反馈推动的。然而,随着基础LLM模型的不断改进,获取有意义的人类反馈变得越来越具有挑战性和成本。在某些领域,基本的LLM代理最终可能超出人类的能力,使得传统的反馈驱动方法不切实际。在本文中,我们引入了一种新的自我改进范式,使LLM代理能够在没有外部人类反馈的情况下自主提高其性能。我们的方法,并行结果模拟收获(JOSH),是一种自对准算法,利用稀疏奖励模拟环境来提取理想行为,并根据自己的输出进一步训练LLM。提出了一种基于MultiWOZ的稀疏奖励工具调用仿真环境–ToolWOZ。我们证明,使用JOSH训练的模型,无论是小型的还是前沿的,都显著改进了基于工具的交互,同时保持了跨不同基准的通用模型能力。我们的代码和数据在GitHub上是公开的。

[NLP-57] BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
[NLP-57] BPE变得挑剔:Tokenizer培训期间有效的词汇精炼

链接: https://arxiv.org/abs/2409.04599
作者: Pavel Chizhov,Catherine Arnett,Elizaveta Korotkova,Ivan P. Yamshchikov
关键词-EN: Language models, efficient tokenization, models can largely, largely benefit, benefit from efficient
关键词-ZH: 语言模型,高效的标记化,模型可以在很大程度上受益于高效的
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that our method does not reduce the downstream performance, and in several cases improves it.
摘要:语言模型可以在很大程度上受益于高效的标记化。然而,他们仍然主要使用经典的BPE算法,这是一种简单可靠的方法。事实证明,这会导致诸如训练不足的令牌和次优压缩等问题,从而可能影响下游性能。我们引入Picky BPE,这是一种修改后的BPE算法,可以在标记器训练期间执行词汇细化。我们的方法提高了词汇效率,消除了训练不足的标记,并且不会损害文本压缩。我们的实验表明,我们的方法不会降低下游性能,而是在某些情况下改进了它。
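A plain BPE trainer with a toy in-training refinement step might look like the sketch below. The eviction criterion (a simple frequency threshold applied after each merge) is a guess made for illustration; Picky BPE's actual refinement rule is not specified here:

```python
from collections import Counter

def train_bpe(corpus, num_merges, min_freq=2):
    # Minimal BPE trainer with one "picky" twist: after each merge, any
    # multi-character token that no longer reaches min_freq occurrences
    # in the working corpus is evicted from the vocabulary, instead of
    # surviving as an under-trained token.
    words = [list(w) for w in corpus]
    vocab = set(ch for w in words for ch in w)
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        for w in words:                       # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
        # refinement step: evict tokens that became rare after the merge
        freq = Counter(t for w in words for t in w)
        vocab = {t for t in vocab if len(t) == 1 or freq[t] >= min_freq}
    return vocab, words
```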

[NLP-58] Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance
[NLP-58] Paper Copilot:一个自我进化且高效的LLM系统,用于个性化学术援助

链接: https://arxiv.org/abs/2409.04593
作者: Guanyu Lin,Tao Feng,Pengrui Han,Ge Liu,Jiaxuan You
关键词-EN: reading vast amounts, scientific research proliferates, Paper Copilot, amounts of literature, face the daunting
关键词-ZH: 大量阅读,科学研究激增,论文副驾驶,大量文献,面临艰巨的任务
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As scientific research proliferates, researchers face the daunting task of navigating and reading vast amounts of literature. Existing solutions, such as document QA, fail to provide personalized and up-to-date information efficiently. We present Paper Copilot, a self-evolving, efficient LLM system designed to assist researchers, based on thought-retrieval, user profile and high performance optimization. Specifically, Paper Copilot can offer personalized research services, maintaining a real-time updated database. Quantitative evaluation demonstrates that Paper Copilot saves 69.92% of time after efficient deployment. This paper details the design and implementation of Paper Copilot, highlighting its contributions to personalized academic support and its potential to streamline the research process.
摘要:随着科学研究的激增,研究人员面临着导航和阅读大量文献的艰巨任务。现有的解决方案(例如文档QA)无法有效地提供个性化和最新的信息。我们介绍Paper Copilot,这是一个自我进化的高效LLM系统,旨在基于思想检索、用户配置文件和高性能优化来协助研究人员。具体来说,Paper Copilot可以提供个性化的研究服务,维护实时更新的数据库。定量评估表明,Paper Copilot在高效部署后节省了69.92%的时间。本文详细介绍了Paper Copilot的设计和实现,强调了其对个性化学术支持的贡献及其简化研究流程的潜力。

[NLP-59] Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning
[NLP-59] 使用参数高效微调自定义大型语言模型生成风格

链接: https://arxiv.org/abs/2409.04574
作者: Xinyue Liu,Harshita Diddee,Daphne Ippolito
关键词-EN: large language models, large language, language models, Abstract, LLMs
关键词-ZH: 大型语言模型,大型语言,语言模型,抽象,LLM
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:One-size-fits-all large language models (LLMs) are increasingly being used to help people with their writing. However, the style these models are trained to write in may not suit all users or use cases. LLMs would be more useful as writing assistants if their idiolect could be customized to match each user. In this paper, we explore whether parameter-efficient finetuning (PEFT) with Low-Rank Adaptation can effectively guide the style of LLM generations. We use this method to customize LLaMA-2 to ten different authors and show that the generated text has lexical, syntactic, and surface alignment with the target author but struggles with content memorization. Our findings highlight the potential of PEFT to support efficient, user-level customization of LLMs.
摘要:一刀切的大型语言模型(LLM)越来越多地被用来帮助人们写作。然而,这些模型接受训练以编写的风格可能并不适合所有用户或用例。如果LLM的习语可以定制以匹配每个用户,那么LLM作为写作助理将更有用。在本文中,我们探讨具有低等级自适应的参数高效微调(PEFT)是否可以有效指导LLM世代的风格。我们使用这种方法将LLaMA-2定制为十个不同的作者,并表明生成的文本与目标作者具有词汇、语法和表面对齐,但在内容记忆方面遇到困难。我们的研究结果强调了PEFT支持LLM高效的用户级定制的潜力。
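The Low-Rank Adaptation mechanism behind this kind of PEFT customization can be sketched generically (this is an illustration of LoRA itself, not the authors' training code; the dimensions and init scheme are assumptions):

```python
import numpy as np

# LoRA sketch: the frozen base weight W (d_out x d_in) is augmented by a
# low-rank product B @ A, where A is (r x d_in) and B is (d_out x r).
# Only A and B would be trained, so per-layer trainable parameters drop
# from d_out*d_in to r*(d_in + d_out).
rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.normal(size=(d_out, d_in))           # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable rank-r factor
B = np.zeros((d_out, r))                     # zero init: no change at start

def lora_forward(x, alpha=1.0):
    return x @ (W + alpha * (B @ A)).T

full_params = d_out * d_in                   # 4096 per layer
lora_params = r * (d_in + d_out)             # 512 per layer
```

With B initialized to zero, the adapted layer starts out identical to the base model, and each per-author adapter adds only the small A/B factors.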

[NLP-60] How Does Code Pretraining Affect Language Model Task Performance?
[NLP-60] 代码预训练如何影响语言模型任务性能?

链接: https://arxiv.org/abs/2409.04556
作者: Jackson Petty,Sjoerd van Steenkiste,Tal Linzen
关键词-EN: Large language models, Large language, increasingly trained, language, Large
关键词-ZH: 大型语言模型,大型语言,训练日益丰富,语言,大型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining corpora may improve performance on other, unrelated tasks, yet to date no work has been able to establish a causal connection by controlling between language and code data. Here we do just this. We pretrain language models on datasets which interleave natural language and code in two different settings: competitive, in which the total volume of data seen during pretraining is held constant; and additive, in which the volume of language data is held constant. We study how the pretraining mixture affects performance on (a) a diverse collection of tasks included in the BigBench benchmark, and (b) compositionality, measured by generalization accuracy on semantic parsing and syntactic transformations. We find that pretraining on higher proportions of code improves performance on compositional tasks involving structured output (like semantic parsing), and mathematics. Conversely, increasing the code mixture can harm performance on other tasks, including on tasks that require sensitivity to linguistic structure such as syntax or morphology, and tasks measuring real-world knowledge.
摘要:大型语言模型越来越多地针对包含自然语言和源代码等非语言数据的语料库进行训练。除了帮助与编程相关的任务外,坊间证据表明,在预培训语料库中包括代码可能会提高其他无关任务的表现,但到目前为止,还没有任何研究能够通过控制语言和代码数据之间的因果联系。在这里,我们只做这件事。我们在两种不同的设置中交织自然语言和代码的数据集上预训练语言模型:竞争性,其中在预训练期间看到的总数据量保持不变;以及加性,其中语言数据量保持不变。我们研究了预训练混合如何影响(A)BigBench基准中包括的各种任务集合的性能,以及(B)合成性,通过语义分析和句法转换的泛化准确性来衡量。我们发现,在较高比例的代码上进行预训练可以提高涉及结构化输出(如语义分析)和数学的组合任务的性能。相反,增加代码混合可能会损害其他任务的性能,包括需要对语法或词法等语言结构敏感的任务,以及测量真实世界知识的任务。

[NLP-61] Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages
[NLP-61] 翻译提示链(CoTR):一种用于低资源语言的新型提示技术

链接: https://arxiv.org/abs/2409.04512
作者: Tejas Deshpande,Nidhi Kowtal,Raviraj Joshi
关键词-EN: paper introduces Chain, Chain of Translation, introduces Chain, Translation Prompting, paper introduces
关键词-ZH: 论文介绍链、翻译链、介绍链、翻译预算、论文介绍
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces Chain of Translation Prompting (CoTR), a novel strategy designed to enhance the performance of language models in low-resource languages. CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English. The specified task like generation, classification, or any other NLP function is then performed on the translated text, with the option to translate the output back to the original language if needed. All these steps are specified in a single prompt. We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi. The CoTR strategy is applied to various tasks, including sentiment analysis, hate speech classification, subject classification and text generation, and its efficacy is showcased by comparing it with regular prompting methods. Our results underscore the potential of translation-based prompting strategies to significantly improve multilingual LLM performance in low-resource languages, offering valuable insights for future research and applications. We specifically see the highest accuracy improvements with the hate speech detection task. The technique also has the potential to enhance the quality of synthetic data generation for underrepresented languages using LLMs.
摘要:本文介绍了翻译链提示(CoTR),这是一种旨在提高语言模型在低资源语言上性能的新策略。CoTR重构提示,首先将输入上下文从低资源语言翻译成英语等高资源语言,然后在译文上执行指定任务(如生成、分类或任何其他NLP功能),如果需要,还可以选择将输出翻译回原始语言;所有这些步骤都在同一条提示中指定。我们通过对低资源印度语系语言马拉地语的案例研究验证了该方法的有效性。CoTR策略被应用于情感分析、仇恨言论分类、主题分类和文本生成等多种任务,并通过与常规提示方法的比较展示了其有效性。我们的结果强调了基于翻译的提示策略在低资源语言上显著提升多语言LLM性能的潜力,为未来的研究和应用提供了有价值的见解。其中,仇恨言论检测任务的准确率提升最为显著。该技术还有望提高使用LLM为代表性不足的语言生成合成数据的质量。
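CoTR 把翻译、任务执行、回译三步写进同一条提示。下面是一个示意性的提示模板构造函数(模板措辞与函数名均为假设,并非论文原文):

```python
def build_cotr_prompt(input_text, task, src_lang="Marathi", pivot_lang="English"):
    """构造 CoTR 风格的单条提示(示意):
    在同一条提示内依次指定 翻译 -> 在译文上执行任务 -> 将结果翻回原语言。"""
    return (
        f"Step 1: Translate the following {src_lang} text into {pivot_lang}.\n"
        f"Text: {input_text}\n"
        f"Step 2: Perform the task of {task} on the translated text.\n"
        f"Step 3: Translate the final answer back into {src_lang}.\n"
        "Provide the output of each step."
    )
```

将该字符串作为单轮输入交给 LLM,即可在一次调用内完成摘要所述的整条翻译链。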

[NLP-62] 3D Data Long-Term Preservation in Cultural Heritage
[NLP-62] 文化遗产的3D数据长期保护

链接: https://arxiv.org/abs/2409.04507
作者: Nicola Amico,Achille Felicetti
关键词-EN: data management strategies, data management, ongoing data management, technological obsolescence, cultural heritage data
关键词-ZH: 数据管理策略、数据管理、持续数据管理、技术过时、文化遗产数据
类目: Information Theory (cs.IT); Computational Geometry (cs.CG); Computation and Language (cs.CL); Digital Libraries (cs.DL); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:The report explores the challenges and strategies for preserving 3D digital data in cultural heritage. It discusses the issue of technological obsolescence, emphasising the need for sustainable storage solutions and ongoing data management strategies. Key topics include understanding technological obsolescence, the lifecycle of digital content, digital continuity, data management plans (DMP), FAIR principles, and the use of public repositories. The report also covers the importance of metadata in long-term digital preservation, including types of metadata and strategies for building valuable metadata. It examines the evolving standards and interoperability in 3D format preservation and the importance of managing metadata and paradata. The document provides a comprehensive overview of the challenges and solutions for preserving 3D cultural heritage data in the long term.
摘要:该报告探讨了保存文化遗产中3D数字数据的挑战和策略。它讨论了技术过时的问题,强调需要可持续的存储解决方案和持续的数据管理策略。关键主题包括理解技术过时、数字内容的生命周期、数字连续性、数据管理计划(DMP)、FAIR原则以及公共存储库的使用。该报告还阐述了元数据在长期数字保存中的重要性,包括元数据的类型和构建有价值元数据的策略。它探讨了3D格式保存方面不断发展的标准和互操作性,以及管理元数据和辅助数据(paradata)的重要性。该文件全面概述了长期保存3D文化遗产数据的挑战和解决方案。

[NLP-63] Leveraging Large Language Models for Solving Rare MIP Challenges
[NLP-63] 利用大型语言模型解决罕见的MPP挑战

链接: https://arxiv.org/abs/2409.04464
作者: Teng Wang,Wing-Yin Yu,Ruifeng She,Wenhan Yang,Taijie Chen,Jianping Zhang
关键词-EN: Mixed Integer Programming, Mixed Integer, Integer Programming, tight time constraints, areas requiring mathematical
关键词-ZH: 混合工作组编程,混合工作组,工作组编程,时间紧迫,需要数学的领域
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Mixed Integer Programming (MIP) has been extensively applied in areas requiring mathematical solvers to address complex instances within tight time constraints. However, as the problem scale increases, the complexity of model formulation and finding feasible solutions escalates significantly. In contrast, the model-building cost for end-to-end models, such as large language models (LLMs), remains largely unaffected by problem scale due to their pattern recognition capabilities. While LLMs, like GPT-4, without fine-tuning, can handle some traditional medium-scale MIP problems, they struggle with uncommon or highly specialized MIP scenarios. Fine-tuning LLMs can yield some feasible solutions for medium-scale MIP instances, but these models typically fail to explore diverse solutions when constrained by a low and constant temperature, limiting their performance. In this paper, we propose and evaluate a recursively dynamic temperature method integrated with a chain-of-thought approach. Our findings show that starting with a high temperature and gradually lowering it leads to better feasible solutions compared to other dynamic temperature strategies. Additionally, by comparing results generated by the LLM with those from Gurobi, we demonstrate that the LLM can produce solutions that complement traditional solvers by accelerating the pruning process and improving overall efficiency.
摘要:混合整数规划(MIP)已被广泛应用于需要数学求解器在严格时间约束下处理复杂实例的领域。然而,随着问题规模的增大,模型构建和寻找可行解的复杂性显著上升。相比之下,大型语言模型(LLM)等端到端模型由于具备模式识别能力,其建模成本基本不受问题规模的影响。虽然像GPT-4这样的LLM在不微调的情况下可以处理一些传统的中等规模MIP问题,但它们在不常见或高度专业化的MIP场景中举步维艰。微调LLM可以为中等规模的MIP实例给出一些可行解,但这些模型在受限于较低且恒定的温度时,通常无法探索多样化的解,从而限制了性能。在本文中,我们提出并评估了一种与思维链方法相结合的递归动态温度方法。我们的发现表明,与其他动态温度策略相比,从较高温度开始并逐渐降温能得到更好的可行解。此外,通过将LLM生成的结果与Gurobi的结果进行比较,我们证明LLM可以通过加速剪枝过程和提高整体效率,生成与传统求解器互补的解。
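摘要所述"从高温开始并逐渐降温"的策略,可以用一个简单的几何衰减日程来示意(初值、下限与衰减率均为假设参数,并非论文设置):

```python
def recursive_temperature_schedule(t0=1.0, t_min=0.1, decay=0.7, steps=5):
    """递归动态温度日程(示意):从高温 t0 开始,每轮采样后按 decay 衰减,
    不低于下限 t_min。高温阶段鼓励探索多样解,低温阶段收敛到较优可行解。"""
    temps, t = [], t0
    for _ in range(steps):
        temps.append(max(t, t_min))
        t *= decay
    return temps
```

每轮用对应温度调用 LLM 采样候选解,即可复现"先探索、后收敛"的思路。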

[NLP-64] WET: Overcoming Paraphrasing Vulnerabilities in Embeddings-as-a-Service with Linear Transformation Watermarks
[NLP-64] WET:用线性转换水印克服嵌入即服务中的漏洞解释

链接: https://arxiv.org/abs/2409.04459
作者: Anudeex Shetty,Qiongkai Xu,Jey Han Lau
关键词-EN: supply embeddings generated, large language model, developers to supply, generated by LLMs, service offered
关键词-ZH: 提供生成的嵌入、大型语言模型、要提供的开发人员、由LLM生成、提供的服务
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in Progress

点击查看摘要

Abstract:Embeddings-as-a-Service (EaaS) is a service offered by large language model (LLM) developers to supply embeddings generated by LLMs. Previous research suggests that EaaS is prone to imitation attacks – attacks that clone the underlying EaaS model by training another model on the queried embeddings. As a result, EaaS watermarks are introduced to protect the intellectual property of EaaS providers. In this paper, we first show that existing EaaS watermarks can be removed by paraphrasing when attackers clone the model. Subsequently, we propose a novel watermarking technique that involves linearly transforming the embeddings, and show that it is empirically and theoretically robust against paraphrasing.
摘要:嵌入即服务(EaaS)是大型语言模型(LLM)开发者提供的一种服务,用于供应由LLM生成的嵌入。此前的研究表明,EaaS容易受到模仿攻击,即通过在查询得到的嵌入上训练另一个模型来克隆底层EaaS模型。因此,人们引入了EaaS水印来保护EaaS提供商的知识产权。在本文中,我们首先表明,当攻击者克隆模型时,可以通过改写(paraphrasing)去除现有的EaaS水印。随后,我们提出了一种对嵌入做线性变换的新型水印技术,并表明它在经验和理论上都能抵抗改写攻击。
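"对嵌入做线性变换"这一水印思路可用下面的 numpy 草图示意(仅演示可验证的线性关系,并非论文的完整方案;变量名均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))  # 服务端持有的秘密线性变换(示意)

def watermark(emb):
    # 对外返回的嵌入 = 原始嵌入经秘密矩阵 W 的线性变换
    return emb @ W.T

def verify(suspect_emb, original_emb):
    # 若可疑嵌入确由某个线性变换得到,最小二乘拟合应能几乎无残差地重建它
    W_hat, *_ = np.linalg.lstsq(original_emb, suspect_emb, rcond=None)
    recon_err = np.linalg.norm(original_emb @ W_hat - suspect_emb)
    return recon_err < 1e-6
```

关键性质是:线性变换保留嵌入间的线性结构,因此水印可被服务端验证,而逐条改写输入文本并不能破坏这种整体线性关系。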

[NLP-65] Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation
[NLP-65] 越长(不一定)越强:用于增强语音识别和翻译的间断长序列训练

链接: https://arxiv.org/abs/2409.05601
作者: Nithin Rao Koluguri,Travis Bartley,Hainan Xu,Oleksii Hrinchuk,Jagadeesh Balam,Boris Ginsburg,Georg Kucsko
关键词-EN:
关键词-ZH:
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted at SLT 2024

点击查看摘要


人工智能

[AI-0] Neural MP: A Generalist Neural Motion Planner

链接: https://arxiv.org/abs/2409.05864
作者: Murtaza Dalal,Jiahui Yang,Russell Mendonca,Youssef Khaky,Ruslan Salakhutdinov,Deepak Pathak
关键词-EN: consumes significant amounts, planning generates solutions, computational resources, motion planning, current paradigm
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Website at this http URL . Main paper: 7 pages, 4 figures, 2 tables. Appendix: 9 pages, 5 figures, 6 tables

点击查看摘要

Abstract:The current paradigm for motion planning generates solutions from scratch for every new problem, which consumes significant amounts of time and computational resources. For complex, cluttered scenes, motion planning approaches can often take minutes to produce a solution, while humans are able to accurately and safely reach any goal in seconds by leveraging their prior experience. We seek to do the same by applying data-driven learning at scale to the problem of motion planning. Our approach builds a large number of complex scenes in simulation, collects expert data from a motion planner, then distills it into a reactive generalist policy. We then combine this with lightweight optimization to obtain a safe path for real world deployment. We perform a thorough evaluation of our method on 64 motion planning tasks across four diverse environments with randomized poses, scenes and obstacles, in the real world, demonstrating improvements of 23%, 17%, and 79% in motion planning success rate over state-of-the-art sampling-based, optimization-based, and learning-based planning methods. Video results available at this http URL

[AI-1] Promptable Closed-loop Traffic Simulation

链接: https://arxiv.org/abs/2409.05863
作者: Shuhan Tan,Boris Ivanovic,Yuxiao Chen,Boyi Li,Xinshuo Weng,Yulong Cao,Philipp Krähenbühl,Marco Pavone
关键词-EN: autonomous driving development, efficient autonomous driving, cornerstone for safe, safe and efficient, efficient autonomous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted to CoRL 2024. Website available at this https URL

点击查看摘要

Abstract:Simulation stands as a cornerstone for safe and efficient autonomous driving development. At its core a simulation system ought to produce realistic, reactive, and controllable traffic patterns. In this paper, we propose ProSim, a multimodal promptable closed-loop traffic simulation framework. ProSim allows the user to give a complex set of numerical, categorical or textual prompts to instruct each agent’s behavior and intention. ProSim then rolls out a traffic scenario in a closed-loop manner, modeling each agent’s interaction with other traffic participants. Our experiments show that ProSim achieves high prompt controllability given different user prompts, while reaching competitive performance on the Waymo Sim Agents Challenge when no prompt is given. To support research on promptable traffic simulation, we create ProSim-Instruct-520k, a multimodal prompt-scenario paired driving dataset with over 10M text prompts for over 520k real-world driving scenarios. We will release code of ProSim as well as data and labeling tools of ProSim-Instruct-520k at this https URL.

[AI-2] Applying Attribution Explanations in Truth-Discovery Quantitative Bipolar Argumentation Frameworks

链接: https://arxiv.org/abs/2409.05831
作者: Xiang Yin,Nico Potyka,Francesca Toni
关键词-EN: receiving increasing attention, Bipolar Argumentation Frameworks, Quantitative Bipolar Argumentation, Explaining the strength, Argument Attribution Explanations
类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted at ArgXAI Workshop 2024

点击查看摘要

Abstract:Explaining the strength of arguments under gradual semantics is receiving increasing attention. For example, various studies in the literature offer explanations by computing the attribution scores of arguments or edges in Quantitative Bipolar Argumentation Frameworks (QBAFs). These explanations, known as Argument Attribution Explanations (AAEs) and Relation Attribution Explanations (RAEs), commonly employ removal-based and Shapley-based techniques for computing the attribution scores. While AAEs and RAEs have proven useful in several applications with acyclic QBAFs, they remain largely unexplored for cyclic QBAFs. Furthermore, existing applications tend to focus solely on either AAEs or RAEs, but do not compare them directly. In this paper, we apply both AAEs and RAEs to Truth Discovery QBAFs (TD-QBAFs), which assess the trustworthiness of sources (e.g., websites) and their claims (e.g., the severity of a virus), and feature complex cycles. We find that both AAEs and RAEs can provide interesting explanations and can give non-trivial and surprising insights.
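As a sketch of the removal-based attribution technique mentioned above (the Shapley-based variant is omitted), an argument's score can be taken as the drop in topic strength when that argument is removed; `strength_fn` below stands in for a gradual-semantics evaluator and is a hypothetical interface, not the paper's code:

```python
def removal_attribution(strength_fn, args):
    """Removal-based attribution sketch: score(a) = strength of the full
    framework minus strength of the framework with argument `a` removed.
    `strength_fn` maps a list of arguments to a topic-strength value."""
    full = strength_fn(args)
    return {a: full - strength_fn([b for b in args if b != a]) for a in args}
```

With an additive toy evaluator, each argument's score reduces to its own weight, which makes the definition easy to sanity-check.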

[AI-3] The Future of Software Testing: AI-Powered Test Case Generation and Validation

链接: https://arxiv.org/abs/2409.05808
作者: Mohammad Baqar,Rajat Khanda
关键词-EN: software development lifecycle, development lifecycle, meet necessary functional, crucial phase, testing
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 24 Pages

点击查看摘要

Abstract:Software testing is a crucial phase in the software development lifecycle (SDLC), ensuring that products meet necessary functional, performance, and quality benchmarks before release. Despite advancements in automation, traditional methods of generating and validating test cases still face significant challenges, including prolonged timelines, human error, incomplete test coverage, and high costs of manual intervention. These limitations often lead to delayed product launches and undetected defects that compromise software quality and user satisfaction. The integration of artificial intelligence (AI) into software testing presents a promising solution to these persistent challenges. AI-driven testing methods automate the creation of comprehensive test cases, dynamically adapt to changes, and leverage machine learning to identify high-risk areas in the codebase. This approach enhances regression testing efficiency while expanding overall test coverage. Furthermore, AI-powered tools enable continuous testing and self-healing test cases, significantly reducing manual oversight and accelerating feedback loops, ultimately leading to faster and more reliable software releases. This paper explores the transformative potential of AI in improving test case generation and validation, focusing on its ability to enhance efficiency, accuracy, and scalability in testing processes. It also addresses key challenges associated with adapting AI for testing, including the need for high quality training data, ensuring model transparency, and maintaining a balance between automation and human oversight. Through case studies and examples of real-world applications, this paper illustrates how AI can significantly enhance testing efficiency across both legacy and modern software systems.

[AI-4] Benchmarking Chinese Knowledge Rectification in Large Language Models

链接: https://arxiv.org/abs/2409.05806
作者: Tianhe Lu,Jizhan Fang,Yunzhi Yao,Xin Xu,Ningyu Zhang,Huajun Chen
关键词-EN: Large Language Models, exhibit remarkable generative, remarkable generative capabilities, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Ongoing work; code and dataset are available at this https URL

点击查看摘要

Abstract:While Large Language Models (LLMs) exhibit remarkable generative capabilities, they are not without flaws, particularly in the form of hallucinations. This issue is even more pronounced when LLMs are applied to specific languages and domains. For example, LLMs may generate nonsense information when handling Chinese ancient poetry, proverbs, or idioms, owing to the lack of specific knowledge. To this end, this paper introduces a benchmark for rectifying Chinese knowledge in LLMs via knowledge editing. Specifically, we introduce a new Chinese dataset, CKnowEdit, by collecting seven type of knowledge from various sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, thereby accounting for the unique polyphony, antithesis, and logical constructs inherent in the Chinese language. Through the analysis of this dataset, we uncover the challenges faced by current LLMs in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques on this dataset unveils the substantial scope for advancement in the rectification of Chinese knowledge. Code and dataset are available at this https URL.

[AI-5] Enhancing Preference-based Linear Bandits via Human Response Time

链接: https://arxiv.org/abs/2409.05798
作者: Shen Li,Yuyang Zhang,Zhaolin Ren,Claire Liang,Na Li,Julie A. Shah
关键词-EN: Binary human choice, Binary human, preference strength, response times, feedback is widely
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Binary human choice feedback is widely used in interactive preference learning for its simplicity, but it provides limited information about preference strength. To overcome this limitation, we leverage human response times, which inversely correlate with preference strength, as complementary information. Our work integrates the EZ-diffusion model, which jointly models human choices and response times, into preference-based linear bandits. We introduce a computationally efficient utility estimator that reformulates the utility estimation problem using both choices and response times as a linear regression problem. Theoretical and empirical comparisons with traditional choice-only estimators reveal that for queries with strong preferences (“easy” queries), choices alone provide limited information, while response times offer valuable complementary information about preference strength. As a result, incorporating response times makes easy queries more useful. We demonstrate this advantage in the fixed-budget best-arm identification problem, with simulations based on three real-world datasets, consistently showing accelerated learning when response times are incorporated.
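The reformulation of utility estimation as a linear regression over both choices and response times can be caricatured as follows. This is a deliberately crude proxy (using (2c-1)/RT as a preference-strength signal), not the paper's EZ-diffusion-based estimator; all names and the generative assumptions are hypothetical:

```python
import numpy as np

def utility_lstsq(X, choices, rts):
    """Toy utility estimator: treat (2c - 1) / rt as a signed strength signal
    (fast correct answers -> strong preference) and regress it on the
    feature differences X via ordinary least squares."""
    y = (2 * np.asarray(choices) - 1) / np.asarray(rts)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```

The point of the sketch is only that choices and response times together yield a scalar regression target per query, so standard least squares recovers a utility vector.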

[AI-6] Leveraging Object Priors for Point Tracking ECCV2024

链接: https://arxiv.org/abs/2409.05786
作者: Bikram Boote,Anh Thai,Wenqi Jia,Ozgur Kara,Stefan Stojanov,James M. Rehg,Sangmin Lee
关键词-EN: Point tracking, fundamental problem, problem in computer, computer vision, vision with numerous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: ECCV 2024 ILR Workshop

点击查看摘要

Abstract:Point tracking is a fundamental problem in computer vision with numerous applications in AR and robotics. A common failure mode in long-term point tracking occurs when the predicted point leaves the object it belongs to and lands on the background or another object. We identify this as the failure to correctly capture objectness properties in learning to track. To address this limitation of prior work, we propose a novel objectness regularization approach that guides points to be aware of object priors by forcing them to stay inside the boundaries of object instances. By capturing objectness cues at training time, we avoid the need to compute object masks during testing. In addition, we leverage contextual attention to enhance the feature representation for capturing objectness at the feature level more effectively. As a result, our approach achieves state-of-the-art performance on three point tracking benchmarks, and we further validate the effectiveness of our components via ablation studies. The source code is available at: this https URL

[AI-7] NeurLZ: On Systematically Enhancing Lossy Compression Performance for Scientific Data based on Neural Learning with Error Control

链接: https://arxiv.org/abs/2409.05785
作者: Wenqi Jia,Youyuan Liu,Zhewen Hu,Jinzhen Wang,Boyuan Zhang,Wei Niu,Junzhou Huang,Stavros Kalafatis,Sian Jin,Miao Yin
关键词-EN: pose significant challenges, Large-scale scientific simulations, Large-scale scientific, simulations generate massive, generate massive datasets
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large-scale scientific simulations generate massive datasets that pose significant challenges for storage and I/O. While traditional lossy compression techniques can improve performance, balancing compression ratio, data quality, and throughput remains difficult. To address this, we propose NeurLZ, a novel cross-field learning-based and error-controlled compression framework for scientific data. By integrating skipping DNN models, cross-field learning, and error control, our framework aims to substantially enhance lossy compression performance. Our contributions are three-fold: (1) We design a lightweight skipping model to provide high-fidelity detail retention, further improving prediction accuracy. (2) We adopt a cross-field learning approach to significantly improve data prediction accuracy, resulting in a substantially improved compression ratio. (3) We develop an error control approach to provide strict error bounds according to user requirements. We evaluated NeurLZ on several real-world HPC application datasets, including Nyx (cosmological simulation), Miranda (large turbulence simulation), and Hurricane (weather simulation). Experiments demonstrate that our framework achieves up to a 90% relative reduction in bit rate under the same data distortion, compared to the best existing approach.
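The error-control idea above (guaranteeing a user-specified bound on top of a learned predictor) is commonly implemented by quantizing prediction residuals; the sketch below shows that generic mechanism, not NeurLZ's specific design:

```python
import numpy as np

def error_bounded_correct(pred, data, abs_bound):
    """Generic error-bounded correction sketch: uniformly quantize the
    residual (data - pred) with step 2*abs_bound and store the integer
    codes; decompression computes pred + code*step, so the pointwise
    absolute error is at most abs_bound by construction."""
    step = 2 * abs_bound
    codes = np.round((data - pred) / step).astype(np.int64)
    recon = pred + codes * step
    return codes, recon
```

When the predictor is accurate, most codes are zero and compress extremely well, which is why pairing a learned predictor with residual quantization can raise compression ratios without violating the error bound.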

[AI-8] Creativity and Visual Communication from Machine to Musician: Sharing a Score through a Robotic Camera

链接: https://arxiv.org/abs/2409.05773
作者: Ross Greer,Laura Fleig,Shlomo Dubnov
关键词-EN: Guided Harmony, interaction by implementing, Guided, Harmony, paper explores
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper explores the integration of visual communication and musical interaction by implementing a robotic camera within a “Guided Harmony” musical game. We aim to examine co-creative behaviors between human musicians and robotic systems. Our research explores existing methodologies like improvisational game pieces and extends these concepts to include robotic participation using a PTZ camera. The robotic system interprets and responds to nonverbal cues from musicians, creating a collaborative and adaptive musical experience. This initial case study underscores the importance of intuitive visual communication channels. We also propose future research directions, including parameters for refining the visual cue toolkit and data collection methods to understand human-machine co-creativity further. Our findings contribute to the broader understanding of machine intelligence in augmenting human creativity, particularly in musical settings.

[AI-9] Evidence from fMRI Supports a Two-Phase Abstraction Process in Language Models NEURIPS

链接: https://arxiv.org/abs/2409.05771
作者: Emily Cheng,Richard J. Antonello
关键词-EN: hidden states extracted, predict measured brain, measured brain response, Research has repeatedly, intermediate hidden states
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Equal contribution from both authors. Submitted to NeurIPS NeuroAI workshop 2024

点击查看摘要

Abstract:Research has repeatedly demonstrated that intermediate hidden states extracted from large language models are able to predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most capable for this unique and highly general transfer task? In this work, we show that evidence from language encoding models in fMRI supports the existence of a two-phase abstraction process within LLMs. We use manifold learning methods to show that this abstraction process naturally arises over the course of training a language model and that the first “composition” phase of this abstraction process is compressed into fewer layers as training continues. Finally, we demonstrate a strong correspondence between layerwise encoding performance and the intrinsic dimensionality of representations from LLMs. We give initial evidence that this correspondence primarily derives from the inherent compositionality of LLMs and not their next-word prediction properties.

[AI-10] ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL

链接: https://arxiv.org/abs/2409.05749
作者: Safwen Naimi,Wassim Bouachir,Guillaume-Alexandre Bilodeau
关键词-EN: challenging task hindered, action recognition features, skeleton action recognition, generalizable skeleton action, large amounts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures, 6 tables

点击查看摘要

Abstract:To extract robust and generalizable skeleton action recognition features, large amounts of well-curated data are typically required, which is a challenging task hindered by annotation and computation costs. Therefore, unsupervised representation learning is of prime importance to leverage unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a lightweight convolutional transformer framework, named ReL-SAR, exploiting the complementarity of convolutional and attention layers for jointly modeling spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to ensure more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieved very competitive results on limited-size datasets: MCAD, IXMAS, JHMDB, and NW-UCLA, showing the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code including all implementation parameters is provided at: this https URL

[AI-11] A Novel Idea Generation Tool using a Structured Conversational AI (CAI) System

链接: https://arxiv.org/abs/2409.05747
作者: B. Sankar,Dibakar Sen
关键词-EN: conversational AI-enabled active, AI-enabled active ideation, assist novice designers, active ideation interface, commonly observed
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 21 pages, 16 figures, AIEDAM Journal Article

点击查看摘要

Abstract:This paper presents a novel conversational AI-enabled active ideation interface as a creative idea-generation tool to assist novice designers in mitigating the initial latency and ideation bottlenecks that are commonly observed. It is a dynamic, interactive, and contextually responsive approach, actively involving a large language model (LLM) from the domain of natural language processing (NLP) in artificial intelligence (AI) to produce multiple statements of potential ideas for different design problems. Integrating such AI models with ideation creates what we refer to as an Active Ideation scenario, which helps foster continuous dialogue-based interaction, context-sensitive conversation, and prolific idea generation. A pilot study was conducted with thirty novice designers to generate ideas for given problems using traditional methods and the new CAI-based interface. The key parameters of fluency, novelty, and variety were used to compare the outcomes qualitatively by a panel of experts. The findings demonstrated the effectiveness of the proposed tool for generating prolific, diverse and novel ideas. The interface was enhanced by incorporating a prompt-engineered structured dialogue style for each ideation stage to make it uniform and more convenient for the designers. The resulting responses of such a structured CAI interface were found to be more succinct and aligned towards the subsequent design stage, namely conceptualization. The paper thus established the rich potential of using Generative AI (Gen-AI) for the early ill-structured phase of the creative product design process.

[AI-12] A System and Benchmark for LLM-based QA on Heterogeneous Data

链接: https://arxiv.org/abs/2409.05735
作者: Achille Fokoue,Srideepika Jayaraman,Elham Khabiri,Jeffrey O. Kephart,Yingjie Li,Dhruv Shah,Youssef Drissi,Fenno F. Heath III,Anu Bhamidipaty,Fateh A. Tipu,Robert J.Baseman
关键词-EN: structured data sources, combinations thereof, found in structured, data source, data source heterogeneity
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as a spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn’t know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.

[AI-13] What Did My Car Say? Autonomous Vehicle Explanation Errors Context and Personal Traits Impact Comfort Reliance Satisfaction and Driving Confidence

链接: https://arxiv.org/abs/2409.05731
作者: Robert Kaufman,Aaron Broukhim,David Kirsh,Nadir Weibel
关键词-EN: autonomous vehicle, decisions may build, errors, driving, explanation errors
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 23 pages, 4 figures

点击查看摘要

Abstract:Explanations for autonomous vehicle (AV) decisions may build trust, however, explanations can contain errors. In a simulated driving study (n = 232), we tested how AV explanation errors, driving context characteristics (perceived harm and driving difficulty), and personal traits (prior trust and expertise) affected a passenger’s comfort in relying on an AV, preference for control, confidence in the AV’s ability, and explanation satisfaction. Errors negatively affected all outcomes. Surprisingly, despite identical driving, explanation errors reduced ratings of the AV’s driving ability. Severity and potential harm amplified the negative impact of errors. Contextual harm and driving difficulty directly impacted outcome ratings and influenced the relationship between errors and outcomes. Prior trust and expertise were positively associated with outcome ratings. Results emphasize the need for accurate, contextually adaptive, and personalized AV explanations to foster trust, reliance, satisfaction, and confidence. We conclude with design, research, and deployment recommendations for trustworthy AV explanation systems.

[AI-14] Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding

链接: https://arxiv.org/abs/2409.05721
作者: Bram Willemsen,Gabriel Skantze
关键词-EN: referring expression generation, produce referring expressions, visually grounded dialogue, expression generation, referring expression
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication at INLG 2024

点击查看摘要

Abstract:We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.
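The second stage of the pipeline is a generate-and-rerank step: candidate referring expressions are reordered by how strongly they pick out the intended referent. The sketch below assumes a toy attribute-overlap scorer in place of the paper's discourse-aware comprehension model; `score_fn`, the candidate strings, and the target attribute set are all illustrative.

```python
def rerank_by_discrimination(candidates, score_fn, target):
    """Generate-and-rerank (simplified): sort candidate referring expressions
    by their discriminatory power for the target referent, highest first.
    score_fn is a hypothetical stand-in for a comprehension model."""
    return sorted(candidates, key=lambda c: score_fn(c, target), reverse=True)

# Toy scorer: more overlap with the referent's attributes scores higher.
target = {"red", "mug", "left"}
score = lambda c, t: len(set(c.split()) & t)
ranked = rerank_by_discrimination(["the mug", "the red mug on the left"],
                                  score, target)
```

In the paper the candidates come from the autoregressive REG model and the scores from discourse-dependent comprehension; only the sort-by-score skeleton is shown here.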

[AI-15] pFedGPA: Diffusion-based Generative Parameter Aggregation for Personalized Federated Learning

链接: https://arxiv.org/abs/2409.05701
作者: Jiahao Lai,Jiaqi Li,Jian Xu,Yanru Wu,Boshi Tang,Siqi Chen,Yongfeng Huang,Wenbo Ding,Yang Li
关键词-EN: Federated Learning, Federated Averaging, data remains local, offers a decentralized, model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) offers a decentralized approach to model training, where data remains local and only model parameters are shared between the clients and the central server. Traditional methods, such as Federated Averaging (FedAvg), linearly aggregate these parameters which are usually trained on heterogeneous data distributions, potentially overlooking the complex, high-dimensional nature of the parameter space. This can result in degraded performance of the aggregated model. While personalized FL approaches can mitigate the heterogeneous data issue to some extent, the limitation of linear aggregation remains unresolved. To alleviate this issue, we investigate the generative approach of diffusion model and propose a novel generative parameter aggregation framework for personalized FL, pFedGPA. In this framework, we deploy a diffusion model on the server to integrate the diverse parameter distributions and propose a parameter inversion method to efficiently generate a set of personalized parameters for each client. This inversion method transforms the uploaded parameters into a latent code, which is then aggregated through denoising sampling to produce the final personalized parameters. By encoding the dependence of a client’s model parameters on the specific data distribution using the high-capacity diffusion model, pFedGPA can effectively decouple the complexity of the overall distribution of all clients’ model parameters from the complexity of each individual client’s parameter distribution. Our experimental results consistently demonstrate the superior performance of the proposed method across multiple datasets, surpassing baseline approaches.

[AI-16] MANA-Net: Mitigating Aggregated Sentiment Homogenization with News Weighting for Enhanced Market Prediction CIKM24

链接: https://arxiv.org/abs/2409.05698
作者: Mengyu Wang,Tiejun Ma
关键词-EN: widely acknowledged, acknowledged that extracting, Aggregated Sentiment Homogenization, extracting market sentiments, data benefits market
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
*备注: Accepted by CIKM 24

点击查看摘要

Abstract:It is widely acknowledged that extracting market sentiments from news data benefits market predictions. However, existing methods of using financial sentiments remain simplistic, relying on equal-weight and static aggregation to manage sentiments from multiple news items. This leads to a critical issue termed "Aggregated Sentiment Homogenization", which has been explored through our analysis of a large financial news dataset from industry practice. This phenomenon occurs when aggregating numerous sentiments, causing representations to converge towards the mean values of sentiment distributions and thereby smoothing out unique and important information. Consequently, the aggregated sentiment representations lose much of the predictive value of news data. To address this problem, we introduce the Market Attention-weighted News Aggregation Network (MANA-Net), a novel method that leverages a dynamic market-news attention mechanism to aggregate news sentiments for market prediction. MANA-Net learns the relevance of news sentiments to price changes and assigns varying weights to individual news items. By integrating the news aggregation step into the networks for market prediction, MANA-Net allows for trainable sentiment representations that are optimized directly for prediction. We evaluate MANA-Net using the S&P 500 and NASDAQ 100 indices, along with financial news spanning from 2003 to 2018. Experimental results demonstrate that MANA-Net outperforms various recent market prediction methods, improving Profit & Loss by 1.1% and the daily Sharpe ratio by 0.252.
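The core idea, replacing equal-weight averaging with relevance-weighted attention so a single strong signal is not smoothed away, can be sketched as below. The relevance scores would be learned end-to-end in MANA-Net; here they are given directly, and the function names are ours.

```python
import numpy as np

def aggregate_sentiments(sentiments, relevance_scores):
    """Attention-weighted aggregation of per-news sentiment vectors.

    sentiments: (n_news, d) array of sentiment representations.
    relevance_scores: (n_news,) relevance of each item to price moves.
    Returns a (d,) aggregated representation.
    """
    # Softmax over relevance gives each news item its own weight,
    # instead of the equal 1/n weighting that homogenizes the signal.
    w = np.exp(relevance_scores - relevance_scores.max())
    w /= w.sum()
    return w @ sentiments

# One strongly positive, highly relevant item among neutral ones:
s = np.array([[0.9], [-0.1], [0.0]])   # sentiment of each news item
r = np.array([3.0, 0.0, 0.0])          # the positive item is most relevant
equal = s.mean(axis=0)                 # static equal-weight baseline
weighted = aggregate_sentiments(s, r)  # attention-weighted aggregate
```

The weighted aggregate stays close to the dominant relevant signal, whereas the equal-weight mean pulls it toward zero, which is exactly the homogenization effect the paper describes.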

[AI-17] RegNLP in Action: Facilitating Compliance Through Automated Information Retrieval and Answer Generation

链接: https://arxiv.org/abs/2409.05677
作者: Tuba Gokhan,Kexin Wang,Iryna Gurevych,Ted Briscoe
关键词-EN: governmental regulatory bodies, Natural Language Processing, compliance.Regulatory Natural Language, Regulatory Information Retrieval, issued by governmental
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Regulatory documents, issued by governmental regulatory bodies, establish rules, guidelines, and standards that organizations must adhere to for legal compliance. These documents, characterized by their length, complexity and frequent updates, are challenging to interpret, requiring significant allocation of time and expertise on the part of organizations to ensure ongoing compliance. Regulatory Natural Language Processing (RegNLP) is a multidisciplinary subfield aimed at simplifying access to and interpretation of regulatory rules and obligations. We define an Automated Question-Passage Generation task for RegNLP, create the ObliQA dataset containing 27,869 questions derived from the Abu Dhabi Global Markets (ADGM) financial regulation document collection, design a baseline Regulatory Information Retrieval and Answer Generation system, and evaluate it with RePASs, a novel evaluation metric that tests whether generated answers accurately capture all relevant obligations and avoid contradictions.
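The retrieval half of such a question-answering system can be illustrated with a minimal lexical baseline: rank regulatory passages by word overlap with the question. This is a sketch of the general retrieval step only, not the paper's baseline system, and the example passages are invented.

```python
def retrieve_passages(question, passages, k=1):
    """Minimal lexical retriever: rank passages by word overlap with the
    question and return the top k. A baseline sketch; real systems use
    dense or learned-sparse retrieval."""
    q = set(question.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

# Hypothetical regulatory passages:
passages = [
    "Authorised firms must report suspicious transactions promptly.",
    "The fee schedule is published annually.",
]
top = retrieve_passages("When must firms report suspicious transactions?",
                        passages)
```

The retrieved passages would then feed an answer-generation model, whose output RePASs scores for obligation coverage and contradiction.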

[AI-18] Evaluation of real-time transcriptions using end-to-end ASR models

链接: https://arxiv.org/abs/2409.05674
作者: Carlos Arriaga,Alejandro Pozo,Javier Conde,Alvaro Alonso
关键词-EN: Automatic Speech Recognition, Automatic Speech, Speech Recognition, greatly evolved, ASR
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly evolved in the last few years. Traditional architectures based on pipelines have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new AI training methods, such as weak-supervised learning have reduced the need for high-quality audio datasets for model training. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To achieve real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at any point, as dividing an utterance into two separate fragments will generate an incorrect transcription. Also, shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both the quality of the transcription and the end-to-end delay. The algorithms are fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same model, without audio fragmentation, to determine the effects of this division. The results show that VAD fragmentation provides the best quality at the cost of the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in WER for a 1.5-2 s reduction in delay relative to VAD splitting.
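The simplest of the three strategies, fragmentation at fixed intervals, can be sketched directly; the function name and the 1-second fragment length below are our own choices, not values from the paper.

```python
def split_fixed_intervals(samples, sample_rate, fragment_seconds):
    """Split an audio signal into fixed-length fragments for streaming ASR.

    Shorter fragments lower latency but may cut an utterance mid-word,
    which is exactly the quality/delay tradeoff the paper measures.
    """
    size = int(sample_rate * fragment_seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# 2.5 s of 16 kHz audio split into 1 s fragments -> 3 fragments,
# the last one shorter than the rest.
audio = [0.0] * 40000
frags = split_fixed_intervals(audio, 16000, 1.0)
```

VAD-based splitting would instead place boundaries at detected silences, and the feedback variant adjusts boundaries using the transcription of the previous fragment.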

[AI-19] Zero-shot Outlier Detection via Prior-data Fitted Networks: Model Selection Bygone!

链接: https://arxiv.org/abs/2409.05672
作者: Yuchen Shen,Haomin Wen,Leman Akoglu
关键词-EN: finds numerous applications, environmental monitoring, finds numerous, numerous applications, applications in environmental
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: preprint

点击查看摘要

Abstract:Outlier detection (OD) has a vast literature as it finds numerous applications in environmental monitoring, cybersecurity, finance, and medicine to name a few. Being an inherently unsupervised task, model selection is a key bottleneck for OD (both algorithm and hyperparameter selection) without label supervision. There is a long list of techniques to choose from – both classical algorithms and deep neural architectures – and while several studies report their hyperparameter sensitivity, the literature is quite slim on unsupervised model selection – limiting the effective use of OD in practice. In this paper we present FoMo-0D, for zero/0-shot OD exploring a transformative new direction that bypasses the hurdle of model selection altogether (!), thus breaking new ground. The fundamental idea behind FoMo-0D is the Prior-data Fitted Networks, recently introduced by Müller et al. (2022), which trains a Transformer model on a large body of synthetically generated data from a prior data distribution. In essence, FoMo-0D is a pretrained Foundation Model for zero/0-shot OD on tabular data, which can directly predict the (outlier/inlier) label of any test data at inference time, by merely a single forward pass – making obsolete the need for choosing an algorithm/architecture, tuning its associated hyperparameters, and even training any model parameters when given a new OD dataset. Extensive experiments on 57 public benchmark datasets against 26 baseline methods show that FoMo-0D performs statistically no different from the top 2nd baseline, while significantly outperforming the majority of the baselines, with an average inference time of 7.7 ms per test sample.

[AI-20] Real-Time Human Action Recognition on Embedded Platforms

链接: https://arxiv.org/abs/2409.05662
作者: Ruiqi Wang,Zichen Wang,Peiqi Gao,Mingzhen Li,Jaehwan Jeong,Yihang Xu,Yejin Lee,Lisa Connor,Chenyang Lu
关键词-EN: video-based human action, human action recognition, motion feature extractor, video-based human, Integrated Motion Feature
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With advancements in computer vision and deep learning, video-based human action recognition (HAR) has become practical. However, due to the complexity of the computation pipeline, running HAR on live video streams incurs excessive delays on embedded platforms. This work tackles the real-time performance challenges of HAR with four contributions: 1) an experimental study identifying a standard Optical Flow (OF) extraction technique as the latency bottleneck in a state-of-the-art HAR pipeline, 2) an exploration of the latency-accuracy tradeoff between the standard and deep learning approaches to OF extraction, which highlights the need for a novel, efficient motion feature extractor, 3) the design of Integrated Motion Feature Extractor (IMFE), a novel single-shot neural network architecture for motion feature extraction with drastic improvement in latency, 4) the development of RT-HARE, a real-time HAR system tailored for embedded platforms. Experimental results on an Nvidia Jetson Xavier NX platform demonstrated that RT-HARE realizes real-time HAR at a video frame rate of 30 frames per second while delivering high levels of recognition accuracy.

[AI-21] Interactive incremental learning of generalizable skills with local trajectory modulation

链接: https://arxiv.org/abs/2409.05655
作者: Markus Knauer,Alin Albu-Schäffer,Freek Stulp,João Silvério
关键词-EN: received considerable attention, received considerable, movement primitives, approaches have emerged, considerable attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 21 pages, 16 figures

点击查看摘要

Abstract:The problem of generalization in learning from demonstration (LfD) has received considerable attention over the years, particularly within the context of movement primitives, where a number of approaches have emerged. Recently, two important approaches have gained recognition. While one leverages via-points to adapt skills locally by modulating demonstrated trajectories, another relies on so-called task-parameterized models that encode movements with respect to different coordinate systems, using a product of probabilities for generalization. While the former are well-suited to precise, local modulations, the latter aim at generalizing over large regions of the workspace and often involve multiple objects. Addressing the quality of generalization by leveraging both approaches simultaneously has received little attention. In this work, we propose an interactive imitation learning framework that simultaneously leverages local and global modulations of trajectory distributions. Building on the kernelized movement primitives (KMP) framework, we introduce novel mechanisms for skill modulation from direct human corrective feedback. Our approach particularly exploits the concept of via-points to incrementally and interactively 1) improve the model accuracy locally, 2) add new objects to the task during execution and 3) extend the skill into regions where demonstrations were not provided. We evaluate our method on a bearing ring-loading task using a torque-controlled, 7-DoF, DLR SARA robot.

[AI-22] Replay Consolidation with Label Propagation for Continual Object Detection

链接: https://arxiv.org/abs/2409.05650
作者: Riccardo De Monte,Davide Dalle Pezze,Marina Ceccon,Francesco Pasti,Francesco Paissan,Elisabetta Farella,Gian Antonio Susto,Nicola Bellotto
关键词-EN: highly relevant computer, relevant computer vision, computer vision problem, Object Detection, Continual Learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Object Detection is a highly relevant computer vision problem with many applications such as robotics and autonomous driving. Continual Learning (CL) considers a setting where a model incrementally learns new information while retaining previously acquired knowledge. This is particularly challenging since Deep Learning models tend to catastrophically forget old knowledge while training on new data. In particular, Continual Learning for Object Detection (CLOD) poses additional difficulties compared to CL for Classification. In CLOD, images from previous tasks may contain unknown classes that could reappear labeled in future tasks. These missing annotations cause task interference issues for replay-based approaches. As a result, most works in the literature have focused on distillation-based approaches. However, these approaches are effective only when there is a strong overlap of classes across tasks. To address the issues of current methodologies, we propose a novel technique to solve CLOD called Replay Consolidation with Label Propagation for Object Detection (RCLPOD). Based on the replay method, our solution avoids task interference issues by enhancing the buffer memory samples. Our method is evaluated against existing techniques in CLOD literature, demonstrating its superior performance on established benchmarks like VOC and COCO.
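One way to "enhance the buffer memory samples" is to propagate labels for newly learned classes back onto stored images, so old replay samples no longer carry missing annotations. The sketch below shows that consolidation idea in its simplest form; the data structures and the exact merging rule are our assumptions, not RCLPOD's implementation.

```python
def consolidate_labels(buffer_boxes, predicted_boxes, known_classes):
    """Label propagation onto a replay sample (simplified): merge the
    current model's predictions for known classes into the stored
    annotations, skipping duplicates."""
    merged = list(buffer_boxes)
    for box in predicted_boxes:
        if box["cls"] in known_classes and box not in merged:
            merged.append(box)
    return merged

# A stored image was annotated only with "car" in task 1; after learning
# "person" in task 2, the model's prediction fills the missing label.
stored = [{"cls": "car", "xyxy": (0, 0, 10, 10)}]
preds = [{"cls": "person", "xyxy": (5, 5, 8, 9)},
         {"cls": "car", "xyxy": (0, 0, 10, 10)}]
merged = consolidate_labels(stored, preds, {"car", "person"})
```

A real implementation would additionally filter predictions by confidence and IoU against existing boxes before merging.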

[AI-23] 3D-SAR Tomography and Machine Learning for High-Resolution Tree Height Estimation

链接: https://arxiv.org/abs/2409.05636
作者: Grace Colverd,Jumpei Takami,Laura Schade,Karol Bot,Joseph A. Gallego-Mejia
关键词-EN: Accurately estimating forest, Synthetic Aperture Radar, Accurately estimating, climate change mitigation, change mitigation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurately estimating forest biomass is crucial for global carbon cycle modelling and climate change mitigation. Tree height, a key factor in biomass calculations, can be measured using Synthetic Aperture Radar (SAR) technology. This study applies machine learning to extract forest height data from two SAR products: Single Look Complex (SLC) images and tomographic cubes, in preparation for the ESA Biomass Satellite mission. We use the TomoSense dataset, containing SAR and LiDAR data from Germany’s Eifel National Park, to develop and evaluate height estimation models. Our approach includes classical methods, deep learning with a 3D U-Net, and Bayesian-optimized techniques. By testing various SAR frequencies and polarimetries, we establish a baseline for future height and biomass modelling. The best-performing models predict forest height with a mean absolute error of 2.82 m for canopies around 30 m, advancing our ability to measure global carbon stocks and support climate action.
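The headline metric, mean absolute error against LiDAR reference heights, is simple enough to spell out; the toy height values below are invented for illustration.

```python
def mean_absolute_error(predicted, observed):
    """Mean absolute error (in metres here): the metric behind the paper's
    2.82 m result, scored against LiDAR reference canopy heights."""
    assert len(predicted) == len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical model heights vs. LiDAR heights for three canopy cells:
mae = mean_absolute_error([28.0, 31.0, 25.0], [30.0, 30.0, 27.0])
```

MAE is preferred over RMSE here when occasional large canopy errors should not dominate the score, though the paper's exact evaluation protocol may differ.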

[AI-24] Joint Input and Output Coordination for Class-Incremental Learning IJCAI2024

链接: https://arxiv.org/abs/2409.05620
作者: Shuai Wang,Yibing Zhan,Yong Luo,Han Hu,Wei Yu,Yonggang Wen,Dacheng Tao
关键词-EN: severe catastrophic forgetting, Incremental learning, catastrophic forgetting, nontrivial due, due to severe
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 4 figures. Accepted by IJCAI 2024

点击查看摘要

Abstract:Incremental learning is nontrivial due to severe catastrophic forgetting. Although storing a small amount of data on old tasks during incremental learning is a feasible solution, current strategies still fail to 1) adequately address the class bias problem, 2) alleviate the mutual interference between new and old tasks, and 3) consider the problem of class bias within tasks. This motivates us to propose a joint input and output coordination (JIOC) mechanism to address these issues. This mechanism assigns different weights to different categories of data according to the gradient of the output score, and uses knowledge distillation (KD) to reduce the mutual interference between the outputs of old and new tasks. The proposed mechanism is general and flexible, and can be incorporated into different incremental learning approaches that use memory storage. Extensive experiments show that our mechanism can significantly improve their performance.
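The knowledge-distillation component mentioned above is standard: penalize divergence between the new model's outputs and the old model's outputs at a softened temperature. A minimal NumPy sketch (our own formulation of the generic KD loss, not JIOC's full weighted objective):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Knowledge-distillation loss: soft cross-entropy between teacher and
    student distributions at temperature T, scaled by T^2. Used in
    incremental learning to keep old-task outputs stable."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    return -np.sum(p_t * np.log(p_s + 1e-12)) * T * T

# Matching the old model's outputs incurs only the irreducible entropy term;
# drifting away from them is penalized more heavily.
identical = kd_loss(np.array([2.0, 0.0]), np.array([2.0, 0.0]))
drifted = kd_loss(np.array([0.0, 2.0]), np.array([2.0, 0.0]))
```

In JIOC this distillation term would be combined with the gradient-based per-category input weights; that weighting is not reproduced here.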

[AI-25] Adapted-MoE: Mixture of Experts with Test-Time Adaption for Anomaly Detection

链接: https://arxiv.org/abs/2409.05611
作者: Tianwu Lei,Silin Chen,Bohan Wang,Zhengkai Jiang,Ningmu Zou
关键词-EN: made remarkable progress, unsupervised anomaly detection, recently made remarkable, anomaly detection methods, detection methods based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Most unsupervised anomaly detection methods, which distinguish anomalies based on representations of normal samples, have recently made remarkable progress. However, existing methods learn only a single decision boundary for distinguishing the samples within the training dataset, neglecting the variation in feature distribution of normal samples, even within the same category, in the real world. Furthermore, they do not consider the distribution bias that still exists between the test set and the training set. Therefore, we propose Adapted-MoE, which contains a routing network and a series of expert models to handle multiple distributions of same-category samples by divide and conquer. Specifically, we propose a routing network based on representation learning to route same-category samples into subclass feature spaces. Then, a series of expert models are utilized to learn the representations of various normal samples and construct several independent decision boundaries. We propose test-time adaption to eliminate the bias between the unseen test sample representation and the feature distribution learned by the expert model. Our experiments are conducted on a dataset that provides multiple subclasses from three categories, namely the Texture AD benchmark. Adapted-MoE significantly improves the performance of the baseline model, achieving increases of 2.18%-7.20% in I-AUROC and 1.57%-16.30% in P-AUROC, which outperforms the current state-of-the-art methods. Our code is available at this https URL.
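The routing step, sending each sample to the expert responsible for its subclass, can be illustrated with a nearest-centroid rule. This is a simplified stand-in for the learned routing network; the centroid values and function name are our own.

```python
import numpy as np

def route_to_expert(feature, centroids):
    """Route a sample to the expert whose subclass centroid is nearest in
    feature space. A toy proxy for Adapted-MoE's learned routing network."""
    dists = np.linalg.norm(centroids - feature, axis=1)
    return int(np.argmin(dists))

# Two subclass centres of the same category, e.g. two texture variants:
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
idx = route_to_expert(np.array([9.0, 11.0]), centroids)
```

Each expert then maintains its own decision boundary, and the test-time adaption step would shift the test feature toward the expert's learned distribution before scoring.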

[AI-26] SynMorph: Generating Synthetic Face Morphing Dataset with Mated Samples

链接: https://arxiv.org/abs/2409.05595
作者: Haoyu Zhang,Raghavendra Ramachandra,Kiran Raja,Christoph Busch
关键词-EN: synthetic face morphing, morphing attack detection, face morphing dataset, face recognition systems, proposed synthetic face
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Face morphing attack detection (MAD) algorithms have become essential to overcome the vulnerability of face recognition systems. To solve the lack of large-scale and public-available datasets due to privacy concerns and restrictions, in this work we propose a new method to generate a synthetic face morphing dataset with 2450 identities and more than 100k morphs. The proposed synthetic face morphing dataset is unique for its high-quality samples, different types of morphing algorithms, and the generalization for both single and differential morphing attack detection algorithms. For experiments, we apply face image quality assessment and vulnerability analysis to evaluate the proposed synthetic face morphing dataset from the perspective of biometric sample quality and morphing attack potential on face recognition systems. The results are benchmarked against an existing SOTA synthetic dataset and a representative non-synthetic dataset, and indicate an improvement compared with the SOTA. Additionally, we design different protocols and study the applicability of using the proposed synthetic dataset on training morphing attack detection algorithms.

[AI-27] ExDDI: Explaining Drug-Drug Interaction Predictions with Natural Language

链接: https://arxiv.org/abs/2409.05592
作者: Zhaoyue Sun,Jiazheng Li,Gabriele Pergola,Yulan He
关键词-EN: improving medication safety, unknown drug-drug interactions, Predicting unknown drug-drug, predicting DDI categories, drug-drug interactions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 17 pages, 4 figures

点击查看摘要

Abstract:Predicting unknown drug-drug interactions (DDIs) is crucial for improving medication safety. Previous efforts in DDI prediction have typically focused on binary classification or predicting DDI categories, with the absence of explanatory insights that could enhance trust in these predictions. In this work, we propose to generate natural language explanations for DDI predictions, enabling the model to reveal the underlying pharmacodynamics and pharmacokinetics mechanisms while making the prediction. To do this, we have collected DDI explanations from DDInter and DrugBank and developed various models for extensive experiments and analysis. Our models can provide accurate explanations for unknown DDIs between known drugs. This paper contributes new tools to the field of DDI prediction and lays a solid foundation for further research on generating explanations for DDI predictions.

[AI-28] MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery

链接: https://arxiv.org/abs/2409.05591
作者: Hongjin Qian,Peitian Zhang,Zheng Liu,Kelong Mao,Zhicheng Dou
关键词-EN: large language models, access external databases, language models, optimized context, access external
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Codes and models are in this https URL

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) leverages retrieval tools to access external databases, thereby enhancing the generation quality of large language models (LLMs) through optimized context. However, the existing retrieval methods are constrained inherently, as they can only perform relevance matching between explicitly stated queries and well-formed knowledge, but are unable to handle tasks involving ambiguous information needs or unstructured knowledge. Consequently, existing RAG systems are primarily effective for straightforward question-answering tasks. In this work, we propose MemoRAG, a novel retrieval-augmented generation paradigm empowered by long-term memory. MemoRAG adopts a dual-system architecture. On the one hand, it employs a light but long-range LLM to form the global memory of the database. Once a task is presented, it generates draft answers, cluing the retrieval tools to locate useful information within the database. On the other hand, it leverages an expensive but expressive LLM, which generates the ultimate answer based on the retrieved information. Building on this general framework, we further optimize MemoRAG’s performance by enhancing its cluing mechanism and memorization capacity. In our experiment, MemoRAG achieves superior performance across a variety of evaluation tasks, including both complex ones where conventional RAG fails and straightforward ones where RAG is commonly applied.
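The dual-system flow, draft clues with a light memory model, retrieve with the clues, answer with an expressive model, can be sketched as a three-call pipeline. All callables below are hypothetical stand-ins (toy lambdas), not MemoRAG's actual components.

```python
def memorag_answer(task, memory_model, retriever, expressive_model, db):
    """Simplified MemoRAG-style pipeline: a light long-range model drafts
    retrieval clues from its global memory, the clues drive retrieval,
    and an expressive model writes the final answer from the evidence."""
    clues = memory_model(task)               # stage 1: draft answer / clues
    evidence = retriever(clues, db)          # stage 2: clue-guided retrieval
    return expressive_model(task, evidence)  # stage 3: grounded final answer

# Toy stand-ins: keyword drafting and substring retrieval.
db = ["the warranty lasts two years", "shipping takes five days"]
draft = lambda task: "warranty"
retrieve = lambda clue, docs: [d for d in docs if clue in d]
answer = lambda task, ev: ev[0] if ev else "not found"
result = memorag_answer("How long is the warranty?", draft, retrieve, answer, db)
```

The point of the architecture is that the clue ("warranty") can be generated even when the task itself would not lexically match the evidence, which is where conventional query-to-document matching fails.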

[AI-29] Interpretable Responsibility Sharing as a Heuristic for Task and Motion Planning

链接: https://arxiv.org/abs/2409.05586
作者: Arda Sarp Yenicesu,Sepehr Nourmohammadi,Berk Cicek,Ozgur S. Oguz
关键词-EN: Interpretable Responsibility Sharing, named Interpretable Responsibility, named Interpretable, Responsibility Sharing, Interpretable Responsibility
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This article introduces a novel heuristic for Task and Motion Planning (TAMP) named Interpretable Responsibility Sharing (IRS), which enhances planning efficiency in domestic robots by leveraging human-constructed environments and inherent biases. Utilizing auxiliary objects (e.g., trays and pitchers), which are commonly found in household settings, IRS systematically incorporates these elements to simplify and optimize task execution. The heuristic is rooted in the novel concept of Responsibility Sharing (RS), where auxiliary objects share the task’s responsibility with the embodied agent, dividing complex tasks into manageable sub-problems. This division not only reflects human usage patterns but also aids robots in navigating and manipulating within human spaces more effectively. By integrating Optimized Rule Synthesis (ORS) for decision-making, IRS ensures that the use of auxiliary objects is both strategic and context-aware, thereby improving the interpretability and effectiveness of robotic planning. Experiments conducted across various household tasks demonstrate that IRS significantly outperforms traditional methods by reducing the effort required in task execution and enhancing the overall decision-making process. This approach not only aligns with human intuitive methods but also offers a scalable solution adaptable to diverse domestic environments. Code is available at this https URL.

[AI-30] Latent 3D Brain MRI Counterfactual

链接: https://arxiv.org/abs/2409.05585
作者: Wei Peng,Tian Xia,Fabio De Sousa Ribeiro,Tomas Bosschieter,Ehsan Adeli,Qingyu Zhao,Ben Glocker,Kilian M. Pohl
关键词-EN: properly train deep, train deep learning, deep learning models, brain MRI studies, number of samples
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The number of samples in structural brain MRI studies is often too small to properly train deep learning models. Generative models show promise in addressing this issue by effectively learning the data distribution and generating high-fidelity MRI. However, they struggle to produce diverse, high-quality data outside the distribution defined by the training data. One way to address this is to use causal models developed for 3D volume counterfactuals. However, accurately modeling causality in high-dimensional spaces remains a challenge, so these models generally generate 3D brain MRIs of lower quality. To address these challenges, we propose a two-stage method that constructs a Structural Causal Model (SCM) within the latent space. In the first stage, we employ a VQ-VAE to learn a compact embedding of the MRI volume. Subsequently, we integrate our causal model into this latent space and execute a three-step counterfactual procedure using a closed-form Generalized Linear Model (GLM). Our experiments conducted on real-world high-resolution MRI data (1mm) demonstrate that our method can generate high-quality 3D MRI counterfactuals.
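The three-step counterfactual procedure (abduction, action, prediction) with a linear, GLM-style latent model can be shown in a few lines. This is a toy scalar-attribute version under a linear-model assumption, not the paper's closed-form GLM or full SCM; the attribute "age" and all values are illustrative.

```python
import numpy as np

def linear_counterfactual(z, age, beta, new_age):
    """Three-step counterfactual with a linear latent model z = beta*age + u:
    abduction recovers the exogenous residual u, action sets the attribute,
    prediction re-applies the model with the new attribute value."""
    noise = z - beta * age          # abduction: residual not explained by age
    return beta * new_age + noise   # action + prediction

beta = np.array([0.5, -0.2])        # hypothetical age effect in latent space
z = np.array([30.0, 4.0])           # latent code of an observed scan
same = linear_counterfactual(z, 60.0, beta, 60.0)    # no intervention
older = linear_counterfactual(z, 60.0, beta, 70.0)   # "had the subject been 70"
```

Setting the attribute to its observed value recovers the original latent code exactly, a basic sanity check for any counterfactual procedure; the decoded counterfactual volume would then come from the VQ-VAE decoder.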

[AI-31] Learning to Model Graph Structural Information on MLPs via Graph Structure Self-Contrasting

链接: https://arxiv.org/abs/2409.05573
作者: Lirong Wu,Haitao Lin,Guojiang Zhao,Cheng Tan,Stan Z. Li
关键词-EN: Graph Neural Networks, Neural Networks, witnessed great success, handling graph-related tasks, Recent years
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent years have witnessed great success in handling graph-related tasks with Graph Neural Networks (GNNs). However, most existing GNNs are based on message passing to perform feature aggregation and transformation, where the structural information is explicitly involved in the forward propagation by coupling with node features through graph convolution at each layer. As a result, subtle feature noise or structure perturbation may cause severe error propagation, resulting in extremely poor robustness. In this paper, we rethink the roles played by graph structural information in graph data training and identify that message passing is not the only path to modeling structural information. Inspired by this, we propose a simple but effective Graph Structure Self-Contrasting (GSSC) framework that learns graph structural information without message passing. The proposed framework is based purely on Multi-Layer Perceptrons (MLPs), where the structural information is only implicitly incorporated as prior knowledge to guide the computation of supervision signals, substituting the explicit message propagation as in GNNs. Specifically, it first applies structural sparsification to remove potentially uninformative or noisy edges in the neighborhood, and then performs structural self-contrasting in the sparsified neighborhood to learn robust node representations. Finally, structural sparsification and self-contrasting are formulated as a bi-level optimization problem and solved in a unified framework. Extensive experiments have qualitatively and quantitatively demonstrated that the GSSC framework can produce truly encouraging performance with better generalization and robustness than other leading competitors.
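The structural sparsification step, dropping potentially noisy edges before self-contrasting, can be approximated by pruning edges whose endpoint features are dissimilar. The cosine-similarity rule and threshold below are our simplification of whatever criterion GSSC actually learns.

```python
import numpy as np

def sparsify_edges(adj, features, keep_threshold=0.5):
    """Structural sparsification (simplified): keep only neighborhood edges
    whose endpoint features have cosine similarity >= keep_threshold,
    pruning likely-uninformative or noisy connections."""
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = norm @ norm.T
    keep = (sim >= keep_threshold) & (adj > 0)
    return keep.astype(int)

# Node 0 is linked to a similar node (1) and a dissimilar one (2):
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
feats = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
sparse = sparsify_edges(adj, feats)
```

In the full framework the sparsified neighborhood then defines the positive pairs for structural self-contrasting, with sparsification and contrasting optimized jointly.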

[AI-32] On the Convergence of Sigmoid and tanh Fuzzy General Grey Cognitive Maps

链接: https://arxiv.org/abs/2409.05565
作者: Xudong Gao,Xiao Guang Gao,Jia Rong,Ni Li,Yifeng Niu,Jun Chen
关键词-EN: Fuzzy Cognitive Map, Grey Cognitive Map, Fuzzy Grey Cognitive, Cognitive Map, Fuzzy Cognitive
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fuzzy General Grey Cognitive Map (FGGCM) and Fuzzy Grey Cognitive Map (FGCM) are extensions of the Fuzzy Cognitive Map (FCM) in terms of uncertainty. FGGCM allows for the processing of general grey numbers with multiple intervals, enabling FCM to better address uncertain situations. Although the convergence of FCM and FGCM has been discussed in much of the literature, the convergence of FGGCM has not been thoroughly explored. This paper aims to fill this research gap. First, metrics for the general grey number space and its vector space are given and proved using the Minkowski inequality. By utilizing the fact that Cauchy sequences are convergent, the completeness of these two spaces is demonstrated. On this premise, using the Banach fixed point theorem and the Browder-Gohde-Kirk fixed point theorem, combined with Lagrange’s mean value theorem and Cauchy’s inequality, we deduce the sufficient conditions for FGGCM to converge to a unique fixed point when using tanh and sigmoid functions as activation functions. The sufficient conditions for the kernels and greyness of FGGCM to converge to a unique fixed point are also provided separately. Finally, based on Web Experience and Civil Engineering FCMs, we design corresponding FGGCMs with sigmoid and tanh as activation functions by modifying the weights to general grey numbers. By comparing with the convergence theorems of FCM and FGCM, the effectiveness of the theorems proposed in this paper is verified. It is also demonstrated that the convergence theorems of FCM are special cases of the theorems proposed in this paper. The study of the convergence of FGGCM is of great significance for guiding the learning algorithm of FGGCM, which is needed for designing FGGCMs with specific fixed points, and lays a solid theoretical foundation for the application of FGGCM in fields such as control, prediction, and decision support systems.
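For intuition, the Banach-fixed-point argument in the special case of a plain FCM (no greyness) reduces to a Lipschitz condition on the activation; the paper's theorems generalize this to the kernels and greyness of general grey numbers:

```latex
x^{(t+1)} = \sigma\!\bigl(W x^{(t)}\bigr), \qquad
\bigl\|\sigma(Wu) - \sigma(Wv)\bigr\| \le L_\sigma \,\|W\|\, \|u - v\|
```

where $L_\sigma = 1/4$ for the sigmoid and $L_\sigma = 1$ for tanh; when $L_\sigma \|W\| < 1$, the update map is a contraction and therefore converges to a unique fixed point.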

[AI-33] LEROjD: Lidar Extended Radar-Only Object Detection ECCV2024

链接: https://arxiv.org/abs/2409.05564
作者: Patrick Palmer,Martin Krüger,Stefan Schütte,Richard Altendorfer,Ganesh Adam,Torsten Bertram
关键词-EN: automated driving, vital for automated, Accurate, object detectors, radar-only object detectors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted for publication as ECCV 2024

点击查看摘要

Abstract:Accurate 3D object detection is vital for automated driving. While lidar sensors are well suited for this task, they are expensive and have limitations in adverse weather conditions. 3+1D imaging radar sensors offer a cost-effective, robust alternative but face challenges due to their low resolution and high measurement noise. Existing 3+1D imaging radar datasets include radar and lidar data, enabling cross-modal model improvements. Although lidar should not be used during inference, it can aid the training of radar-only object detectors. We explore two strategies to transfer knowledge from the lidar to the radar domain and radar-only object detectors: 1. multi-stage training with sequential lidar point cloud thin-out, and 2. cross-modal knowledge distillation. In the multi-stage process, three thin-out methods are examined. Our results show significant performance gains of up to 4.2 percentage points in mean Average Precision with multi-stage training and up to 3.9 percentage points with knowledge distillation by initializing the student with the teacher’s weights. The main benefit of these approaches is their applicability to other 3D object detection networks without altering their architecture, as we show by analyzing it on two different object detectors. Our code is available at this https URL
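A cross-modal distillation objective of the kind described can be sketched as below; representing features as flat lists and the `alpha` weighting are assumptions for illustration, not the paper's exact loss:

```python
def distillation_loss(student_feats, teacher_feats, detection_loss, alpha=0.5):
    # Feature-mimicking term: mean squared error between the radar student's
    # features and the lidar-trained teacher's features
    mse = sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / len(student_feats)
    # Blend with the ordinary detection loss; alpha is an illustrative weight
    return (1 - alpha) * detection_loss + alpha * mse
```

Because the loss only touches intermediate features, it can be bolted onto different detector architectures without changing them, which matches the portability claim in the abstract.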

[AI-34] CauseJudger: Identifying the Cause with LLMs for Abductive Logical Reasoning

链接: https://arxiv.org/abs/2409.05559
作者: Jinwei He,Feng Lu
关键词-EN: Large language models, encompassing common sense, Large language, solving diverse reasoning, abductive logical reasoning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been utilized in solving diverse reasoning tasks, encompassing common sense, arithmetic and deduction tasks. However, with difficulties of reversing thinking patterns and irrelevant premises, how to determine the authenticity of the cause in abductive logical reasoning remains underexplored. Inspired by the hypothesis-and-verification method and the identification of irrelevant information in the human thinking process, we propose a new framework for LLM abductive logical reasoning called CauseJudger (CJ), which identifies the authenticity of a possible cause by transforming thinking from reverse to forward and removing irrelevant information. In addition, we construct an abductive logical reasoning dataset for decision tasks called CauseLogics, which contains 200,000 tasks of varying reasoning lengths. Our experiments demonstrate the efficiency of CJ through overall experiments, ablation experiments, and case studies on our dataset and a reconstructed public dataset. Notably, CJ’s implementation is efficient, requiring only two calls to the LLM. Its impact is profound: when using gpt-3.5, CJ achieves a maximum correctness improvement of 41% compared to Zero-Shot-CoT. Moreover, with gpt-4, CJ attains an accuracy exceeding 90% across all datasets.
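The two-call pipeline can be sketched as follows; the `llm` callable and both prompt strings are hypothetical stand-ins, since the paper's actual prompts are not reproduced here:

```python
def cause_judger(llm, premises, candidate_cause, outcome):
    """Sketch of CJ's two LLM calls: prune irrelevant premises, then reason forward."""
    # Call 1: remove premises irrelevant to the outcome
    kept = llm(
        "Keep only the premises relevant to the outcome.\n"
        f"Premises: {premises}\nOutcome: {outcome}"
    )
    # Call 2: reason forward from the candidate cause instead of backward from the outcome
    verdict = llm(
        "Reasoning forward from the cause, does the outcome follow? Answer yes or no.\n"
        f"Cause: {candidate_cause}\nPremises: {kept}\nOutcome: {outcome}"
    )
    return "yes" in verdict.lower()
```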

[AI-35] Seeing Through the Mask: Rethinking Adversarial Examples for CAPTCHAs

链接: https://arxiv.org/abs/2409.05558
作者: Yahya Jabary,Andreas Plesner,Turlan Kuzhagaliyev,Roger Wattenhofer
关键词-EN: CAPTCHAs rely heavily, rely heavily, Modern CAPTCHAs rely, models, CAPTCHAs rely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Modern CAPTCHAs rely heavily on vision tasks that are supposedly hard for computers but easy for humans. However, advances in image recognition models pose a significant threat to such CAPTCHAs. These models can easily be fooled by generating some well-hidden “random” noise and adding it to the image, or hiding objects in the image. However, these methods are model-specific and thus can not aid CAPTCHAs in fooling all models. We show in this work that by allowing for more significant changes to the images while preserving the semantic information and keeping it solvable by humans, we can fool many state-of-the-art models. Specifically, we demonstrate that by adding masks of various intensities the Accuracy @ 1 (Acc@1) drops by more than 50%-points for all models, and supposedly robust models such as vision transformers see an Acc@1 drop of 80%-points. These masks can therefore effectively fool modern image classifiers, thus showing that machines have not caught up with humans – yet.
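A minimal sketch of adding an intensity-scaled mask to a grayscale image (pixels as nested lists); the additive blending rule is an assumption for illustration, not the paper's mask construction:

```python
def apply_mask(image, mask, intensity):
    """Brighten masked pixels by `intensity`, clamping to the 8-bit range."""
    return [
        [min(255, px + intensity * m) for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

Sweeping `intensity` upward is what would drive the Acc@1 drop reported in the abstract, while keeping the underlying shape recognizable to humans.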

[AI-36] SciAgents : Automating scientific discovery through multi-agent intelligent graph reasoning

链接: https://arxiv.org/abs/2409.05556
作者: Alireza Ghafarollahi,Markus J. Buehler
关键词-EN: identifying complex patterns, advancing scientific understanding, uncovering previously unseen, previously unseen connections, vast scientific data
类目: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A key challenge in artificial intelligence is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. In this work, we present SciAgents, an approach that leverages three core concepts: (1) the use of large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi-agent systems with in-situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses traditional human-driven research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the intelligent system yields material discoveries, critiques and improves existing hypotheses, retrieves up-to-date data about existing research, and highlights their strengths and limitations. Our case studies demonstrate scalable capabilities to combine generative AI, ontological representations, and multi-agent modeling, harnessing a `swarm of intelligence’ similar to biological systems. This provides new avenues for materials discovery and accelerates the development of advanced materials by unlocking Nature’s design principles.

[AI-37] HMAFlow: Learning More Accurate Optical Flow via Hierarchical Motion Field Alignment

链接: https://arxiv.org/abs/2409.05531
作者: Dianbo Ma,Kousuke Imamura,Ziyan Gao,Xiangjie Wang,Satoshi Yamane
关键词-EN: Optical flow estimation, long-standing visual task, improve optical flow, Optical flow, Motion Field Alignment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Optical flow estimation is a fundamental and long-standing visual task. In this work, we present a novel method, dubbed HMAFlow, to improve optical flow estimation in tough scenes, especially those with small objects. The proposed model mainly consists of two core components: a Hierarchical Motion Field Alignment (HMA) module and a Correlation Self-Attention (CSA) module. In addition, we rebuild 4D cost volumes by employing a Multi-Scale Correlation Search (MCS) layer and replacing average pooling in common cost volumes with a search strategy using multiple search ranges. Experimental results demonstrate that our model achieves the best generalization performance in comparison to other state-of-the-art methods. Specifically, compared with RAFT, our method achieves relative error reductions of 14.2% and 3.4% on the clean pass and final pass of the Sintel online benchmark, respectively. On the KITTI test benchmark, HMAFlow surpasses RAFT and GMA in the Fl-all metric by a relative margin of 6.8% and 7.7%, respectively. To facilitate future research, our code will be made available at this https URL.

[AI-38] Harmonic Reasoning in Large Language Models

链接: https://arxiv.org/abs/2409.05521
作者: Anna Kruspe
关键词-EN: Large Language Models, Large Language, including creative tasks, Language Models, including creative
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are becoming very popular and are used for many different purposes, including creative tasks in the arts. However, these models sometimes have trouble with specific reasoning tasks, especially those that involve logical thinking and counting. This paper looks at how well LLMs understand and reason when dealing with musical tasks like figuring out notes from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o to see how they handle these tasks. Our results show that while LLMs do well with note intervals, they struggle with more complicated tasks like recognizing chords and scales. This points out clear limits in current LLM abilities and shows where we need to make them better, which could help improve how they think and work in both artistic and other complex areas. We also provide an automatically generated benchmark data set for the described tasks.

[AI-39] Using machine learning for fault detection in lighthouse light sensors

链接: https://arxiv.org/abs/2409.05495
作者: Michael Kampouridis,Nikolaos Vastardis,George Rayment
关键词-EN: ensuring maritime safety, signaling hazardous areas, aiding harbor entries, dangerous coastlines, aerial navigation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Lighthouses play a crucial role in ensuring maritime safety by signaling hazardous areas such as dangerous coastlines, shoals, reefs, and rocks, along with aiding harbor entries and aerial navigation. This is achieved through the use of photoresistor sensors that activate or deactivate based on the time of day. However, a significant issue is the potential malfunction of these sensors, leading to the gradual misalignment of the light’s operational timing. This paper introduces an innovative machine learning-based approach for automatically detecting such malfunctions. We evaluate four distinct algorithms: decision trees, random forest, extreme gradient boosting, and multi-layer perceptron. Our findings indicate that the multi-layer perceptron is the most effective, capable of detecting timing discrepancies as small as 10-15 minutes. This accuracy makes it a highly efficient tool for automating the detection of faults in lighthouse light sensors.
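The fault signal being learned here is a timing drift between the expected and logged switching times. A fixed-threshold stand-in for the trained MLP can be sketched as:

```python
def timing_deviation_minutes(expected, actual):
    """Absolute deviation in minutes between "HH:MM" switch-time strings."""
    def to_minutes(t):
        h, m = t.split(":")
        return 60 * int(h) + int(m)
    return abs(to_minutes(expected) - to_minutes(actual))

def is_fault(expected, actual, tolerance=10):
    # The paper's MLP detects drifts of 10-15 minutes; a fixed tolerance stands in here
    return timing_deviation_minutes(expected, actual) >= tolerance
```

In the paper the decision boundary is learned from data rather than hard-coded, which is what lets the classifier catch gradual misalignment patterns a single threshold would miss.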

[AI-40] Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models

链接: https://arxiv.org/abs/2409.05486
作者: Camilo Thorne,Christian Druckenbrodt,Kinga Szarkowska,Deepika Goyal,Pranita Marajan,Vijay Somanath,Corey Harper,Mao Yan,Tony Scerri
关键词-EN: assessed with automated, quality and capabilities, fully assessed, benchmark evaluations, Abstract
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 11 pages, 5 tables, 6 figures

点击查看摘要

Abstract:The quality and capabilities of large language models cannot be currently fully assessed with automated, benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from natural language generation literature are required. One recent best-practice consists in using A/B-testing frameworks, which capture preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it a large but not massive (8.8B parameter) decoder-only foundational transformer trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets is compared to OpenAI’s GPT-3.5-turbo and Meta’s foundational 7B parameter Llama 2 model against multiple criteria. Results indicate – even if IRR scores were generally low – a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large and were trained on very large datasets. But at the same time, indicate that for less massive models training on smaller but well-curated training sets can potentially give rise to viable alternatives in the biomedical domain.

[AI-41] CRADLE-VAE: Enhancing Single-Cell Gene Perturbation Modeling with Counterfactual Reasoning-based Artifact Disentanglement

链接: https://arxiv.org/abs/2409.05484
作者: Seungheun Baek,Soyon Park,Yan Ting Chok,Junhyun Lee,Jueon Park,Mogan Gim,Jaewoo Kang
关键词-EN: Predicting cellular responses, deep learning models, learning models playing, Predicting cellular, personalized therapeutics
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Predicting cellular responses to various perturbations is a critical focus in drug discovery and personalized therapeutics, with deep learning models playing a significant role in this endeavor. Single-cell datasets contain technical artifacts that may hinder the predictability of such models, which poses quality control issues highly regarded in this area. To address this, we propose CRADLE-VAE, a causal generative framework tailored for single-cell gene perturbation modeling, enhanced with counterfactual reasoning-based artifact disentanglement. Throughout training, CRADLE-VAE models the underlying latent distribution of technical artifacts and perturbation effects present in single-cell datasets. It employs counterfactual reasoning to effectively disentangle such artifacts by modulating the latent basal spaces and learns robust features for generating cellular response data with improved quality. Experimental results demonstrate that this approach improves not only treatment effect estimation performance but also generative quality as well. The CRADLE-VAE codebase is publicly available at this https URL.

[AI-42] Proto-OOD: Enhancing OOD Object Detection with Prototype Feature Similarity

链接: https://arxiv.org/abs/2409.05466
作者: Junkun Chen,Jilin Mei,Liang Chen,Fangzhou Zhao,Yu Hu
关键词-EN: object detectors commonly, limited training samples, detectors commonly result, low accuracy, object detectors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14pages

点击查看摘要

Abstract:The limited training samples for object detectors commonly result in low accuracy in out-of-distribution (OOD) object detection. We have observed that feature vectors of the same class tend to cluster tightly in feature space, whereas those of different classes are more scattered. This insight motivates us to leverage feature similarity for OOD detection. Drawing on the concept of prototypes prevalent in few-shot learning, we introduce a novel network architecture, Proto-OOD, designed for this purpose. Proto-OOD enhances prototype representativeness through contrastive loss and identifies OOD data by assessing the similarity between input features and prototypes. It employs a negative embedding generator to create negative embeddings, which are then used to train the similarity module. Proto-OOD achieves significantly lower FPR95 on the MS-COCO dataset and higher mAP on the Pascal VOC dataset, when utilizing Pascal VOC as the ID dataset and MS-COCO as the OOD dataset. Additionally, we identify limitations in existing evaluation metrics and propose an enhanced evaluation protocol.
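The similarity test at the core of this idea can be sketched as follows; cosine similarity and the fixed threshold are illustrative choices, not the paper's learned similarity module:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def is_ood(feature, prototypes, threshold=0.5):
    # Flag OOD when the feature is not sufficiently similar to any class prototype
    return max(cosine(feature, p) for p in prototypes) < threshold
```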

[AI-43] Visualizing Extensions of Argumentation Frameworks as Layered Graphs

链接: https://arxiv.org/abs/2409.05457
作者: Martin Nöllenburg,Christian Pirker,Anna Rapberger,Stefan Woltran,Jules Wulms
关键词-EN: argumentation frameworks, crucial for enabling, enabling a wide, wide applicability, applicability of argumentative
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The visualization of argumentation frameworks (AFs) is crucial for enabling a wide applicability of argumentative tools. However, their visualization is often considered only as an accompanying part of tools for computing semantics and standard graphical representations are used. We introduce a new visualization technique that draws an AF, together with an extension (as part of the input), as a 3-layer graph layout. Our technique supports the user to more easily explore the visualized AF, better understand extensions, and verify algorithms for computing semantics. To optimize the visual clarity and aesthetics of this layout, we propose to minimize edge crossings in our 3-layer drawing. We do so by an exact ILP-based approach, but also propose a fast heuristic pipeline. Via a quantitative evaluation, we show that the heuristic is feasible even for large instances, while producing at most twice as many crossings as an optimal drawing in most cases.

[AI-44] Semifactual Explanations for Reinforcement Learning

链接: https://arxiv.org/abs/2409.05435
作者: Jasmina Gajcin,Jovan Jeromela,Ivana Dusparic
关键词-EN: Deep reinforcement learning, Reinforcement Learning, Semifactual explanations, trial and error, learning paradigm
类目: Artificial Intelligence (cs.AI)
*备注: 9 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Reinforcement Learning (RL) is a learning paradigm in which the agent learns from its environment through trial and error. Deep reinforcement learning (DRL) algorithms represent the agent’s policies using neural networks, making their decisions difficult to interpret. Explaining the behaviour of DRL agents is necessary to advance user trust, increase engagement, and facilitate integration with real-life tasks. Semifactual explanations aim to explain an outcome by providing “even if” scenarios, such as “even if the car were moving twice as slowly, it would still have to swerve to avoid crashing”. Semifactuals help users understand the effects of different factors on the outcome and support the optimisation of resources. While extensively studied in psychology and even utilised in supervised learning, semifactuals have not been used to explain the decisions of RL systems. In this work, we develop a first approach to generating semifactual explanations for RL agents. We start by defining five properties of desirable semifactual explanations in RL and then introducing SGRL-Rewind and SGRL-Advance, the first algorithms for generating semifactual explanations in RL. We evaluate the algorithms in two standard RL environments and find that they generate semifactuals that are easier to reach, represent the agent’s policy better, and are more diverse compared to baselines. Lastly, we conduct and analyse a user study to assess the participant’s perception of semifactual explanations of the agent’s actions.

[AI-45] State-Novelty Guided Action Persistence in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.05433
作者: Jianshu Hu,Paul Weng,Yutong Ban
关键词-EN: deep reinforcement learning, promising approach, deep reinforcement, reinforcement learning, exploration-exploitation dilemma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:While a powerful and promising approach, deep reinforcement learning (DRL) still suffers from sample inefficiency, which can be notably improved by resorting to more sophisticated techniques to address the exploration-exploitation dilemma. One such technique relies on action persistence (i.e., repeating an action over multiple steps). However, previous work exploiting action persistence either applies a fixed strategy or learns additional value functions (or policy) for selecting the repetition number. In this paper, we propose a novel method to dynamically adjust the action persistence based on the current exploration status of the state space. In such a way, our method does not require training of additional value functions or policy. Moreover, the use of a smooth scheduling of the repeat probability allows a more effective balance between exploration and exploitation. Furthermore, our method can be seamlessly integrated into various basic exploration strategies to incorporate temporal persistence. Finally, extensive experiments on different DMControl tasks demonstrate that our state-novelty guided action persistence method significantly improves the sample efficiency.
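One way to read the idea is a repeat probability that scales with state novelty; the direction of the mapping and the linear schedule below are assumptions for illustration, not the paper's exact scheme:

```python
import random

def persist_action(novelty, prev_action, sample_new, max_repeat_prob=0.9, rng=random):
    """Repeat the previous action with probability proportional to state novelty.

    Assumed mapping: novelty near 1 (rarely visited state) favors longer
    persistence to push exploration deeper; novelty near 0 favors fresh actions.
    """
    repeat_prob = max_repeat_prob * novelty
    if prev_action is not None and rng.random() < repeat_prob:
        return prev_action
    return sample_new()
```

Because the schedule is smooth rather than a fixed repeat count, it can wrap any base exploration strategy supplied as `sample_new`.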

[AI-46] AD-Net: Attention-based dilated convolutional residual network with guided decoder for robust skin lesion segmentation

链接: https://arxiv.org/abs/2409.05420
作者: Asim Naveed,Syed S. Naqvi,Tariq M. Khan,Shahzaib Iqbal,M. Yaqoob Wani,Haroon Ahmed Khan
关键词-EN: computer-aided diagnosis tools, skin cancer treatment, skin lesion segmentation, diagnosis tools employed, computer-aided diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In computer-aided diagnosis tools employed for skin cancer treatment and early diagnosis, skin lesion segmentation is important. However, achieving precise segmentation is challenging due to inherent variations in appearance, contrast, texture, and blurry lesion boundaries. This research presents a robust approach utilizing a dilated convolutional residual network, which incorporates an attention-based spatial feature enhancement block (ASFEB) and employs a guided decoder strategy. In each dilated convolutional residual block, dilated convolution is employed to broaden the receptive field with varying dilation rates. To improve the spatial feature information of the encoder, we employed an attention-based spatial feature enhancement block in the skip connections. The ASFEB in our proposed method combines feature maps obtained from average and maximum-pooling operations. These combined features are then weighted using the active outcome of global average pooling and convolution operations. Additionally, we have incorporated a guided decoder strategy, where each decoder block is optimized using an individual loss function to enhance the feature learning process in the proposed AD-Net. The proposed AD-Net presents a significant benefit by necessitating fewer model parameters compared to its peer methods. This reduction in parameters directly impacts the number of labeled data required for training, facilitating faster convergence during the training process. The effectiveness of the proposed AD-Net was evaluated using four public benchmark datasets. We conducted a Wilcoxon signed-rank test to verify the efficiency of the AD-Net. The outcomes suggest that our method surpasses other cutting-edge methods in performance, even without the implementation of data augmentation strategies.
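The avg/max fusion inside ASFEB can be sketched on a single 2D feature map; the scalar `alpha` stands in for the learned weighting the paper derives from global average pooling and convolution:

```python
def asfeb_combine(feature_map, alpha=0.5):
    """Blend average-pooled and max-pooled summaries of a 2D feature map."""
    flat = [v for row in feature_map for v in row]
    avg = sum(flat) / len(flat)
    mx = max(flat)
    # alpha is an illustrative stand-in for the learned gate in the paper
    return alpha * avg + (1 - alpha) * mx
```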

[AI-47] CipherDM: Secure Three-Party Inference for Diffusion Model Sampling

链接: https://arxiv.org/abs/2409.05414
作者: Xin Zhao,Xiaojun Chen,Xudong Chen,He Li,Tingyu Fan,Zhendong Zhao
关键词-EN: Diffusion Models, synthesis results, results in image, image generation, Diffusion
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Models (DMs) achieve state-of-the-art synthesis results in image generation and have been applied to various fields. However, DMs sometimes seriously violate user privacy during usage, making the protection of privacy an urgent issue. Using traditional privacy computing schemes like Secure Multi-Party Computation (MPC) directly in DMs faces significant computation and communication challenges. To address these issues, we propose CipherDM, the first novel, versatile and universal framework applying MPC technology to DMs for secure sampling, which can be widely implemented on multiple DM based tasks. We thoroughly analyze sampling latency breakdown, find time-consuming parts and design corresponding secure MPC protocols for computing nonlinear activations including SoftMax, SiLU and Mish. CipherDM is evaluated on popular architectures (DDPM, DDIM) using the MNIST dataset and on SD deployed by diffusers. Compared to direct implementation on SPU, our approach improves running time by approximately 1.084x to 2.328x, and reduces communication costs by approximately 1.212x to 1.791x.

[AI-48] A Survey of Multimodal Composite Editing and Retrieval

链接: https://arxiv.org/abs/2409.05405
作者: Suyan Li,Fuxiang Huang,Lei Zhang
关键词-EN: improve retrieval systems, Multimodal composite retrieval, composite retrieval, real world, focus of research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: 22 pages, 3 figures, and 11 tables

点击查看摘要

Abstract:In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, image and audio, etc. to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. In this survey, we systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in large model era, and have also witnessed some surveys in multimodal learning and vision-language models with transformers published in the PAMI journal. To the best of our knowledge, this survey is the first comprehensive review of the literature on multimodal composite retrieval, which is a timely complement of multimodal fusion to existing reviews. To help readers’ quickly track this field, we build the project page for this survey, which can be found at this https URL.

[AI-49] HyperSMOTE: A Hypergraph-based Oversampling Approach for Imbalanced Node Classifications

链接: https://arxiv.org/abs/2409.05402
作者: Ziming Zhao,Tiehua Zhang,Zijian Yi,Zhishu Shen
关键词-EN: extract higher-order relationships, data scenarios due, compared to traditional, multimodal data scenarios, increasingly utilized
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hypergraphs are increasingly utilized in both unimodal and multimodal data scenarios due to their superior ability to model and extract higher-order relationships among nodes, compared to traditional graphs. However, current hypergraph models are encountering challenges related to imbalanced data, as this imbalance can lead to biases in the model towards the more prevalent classes. While the existing techniques, such as GraphSMOTE, have improved classification accuracy for minority samples in graph data, they still fall short when addressing the unique structure of hypergraphs. Inspired by the SMOTE concept, we propose HyperSMOTE as a solution to alleviate the class imbalance issue in hypergraph learning. This method involves a two-step process: initially synthesizing minority class nodes, followed by the integration of those nodes into the original hypergraph. We synthesize new nodes based on samples from minority classes and their neighbors. At the same time, in order to solve the problem of integrating the new node into the hypergraph, we train a decoder based on the original hypergraph incidence matrix to adaptively associate the augmented node to hyperedges. We conduct extensive evaluation on multiple single-modality datasets, such as Cora, Cora-CA and Citeseer, as well as the multimodal conversation dataset MELD to verify the effectiveness of HyperSMOTE, showing average performance gains of 3.38% and 2.97% in accuracy, respectively.
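The SMOTE-style synthesis step can be sketched as a random interpolation between a minority node and one of its same-class neighbors; the subsequent hyperedge assignment via the trained decoder is not shown:

```python
import random

def synthesize_minority_node(x, neighbor, rng=None):
    """SMOTE-style synthesis: interpolate between a minority node's feature
    vector and that of a same-class neighbor, at a random point on the segment."""
    lam = (rng or random.Random(0)).random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]
```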

[AI-50] FacialFlowNet: Advancing Facial Optical Flow Estimation with a Diverse Dataset and a Decomposed Model

链接: https://arxiv.org/abs/2409.05396
作者: Jianzhi Lu,Ruian He,Shili Zhou,Weimin Tan,Bo Yan
关键词-EN: facial optical flow, optical flow, Facial movements play, Facial, flow
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ACMMM2024

点击查看摘要

Abstract:Facial movements play a crucial role in conveying attitude and intentions, and facial optical flow provides a dynamic and detailed representation of them. However, the scarcity of datasets and a modern baseline hinders the progress in facial optical flow research. This paper proposes FacialFlowNet (FFN), a novel large-scale facial optical flow dataset, and the Decomposed Facial Flow Model (DecFlow), the first method capable of decomposing facial flow. FFN comprises 9,635 identities and 105,970 image pairs, offering unprecedented diversity for detailed facial and head motion analysis. DecFlow features a facial semantic-aware encoder and a decomposed flow decoder, excelling in accurately estimating and decomposing facial flow into head and expression components. Comprehensive experiments demonstrate that FFN significantly enhances the accuracy of facial flow estimation across various optical flow methods, achieving up to an 11% reduction in Endpoint Error (EPE) (from 3.91 to 3.48). Moreover, DecFlow, when coupled with FFN, outperforms existing methods in both synthetic and real-world scenarios, enhancing facial expression analysis. The decomposed expression flow achieves a substantial accuracy improvement of 18% (from 69.1% to 82.1%) in micro-expression recognition. These contributions represent a significant advancement in facial motion analysis and optical flow estimation. Codes and datasets can be found.
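The decomposition can be sketched under the assumption that total facial flow is the sum of head flow and expression flow, so the expression component is the residual; DecFlow learns this split rather than computing it by subtraction:

```python
def expression_flow(total_flow, head_flow):
    """Residual decomposition: expression flow = total facial flow - head flow.
    Each flow is a 2D grid of scalar displacement components."""
    return [
        [t - h for t, h in zip(total_row, head_row)]
        for total_row, head_row in zip(total_flow, head_flow)
    ]
```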

[AI-51] Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision Language Modeling

链接: https://arxiv.org/abs/2409.05395
作者: Georgios Pantazopoulos,Malvina Nikandrou,Alessandro Suglia,Oliver Lemon,Arash Eshghi
关键词-EN: Visual Language Models, Language Models, study explores replacing, recent structured state, structured state space
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models of up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve better performance in visual grounding, and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty of performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that task-aware encoding yields minimal performance gains on grounding; however, Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba shows promising performance on tasks where the correct output relies on a summary of the image, but struggles when retrieval of explicit information from the context is required.

[AI-52] Towards Building a Robust Knowledge Intensive Question Answering Model with Large Language Models NLPCC-2024

链接: https://arxiv.org/abs/2409.05385
作者: Hong Xingyun Hong,Shao Yan Shao,Wang Zhilin Wang,Duan Manni Duan,Jin Xiongnan
关键词-EN: question answering, utilize external information, greatly enhanced, enhanced the intelligence, intelligence and fluency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by NLPCC-2024

点击查看摘要

Abstract:The development of LLMs has greatly enhanced the intelligence and fluency of question answering, while the emergence of retrieval augmentation has enabled models to better utilize external information. However, the presence of noise and errors in retrieved information poses challenges to the robustness of LLMs. In this work, to evaluate the model’s performance under multiple interferences, we first construct a dataset based on machine reading comprehension datasets that simulates various scenarios, including critical information absence, noise, and conflicts. To address the decline in model accuracy caused by noisy external information, we propose a data-augmentation-based fine-tuning method to enhance the LLM’s robustness against noise. Additionally, a contrastive learning approach is used to preserve the model’s ability to discriminate among external information. We conducted experiments on both existing LLMs and our approach; the results, evaluated by GPT-4, indicate that our proposed methods improve model robustness while strengthening the model’s discrimination capability.

[AI-53] Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition AAAI2020

链接: https://arxiv.org/abs/2409.05384
作者: Shiming Ge,Kangkai Zhang,Haolin Liu,Yingying Hua,Shengwei Zhao,Xin Jin,Hao Wen
关键词-EN: recent deep models, low accuracy due, directly applying, resolution degradation, low-resolution images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Accepted by AAAI 2020

点击查看摘要

Abstract:Despite the great success of recent deep models on many image recognition tasks, directly applying them to recognize low-resolution images may suffer from low accuracy due to the loss of informative details during resolution degradation. However, these images are still recognizable to subjects who are familiar with the corresponding high-resolution ones. Inspired by this, we propose a teacher-student learning approach to facilitate low-resolution image recognition via hybrid-order relational knowledge distillation. The approach comprises three streams: the teacher stream is pretrained to recognize high-resolution images with high accuracy, the student stream learns to identify low-resolution images by mimicking the teacher’s behaviors, and an extra assistant stream is introduced as a bridge to help transfer knowledge from the teacher to the student. To extract sufficient knowledge for reducing the loss in accuracy, the learning of the student is supervised with multiple losses, which preserve the similarities in various order relational structures. In this way, the capability of recovering missing details of familiar low-resolution images can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on metric learning, low-resolution image classification and low-resolution face recognition tasks show the effectiveness of our approach, while using reduced models.
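The "order relational" supervision can be illustrated with a first-order (pairwise-distance) relational distillation loss. This is a minimal numpy sketch with invented names, not the paper's exact multi-loss, multi-stream formulation:

```python
import numpy as np

def pairwise_relation(feats):
    """First-order relational structure: the matrix of pairwise L2
    distances between the embeddings in a batch."""
    diff = feats[:, None, :] - feats[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def relational_kd_loss(student_feats, teacher_feats):
    """Penalize mismatch between the student's and teacher's pairwise
    relational structures (an L2 variant of relational distillation)."""
    rs = pairwise_relation(student_feats)
    rt = pairwise_relation(teacher_feats)
    return float(np.mean((rs - rt) ** 2))
```

The student is rewarded for reproducing the *relationships* among the teacher's embeddings rather than the embeddings themselves, which is the structural idea behind higher-order relational losses as well.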

[AI-54] Deep Learning for Video Anomaly Detection: A Review

链接: https://arxiv.org/abs/2409.05383
作者: Peng Wu,Chengyu Pan,Yuting Yan,Guansong Pang,Peng Wang,Yanning Zhang
关键词-EN: Video anomaly detection, Video anomaly, VAD, aims to discover, discover behaviors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Video anomaly detection (VAD) aims to discover behaviors or events that deviate from normality in videos. As a long-standing task in the field of computer vision, VAD has witnessed considerable progress. In the era of deep learning, with the explosion of architectures of continuously growing capability and capacity, a great variety of deep-learning-based methods are constantly emerging for the VAD task, greatly improving the generalization ability of detection algorithms and broadening the application scenarios. Such a multitude of methods and a large body of literature therefore make a comprehensive survey a pressing necessity. In this paper, we present an extensive and comprehensive research review covering five different categories, namely semi-supervised, weakly supervised, fully supervised, unsupervised and open-set supervised VAD, and we also delve into the latest VAD works based on pre-trained large models, remedying the limitation of past reviews that focused only on semi-supervised VAD and small-model-based methods. For VAD tasks with different levels of supervision, we construct a well-organized taxonomy, discuss the characteristics of the different types of methods in depth, and show their performance comparisons. In addition, this review covers the public datasets, open-source code, and evaluation metrics for all the aforementioned VAD tasks. Finally, we provide several important research directions for the VAD community.

[AI-55] PersonaTalk: Bring Attention to Your Persona in Visual Dubbing SIGGRAPH

链接: https://arxiv.org/abs/2409.05379
作者: Longhao Zhang,Shuang Liang,Zhipeng Ge,Tianshu Hu
关键词-EN: accurate lip synchronization, synthesizing accurate lip, audio-driven visual dubbing, lip synchronization, remains a considerable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Accepted at SIGGRAPH Asia 2024 (Conference Track)

点击查看摘要

Abstract:For audio-driven visual dubbing, it remains a considerable challenge to uphold and highlight speaker’s persona while synthesizing accurate lip synchronization. Existing methods fall short of capturing speaker’s unique speaking style or preserving facial details. In this paper, we present PersonaTalk, an attention-based two-stage framework, including geometry construction and face rendering, for high-fidelity and personalized visual dubbing. In the first stage, we propose a style-aware audio encoding module that injects speaking style into audio features through a cross-attention layer. The stylized audio features are then used to drive speaker’s template geometry to obtain lip-synced geometries. In the second stage, a dual-attention face renderer is introduced to render textures for the target geometries. It consists of two parallel cross-attention layers, namely Lip-Attention and Face-Attention, which respectively sample textures from different reference frames to render the entire face. With our innovative design, intricate facial details can be well preserved. Comprehensive experiments and user studies demonstrate our advantages over other state-of-the-art methods in terms of visual quality, lip-sync accuracy and persona preservation. Furthermore, as a person-generic framework, PersonaTalk can achieve competitive performance as state-of-the-art person-specific methods. Project Page: this https URL.
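Both stages of PersonaTalk rest on cross-attention between two feature streams. A minimal numpy sketch of scaled dot-product cross-attention follows (names and shapes are ours; the paper's layers additionally carry style and reference-frame semantics):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries from one stream attend
    to keys/values from another (e.g., audio features attending to
    reference-frame textures)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values  # each output row is a convex mix of values
```

In the dual-attention renderer described above, two such layers run in parallel (Lip-Attention and Face-Attention), each drawing its keys/values from different reference frames.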

[AI-56] KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models

链接: https://arxiv.org/abs/2409.05370
作者: Yingshu Li,Zhanyu Wang,Yunyi Liu,Lei Wang,Lingqiao Liu,Luping Zhou
关键词-EN: Large Language Models, Harnessing the robust, Large Language, Language Models, automated radiology report
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Harnessing the robust capabilities of Large Language Models (LLMs) for narrative generation, logical reasoning, and common-sense knowledge integration, this study delves into utilizing LLMs to enhance automated radiology report generation (R2Gen). Despite the wealth of knowledge within LLMs, efficiently triggering relevant knowledge within these large models for specific tasks like R2Gen poses a critical research challenge. This paper presents KARGEN, a Knowledge-enhanced Automated radiology Report GENeration framework based on LLMs. Utilizing a frozen LLM to generate reports, the framework integrates a knowledge graph to unlock chest disease-related knowledge within the LLM to enhance the clinical utility of generated reports. This is achieved by leveraging the knowledge graph to distill disease-related features in a designed way. Since a radiology report encompasses both normal and disease-related findings, the extracted graph-enhanced disease-related features are integrated with regional image features, attending to both aspects. We explore two fusion methods to automatically prioritize and select the most relevant features. The fused features are employed by LLM to generate reports that are more sensitive to diseases and of improved quality. Our approach demonstrates promising results on the MIMIC-CXR and IU-Xray datasets.

[AI-57] BAMDP Shaping: a Unified Theoretical Framework for Intrinsic Motivation and Reward Shaping

链接: https://arxiv.org/abs/2409.05358
作者: Aly Lidayan,Michael Dennis,Stuart Russell
关键词-EN: Intrinsic motivation, Markov Decision Processes, reinforcement learning, agents by adding, common methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Intrinsic motivation (IM) and reward shaping are common methods for guiding the exploration of reinforcement learning (RL) agents by adding pseudo-rewards. Designing these rewards is challenging, however, and they can counter-intuitively harm performance. To address this, we characterize them as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formalizes the value of exploration by formulating the RL process as updating a prior over possible MDPs through experience. RL algorithms can be viewed as BAMDP policies; instead of attempting to find optimal algorithms by solving BAMDPs directly, we use this framing as a theoretical lens for understanding how pseudo-rewards guide suboptimal algorithms. By decomposing BAMDP state value into the value of the information collected plus the prior value of the physical state, we show how pseudo-rewards can help by compensating for RL algorithms’ misestimation of these two terms, yielding a new typology of IM and reward shaping approaches. We carefully extend the potential-based shaping theorem to BAMDPs to prove that when pseudo-rewards are BAMDP Potential-based shaping Functions (BAMPFs), they preserve the optimal, or approximately optimal, behavior of RL algorithms; otherwise, they can corrupt even optimal learners. We finally give guidance on how to design or convert existing pseudo-rewards to BAMPFs by expressing assumptions about the environment as potential functions on BAMDP states.
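The classical potential-based shaping term that the paper lifts to BAMDP states is simple to state in code. The sketch below is the standard MDP version (the potential `phi` and discount `gamma` are assumptions of the example, not the paper's BAMDP construction):

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: add F(s, s') = gamma * phi(s') - phi(s)
    to the environment reward r. In MDPs this provably preserves optimal
    policies; the paper extends the guarantee to potentials on BAMDP states."""
    return r + gamma * phi(s_next) - phi(s)
```

With `gamma = 1` the shaping terms telescope, so the total added reward along any trajectory depends only on the endpoint potentials, which is the intuition behind why optimal behavior is preserved.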

[AI-58] TriplePlay: Enhancing Federated Learning with CLIP for Non-IID Data and Resource Efficiency

链接: https://arxiv.org/abs/2409.05347
作者: Ahmed Imteaj,Md Zarif Hossain,Saika Zaman,Abdur R. Shahid
关键词-EN: offer significant opportunities, privacy-preserving artificial intelligence, Federated Learning, offer significant, artificial intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid advancement and increasing complexity of pretrained models, exemplified by CLIP, offer significant opportunities as well as challenges for Federated Learning (FL), a critical component of privacy-preserving artificial intelligence. This research delves into the intricacies of integrating large foundation models like CLIP within FL frameworks to enhance privacy, efficiency, and adaptability across heterogeneous data landscapes. It specifically addresses the challenges posed by non-IID data distributions, the computational and communication overheads of leveraging such complex models, and the skewed representation of classes within datasets. We propose TriplePlay, a framework that integrates CLIP as an adapter to enhance FL’s adaptability and performance across diverse data distributions. This approach addresses the long-tail distribution challenge to ensure fairness while reducing resource demands through quantization and low-rank adaptation techniques. Our simulation results demonstrate that TriplePlay effectively decreases GPU usage costs and speeds up the learning process, achieving convergence with reduced communication overhead.
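Of the resource-reduction techniques mentioned, low-rank adaptation is easy to sketch: the frozen weight is augmented by a trainable low-rank product, so only a small fraction of parameters are updated and communicated. Names and the scaling convention below are illustrative, not TriplePlay's implementation:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Low-rank adaptation: frozen weight W (d_in x d_out) plus a
    trainable update A @ B with A (d_in x r), B (r x d_out), so only
    r * (d_in + d_out) parameters are trained instead of d_in * d_out."""
    return x @ (W + alpha * (A @ B))
```

In a federated setting, clients would train and transmit only `A` and `B`, which is where the communication savings come from.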

[AI-59] GDFlow: Anomaly Detection with NCDE-based Normalizing Flow for Advanced Driver Assistance System

链接: https://arxiv.org/abs/2409.05346
作者: Kangjun Lee,Minha Kim,Youngho Jun,Simon S. Woo
关键词-EN: Adaptive Cruise Control, Driver Assistance Systems, Advanced Driver Assistance, Cruise Control, Assistance Systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:For electric vehicles, the Adaptive Cruise Control (ACC) in Advanced Driver Assistance Systems (ADAS) is designed to assist braking based on driving conditions, road inclines, predefined deceleration strengths, and user braking patterns. However, the driving data collected during the development of ADAS are generally limited and lack diversity. This deficiency leads to late or aggressive braking for different users. Crucially, it is necessary to effectively identify anomalies, such as unexpected or inconsistent braking patterns in ADAS, especially given the challenge of working with unlabelled, limited, and noisy datasets from real-world electric vehicles. In order to tackle the aforementioned challenges in ADAS, we propose Graph Neural Controlled Differential Equation Normalizing Flow (GDFlow), a model that leverages Normalizing Flow (NF) with Neural Controlled Differential Equations (NCDE) to learn the distribution of normal driving patterns continuously. Compared to the traditional clustering or anomaly detection algorithms, our approach effectively captures the spatio-temporal information from different sensor data and more accurately models continuous changes in driving patterns. Additionally, we introduce a quantile-based maximum likelihood objective to improve the likelihood estimate of the normal data near the boundary of the distribution, enhancing the model’s ability to distinguish between normal and anomalous patterns. We validate GDFlow using real-world electric vehicle driving data that we collected from Hyundai IONIQ5 and GV80EV, achieving state-of-the-art performance compared to six baselines across four dataset configurations of different vehicle types and drivers. Furthermore, our model outperforms the latest anomaly detection methods across four time series benchmark datasets. Our approach demonstrates superior efficiency in inference time compared to existing methods.

[AI-60] Early-exit Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.05336
作者: Edanur Demir,Emre Akbas
关键词-EN: convolutional neural networks, computational cost, aimed at developing, developing a method, method that reduces
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper is aimed at developing a method that reduces the computational cost of convolutional neural networks (CNN) during inference. Conventionally, the input data pass through a fixed neural network architecture. However, easy examples can be classified at early stages of processing and conventional networks do not take this into account. In this paper, we introduce ‘Early-exit CNNs’, EENets for short, which adapt their computational cost based on the input by stopping the inference process at certain exit locations. In EENets, there are a number of exit blocks each of which consists of a confidence branch and a softmax branch. The confidence branch computes the confidence score of exiting (i.e. stopping the inference process) at that location; while the softmax branch outputs a classification probability vector. Both branches are learnable and their parameters are separate. During training of EENets, in addition to the classical classification loss, the computational cost of inference is taken into account as well. As a result, the network adapts its many confidence branches to the inputs so that less computation is spent for easy examples. Inference works as in conventional feed-forward networks, however, when the output of a confidence branch is larger than a certain threshold, the inference stops for that specific example. The idea of EENets is applicable to available CNN architectures such as ResNets. Through comprehensive experiments on MNIST, SVHN, CIFAR10 and Tiny-ImageNet datasets, we show that early-exit (EE) ResNets achieve similar accuracy with their non-EE versions while reducing the computational cost to 20% of the original. Code is available at this https URL
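A minimal sketch of the early-exit inference rule described above (the block interface and threshold handling are our assumptions; in EENets the confidence branches are learned jointly with the classifier):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_predict(x, blocks, threshold=0.5):
    """Run exit blocks in order; each block returns (confidence, logits).
    Stop at the first exit whose confidence clears the threshold, falling
    back to the final block for hard examples."""
    for k, block in enumerate(blocks):
        conf, logits = block(x)
        if conf >= threshold or k == len(blocks) - 1:
            return k, int(np.argmax(softmax(logits)))
```

Easy inputs exit at an early block and skip the remaining computation, which is how the reported cost reduction to roughly 20% of the original arises.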

[AI-61] A Multi-Modal Deep Learning Based Approach for House Price Prediction

链接: https://arxiv.org/abs/2409.05335
作者: Md Hasebul Hasan,Md Abid Jahan,Mohammed Eunus Ali,Yuan-Fang Li,Timos Sellis
关键词-EN: house, house price, house price prediction, real estate sector, residential real estate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Accurate prediction of house price, a vital aspect of the residential real estate sector, is of substantial interest for a wide range of stakeholders. However, predicting house prices is a complex task due to the significant variability influenced by factors such as house features, location, neighborhood, and many others. Despite numerous attempts utilizing a wide array of algorithms, including recent deep learning techniques, to predict house prices accurately, existing approaches have fallen short of considering a wide range of factors such as textual and visual features. This paper addresses this gap by comprehensively incorporating attributes, such as features, textual descriptions, geo-spatial neighborhood, and house images, typically showcased in real estate listings in a house price prediction system. Specifically, we propose a multi-modal deep learning approach that leverages different types of data to learn more accurate representation of the house. In particular, we learn a joint embedding of raw house attributes, geo-spatial neighborhood, and most importantly from textual description and images representing the house; and finally use a downstream regression model to predict the house price from this jointly learned embedding vector. Our experimental results with a real-world dataset show that the text embedding of the house advertisement description and image embedding of the house pictures in addition to raw attributes and geo-spatial embedding, can significantly improve the house price prediction accuracy. The relevant source code and dataset are publicly accessible at the following URL: this https URL
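The joint-embedding-then-regress design can be sketched in a few lines; the per-modality encoders are abstracted away as precomputed vectors, and the linear head is a stand-in for the paper's learned regression model:

```python
import numpy as np

def joint_embedding(attr_vec, geo_vec, text_vec, img_vec):
    """Concatenate per-modality embeddings (raw attributes, geo-spatial
    neighborhood, text description, house images) into one joint vector."""
    return np.concatenate([attr_vec, geo_vec, text_vec, img_vec])

def predict_price(embedding, w, b):
    """Downstream regressor on the joint embedding (linear here for
    illustration; any regression head could sit on top)."""
    return float(embedding @ w + b)
```

The point of the joint vector is that the downstream regressor sees textual and visual signals alongside the raw attributes, which is what the reported accuracy gains are attributed to.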

[AI-62] Sample-Efficient Bayesian Optimization with Transfer Learning for Heterogeneous Search Spaces

链接: https://arxiv.org/abs/2409.05325
作者: Aryan Deshwal,Sait Cakmak,Yuhou Xia,David Eriksson
关键词-EN: Bayesian optimization, sample-efficient optimization, search spaces, Bayesian, black-box functions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) is a powerful approach to sample-efficient optimization of black-box functions. However, in settings with very few function evaluations, a successful application of BO may require transferring information from historical experiments. These related experiments may not have exactly the same tunable parameters (search spaces), motivating the need for BO with transfer learning for heterogeneous search spaces. In this paper, we propose two methods for this setting. The first approach leverages a Gaussian process (GP) model with a conditional kernel to transfer information between different search spaces. Our second approach treats the missing parameters as hyperparameters of the GP model that can be inferred jointly with the other GP hyperparameters or set to fixed values. We show that these two methods perform well on several benchmark problems.

[AI-63] Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks

链接: https://arxiv.org/abs/2409.05319
作者: Kai Li,Khalid Zaman,Xingfeng Li,Masato Akagi,Masashi Unoki
关键词-EN: factory machinery malfunctions, Early detection, LNS feature, factory machinery, machinery malfunctions
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Early detection of factory machinery malfunctions is crucial in industrial applications. In machine anomalous sound detection (ASD), different machines exhibit unique vibration-frequency ranges based on their physical properties. Meanwhile, the human auditory system is adept at tracking both temporal and spectral dynamics of machine sounds. Consequently, integrating the computational auditory models of the human auditory system with machine-specific properties can be an effective approach to machine ASD. We first quantified the frequency importances of four types of machines using the Fisher ratio (F-ratio). The quantified frequency importances were then used to design machine-specific non-uniform filterbanks (NUFBs), which extract the log non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower bandwidth and higher filter distribution density in frequency regions with relatively high F-ratios. Finally, spectral and temporal modulation representations derived from the LNS feature were proposed. These proposed LNS feature and modulation representations are input into an autoencoder neural-network-based detector for ASD. The quantification results from the training set of the Malfunctioning Industrial Machine Investigation and Inspection dataset with a signal-to-noise ratio (SNR) of 6 dB reveal that the distinguishing information between normal and anomalous sounds of different machines is encoded non-uniformly in the frequency domain. By highlighting these important frequency regions using NUFBs, the LNS feature can significantly enhance performance using the metric of AUC (area under the receiver operating characteristic curve) under various SNR conditions. Furthermore, modulation representations can further improve performance. Specifically, temporal modulation is effective for fans, pumps, and sliders, while spectral modulation is particularly effective for valves.
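The per-frequency Fisher ratio that drives the filterbank design can be sketched as the separation between the two class means over the summed within-class variances, computed independently per frequency bin (a textbook two-class F-ratio; the paper's exact normalization may differ):

```python
import numpy as np

def fisher_ratio(normal_spec, anomalous_spec):
    """Per-frequency-bin Fisher ratio for two classes. Inputs are
    (n_frames, n_bins) magnitude spectra; higher values mark bins that
    better separate normal from anomalous sounds."""
    mu_n, mu_a = normal_spec.mean(0), anomalous_spec.mean(0)
    var_n, var_a = normal_spec.var(0), anomalous_spec.var(0)
    return (mu_n - mu_a) ** 2 / (var_n + var_a + 1e-12)
```

Bins with high F-ratio would then receive narrower, denser filters in the machine-specific NUFB.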

[AI-64] Tele-LLMs: A Series of Specialized Large Language Models for Telecommunications

链接: https://arxiv.org/abs/2409.05314
作者: Ali Maatouk,Kenny Chirino Ampudia,Rex Ying,Leandros Tassiulas
关键词-EN: natural language processing, impacted various fields, medicine and finance, emergence of large, significantly impacted
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The emergence of large language models (LLMs) has significantly impacted various fields, from natural language processing to sectors like medicine and finance. However, despite their rapid proliferation, the applications of LLMs in telecommunications remain limited, often relying on general-purpose models that lack domain-specific specialization. This lack of specialization results in underperformance, particularly when dealing with telecommunications-specific technical terminology and their associated mathematical representations. This paper addresses this gap by first creating and disseminating Tele-Data, a comprehensive dataset of telecommunications material curated from relevant sources, and Tele-Eval, a large-scale question-and-answer dataset tailored to the domain. Through extensive experiments, we explore the most effective training techniques for adapting LLMs to the telecommunications domain, ranging from examining the division of expertise across various telecommunications aspects to employing parameter-efficient techniques. We also investigate how models of different sizes behave during adaptation and analyze the impact of their training data on this behavior. Leveraging these findings, we develop and open-source Tele-LLMs, the first series of language models ranging from 1B to 8B parameters, specifically tailored for telecommunications. Our evaluations demonstrate that these models outperform their general-purpose counterparts on Tele-Eval while retaining their previously acquired capabilities, thus avoiding the catastrophic forgetting phenomenon.

[AI-65] Closed-Form Interpretation of Neural Network Latent Spaces with Symbolic Gradients

链接: https://arxiv.org/abs/2409.05305
作者: Zakaria Patel,Sebastian J. Wetzel
关键词-EN: neural networks, scientific fields, artificial neural networks, latent spaces, neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:It has been demonstrated in many scientific fields that artificial neural networks like autoencoders or Siamese networks encode meaningful concepts in their latent spaces. However, there does not exist a comprehensive framework for retrieving this information in a human-readable form without prior knowledge. In order to extract these concepts, we introduce a framework for finding closed-form interpretations of neurons in latent spaces of artificial neural networks. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. We interpret these neural networks by finding an intersection between the equivalence class and human-readable equations defined by a symbolic search space. The approach is demonstrated by retrieving invariants of matrices and conserved quantities of dynamical systems from latent spaces of Siamese neural networks.

[AI-66] Resource-Efficient Generative AI Model Deployment in Mobile Edge Networks

链接: https://arxiv.org/abs/2409.05303
作者: Yuxin Liang,Peng Yang,Yuanyuan He,Feng Lyu
关键词-EN: Artificial Intelligence-Generated Content, development of Artificial, Artificial Intelligence-Generated, Intelligence-Generated Content, content creation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The surging development of Artificial Intelligence-Generated Content (AIGC) marks a transformative era for content creation and production. Edge servers promise attractive benefits for hosting AIGC services compared to cloud-based solutions, e.g., reduced service delay and backhaul traffic load. However, the scarcity of available resources on the edge poses significant challenges to deploying generative AI models. In this paper, by characterizing the resource and delay demands of typical generative AI models, we find that the consumption of storage and GPU memory, as well as the model switching delay represented by I/O delay during the preloading phase, are significant and vary across models. These multidimensional coupling factors make efficient edge model deployment decisions difficult. Hence, we present a collaborative edge-cloud framework aiming to properly manage generative AI model deployment on the edge. Specifically, we formulate the edge model deployment problem, considering the heterogeneous features of models, as an optimization problem, and propose a model-level decision selection algorithm to solve it. It enables pooled resource sharing and optimizes the trade-off between resource consumption and delay in edge generative AI model deployment. Simulation results validate the efficacy of the proposed algorithm compared with baselines, demonstrating its potential to reduce overall costs by providing feature-aware model deployment decisions.

[AI-67] ERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors

链接: https://arxiv.org/abs/2409.05294
作者: Yichuan Mo,Hui Huang,Mingjie Li,Ang Li,Yisen Wang
关键词-EN: achieved notable success, remain highly vulnerable, producing specific undesirable, specific undesirable outputs, image generation
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have achieved notable success in image generation, but they remain highly vulnerable to backdoor attacks, which compromise their integrity by producing specific undesirable outputs when presented with a pre-defined trigger. In this paper, we investigate how to protect diffusion models from this dangerous threat. Specifically, we propose TERD, a backdoor defense framework that builds unified modeling for current attacks, which enables us to derive an accessible reversed loss. A trigger reversion strategy is further employed: an initial approximation of the trigger through noise sampled from a prior distribution, followed by refinement through differential multi-step samplers. Additionally, with the reversed trigger, we propose backdoor detection from the noise space, introducing the first backdoor input detection approach for diffusion models and a novel model detection algorithm that calculates the KL divergence between reversed and benign distributions. Extensive evaluations demonstrate that TERD secures a 100% True Positive Rate (TPR) and True Negative Rate (TNR) across datasets of varying resolutions. TERD also demonstrates nice adaptability to other Stochastic Differential Equation (SDE)-based models. Our code is available at this https URL.
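The model-detection step compares a reversed noise distribution against the benign prior via KL divergence. For diagonal Gaussians this has a closed form, sketched below (fitting the distributions to samples is omitted; this is an illustration, not TERD's exact estimator):

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for diagonal Gaussians,
    summed over dimensions."""
    return float(np.sum(
        0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)
    ))
```

A clean model's reversed distribution stays close to the benign prior (KL near zero), while a backdoored model's reversed trigger pushes the divergence up, giving a detection statistic to threshold.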

[AI-68] Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment Hate Speech and Anxiety Analysis

链接: https://arxiv.org/abs/2409.05292
作者: Nirmalya Thakur
关键词-EN: Public Health Emergency, Public Health, Health Emergency, Emergency of International, International Concern
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. No prior work related to social media mining has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper aims to address this research gap and makes two scientific contributions to this field. First, it presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. The dataset, available at this https URL, contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were performed. This process included classifying each post into (i) one of the sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral, (ii) hate or not hate, and (iii) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset. Second, this paper presents the results of performing sentiment analysis, hate speech analysis, and anxiety or stress analysis. The variation of the sentiment classes - fear, surprise, joy, sadness, anger, disgust, and neutral were observed to be 27.95%, 2.57%, 8.69%, 5.94%, 2.69%, 1.53%, and 50.64%, respectively. In terms of hate speech detection, 95.75% of the posts did not contain hate and the remaining 4.25% of the posts contained hate. Finally, 72.05% of the posts did not indicate any anxiety/stress, and the remaining 27.95% of the posts represented some form of anxiety/stress.

[AI-69] Seek and Solve Reasoning for Table Question Answering

链接: https://arxiv.org/abs/2409.05286
作者: Ruya Jiang,Chun Wang,Weihong Deng
关键词-EN: Table-based Question Answering, involves answering questions, answering questions based, Large Language Models, involves answering
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Table-based Question Answering (TQA) involves answering questions based on tabular data. The complexity of table structures and question logic makes this task difficult even for Large Language Models (LLMs). This paper improves TQA performance by leveraging LLMs’ reasoning capabilities. Inspired by how humans solve TQA tasks, we propose a Seek-and-Solve pipeline that instructs the LLM to first seek relevant information and then answer questions. The two stages are integrated at the reasoning level, and their Chain of Thought (CoT) paths are integrated into a coherent Seek-and-Solve CoT (SS-CoT). Furthermore, we present a compact single-stage TQA-solving prompt distilled from the pipeline. Experiments demonstrate that under In-Context Learning settings, using samples with SS-CoT paths as demonstrations, the TQA-solving prompt can effectively guide the LLM to solve complex TQA tasks, resulting in improved performance and reliability. Our results highlight the importance of properly eliciting LLMs’ reasoning capabilities in solving complex TQA tasks.

[AI-70] On the Relationship between Truth and Political Bias in Language Models

链接: https://arxiv.org/abs/2409.05283
作者: Suyash Fulay,William Brannon,Shrestha Mohanty,Cassandra Overney,Elinor Poole-Dayan,Deb Roy,Jad Kabbara
关键词-EN: Language model alignment, helpful and harmless, truthful and unbiased, research often attempts, attempts to ensure
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Language model alignment research often attempts to ensure that models are not only helpful and harmless, but also truthful and unbiased. However, optimizing these objectives simultaneously can obscure how improving one aspect might impact the others. In this work, we focus on analyzing the relationship between two concepts essential in both language model alignment and political science: *truthfulness* and *political bias*. We train reward models on various popular truthfulness datasets and subsequently evaluate their political bias. Our findings reveal that optimizing reward models for truthfulness on these datasets tends to result in a left-leaning political bias. We also find that existing open-source reward models (i.e. those trained on standard human preference datasets) already show a similar bias and that the bias is larger for larger models. These results raise important questions about both the datasets used to represent truthfulness and what language models capture about the relationship between truth and politics.

[AI-71] RotCAtt-TransUNet: Novel Deep Neural Network for Sophisticated Cardiac Segmentation

链接: https://arxiv.org/abs/2409.05280
作者: Quoc-Bao Nguyen-Le,Tuan-Hy Le,Anh-Triet Do,Quoc-Huy Trinh
关键词-EN: Cardiovascular disease, global health concern, major global health, health concern, Cardiovascular
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:Cardiovascular disease is a major global health concern, contributing significantly to global mortality. Accurately segmenting cardiac medical imaging data is crucial for reducing fatality rates associated with these conditions. However, current state-of-the-art (SOTA) neural networks, including CNN-based and Transformer-based approaches, face challenges in capturing both inter-slice connections and intra-slice details, especially in datasets featuring intricate, long-range details along the z-axis like coronary arteries. Existing methods also struggle with differentiating non-cardiac components from the myocardium, resulting in segmentation inaccuracies and the “spraying” phenomenon. To address these issues, we introduce RotCAtt-TransUNet++, a novel architecture designed for robust segmentation of intricate cardiac structures. Our approach enhances global context modeling through multiscale feature aggregation and nested skip connections in the encoder. Transformer layers facilitate capturing intra-slice interactions, while a rotatory attention mechanism handles inter-slice connectivity. A channel-wise cross-attention gate integrates multiscale information and decoder features, effectively bridging semantic gaps. Experimental results across multiple datasets demonstrate superior performance over current methods, achieving near-perfect annotation of coronary arteries and myocardium. Ablation studies confirm that our rotatory attention mechanism significantly improves segmentation accuracy by transforming embedded vectorized patches in semantic dimensional space.

[AI-72] BrainDecoder: Style-Based Visual Decoding of EEG Signals

链接: https://arxiv.org/abs/2409.05279
作者: Minsuk Choi,Hiroshi Ishikawa
关键词-EN: offers valuable insights, visual stimuli, Decoding neural representations, visual decoding, offers valuable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Decoding neural representations of visual stimuli from electroencephalography (EEG) offers valuable insights into brain activity and cognition. Recent advancements in deep learning have significantly enhanced the field of visual decoding of EEG, primarily focusing on reconstructing the semantic content of visual stimuli. In this paper, we present a novel visual decoding pipeline that, in addition to recovering the content, emphasizes the reconstruction of the style, such as color and texture, of images viewed by the subject. Unlike previous methods, this "style-based" approach learns in the CLIP spaces of image and text separately, facilitating a more nuanced extraction of information from EEG signals. We also use simpler captions for text alignment than previously employed, which we find work better. Both quantitative and qualitative evaluations show that our method better preserves the style of visual stimuli and extracts more fine-grained semantic information from neural signals. Notably, it achieves significant improvements in quantitative results and sets a new state-of-the-art on the popular Brain2Image dataset.

[AI-73] Disentangled Representations for Short-Term and Long-Term Person Re-Identification

链接: https://arxiv.org/abs/2409.05277
作者: Chanho Eom,Wonkyung Lee,Geon Lee,Bumsub Ham
关键词-EN: person, retrieving person images, features, unrelated features, person images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:1910.12003

点击查看摘要

Abstract:We address the problem of person re-identification (reID), that is, retrieving person images from a large dataset, given a query image of the person of interest. A key challenge is to learn person representations robust to intra-class variations, as different persons could have the same attribute, and persons’ appearances look different, e.g., with viewpoint changes. Recent reID methods focus on learning person features discriminative only for a particular factor of variations (e.g., human pose), which also requires corresponding supervisory signals (e.g., pose annotations). To tackle this problem, we propose to factorize person images into identity-related and unrelated features. Identity-related features contain information useful for specifying a particular person (e.g., clothing), while identity-unrelated ones hold other factors (e.g., human pose). To this end, we propose a new generative adversarial network, dubbed identity shuffle GAN (IS-GAN). It disentangles identity-related and unrelated features from person images through an identity-shuffling technique that exploits identification labels alone without any auxiliary supervisory signals. We restrict the distribution of identity-unrelated features or encourage the identity-related and unrelated features to be uncorrelated, facilitating the disentanglement process. Experimental results validate the effectiveness of IS-GAN, showing state-of-the-art performance on standard reID benchmarks, including Market-1501, CUHK03, and DukeMTMC-reID. We further demonstrate the advantages of disentangling person representations on a long-term reID task, setting a new state of the art on a Celeb-reID dataset.

[AI-74] Learning Submodular Sequencing from Samples

链接: https://arxiv.org/abs/2409.05265
作者: Jing Yuan,Shaojie Tang
关键词-EN: sequential submodular maximization, paper addresses, addresses the problem, problem of sequential, optimize some composite
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of sequential submodular maximization: selecting and ranking items in a sequence to optimize some composite submodular function. In contrast to most of the previous works, which assume access to the utility function, we assume that we are given only a set of samples. Each sample includes a random sequence of items and its associated utility. We present an algorithm that, given polynomially many samples drawn from a two-stage uniform distribution, achieves an approximation ratio dependent on the curvature of individual submodular functions. Our results apply in a wide variety of real-world scenarios, such as ranking products in online retail platforms, where complete knowledge of the utility function is often impossible to obtain. Our algorithm gives an empirically useful solution in such contexts, thus proving that limited data can be of great use in sequencing tasks. From a technical perspective, our results extend prior work on "optimization from samples" by generalizing from optimizing a set function to a sequence-dependent function.
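As background for the objective class this paper sequences, the classic greedy rule for monotone submodular maximization can be sketched on a set-cover utility. This is NOT the paper's sample-based algorithm (which must work without direct access to the utility function); it is only the standard full-information baseline that the "optimization from samples" line of work generalizes.

```python
# Classic greedy for monotone submodular maximization, illustrated on
# set cover. Each step picks the set with the largest marginal gain.

def coverage(selected, sets):
    """Submodular utility: number of elements covered by the chosen sets."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)

def greedy(sets, k):
    selected = []
    for _ in range(k):
        best = max(
            (i for i in range(len(sets)) if i not in selected),
            key=lambda i: coverage(selected + [i], sets) - coverage(selected, sets),
        )
        selected.append(best)
    return selected

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
picked = greedy(sets, k=2)
print(picked, coverage(picked, sets))  # two sets suffice to cover all 6 elements
```

For monotone submodular functions this greedy rule is the standard (1 - 1/e)-approximation; the paper's contribution is achieving a curvature-dependent guarantee when only sampled sequence-utility pairs are available.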

[AI-75] Towards Automated Machine Learning Research

链接: https://arxiv.org/abs/2409.05258
作者: Shervin Ardeshir
关键词-EN: Large Language Models, Large Language, automating incremental advances, machine learning research, facilitated by Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper explores a top-down approach to automating incremental advances in machine learning research through component-level innovation, facilitated by Large Language Models (LLMs). Our framework systematically generates novel components, validates their feasibility, and evaluates their performance against existing baselines. A key distinction of this approach lies in how these novel components are generated. Unlike traditional AutoML and NAS methods, which often rely on a bottom-up combinatorial search over predefined, hardcoded base components, our method leverages the cross-domain knowledge embedded in LLMs to propose new components that may not be confined to any hard-coded predefined set. By incorporating a reward model to prioritize promising hypotheses, we aim to improve the efficiency of the hypothesis generation and evaluation process. We hope this approach offers a new avenue for exploration and contributes to the ongoing dialogue in the field.

[AI-76] FedFT: Improving Communication Performance for Federated Learning with Frequency Space Transformation

链接: https://arxiv.org/abs/2409.05242
作者: Chamath Palihawadana,Nirmalie Wiratunga,Anjana Wijekoon,Harsha Kalutarage
关键词-EN: widely recognised research, recognised research problem, recent work focused, Federated Learning, model parameters
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Communication efficiency is a widely recognised research problem in Federated Learning (FL), with recent work focused on developing techniques for efficient compression, distribution and aggregation of model parameters between clients and the server. Particularly within distributed systems, it is important to balance the need for computational cost and communication efficiency. However, existing methods are often constrained to specific applications and are less generalisable. In this paper, we introduce FedFT (federated frequency-space transformation), a simple yet effective methodology for communicating model parameters in a FL setting. FedFT uses Discrete Cosine Transform (DCT) to represent model parameters in frequency space, enabling efficient compression and reducing communication overhead. FedFT is compatible with various existing FL methodologies and neural architectures, and its linear property eliminates the need for multiple transformations during federated aggregation. This methodology is vital for distributed solutions, tackling essential challenges like data privacy, interoperability, and energy efficiency inherent to these environments. We demonstrate the generalisability of the FedFT methodology on four datasets using comparative studies with three state-of-the-art FL baselines (FedAvg, FedProx, FedSim). Our results demonstrate that using FedFT to represent the differences in model parameters between communication rounds in frequency space results in a more compact representation compared to representing the entire model in frequency space. This leads to a reduction in communication overhead, while keeping accuracy levels comparable and in some cases even improving it. Our results suggest that this reduction can range from 5% to 30% per client, depending on dataset.
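The frequency-space idea behind FedFT can be sketched in a few lines: represent a parameter (difference) vector in DCT space and transmit only the low-frequency coefficients. The orthonormal DCT-II construction and the truncation ratio below are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (its inverse is its transpose)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0, :] /= np.sqrt(2.0)
    return C

def compress(params, keep_ratio=0.5):
    n = len(params)
    coeffs = dct_matrix(n) @ params
    kept = int(n * keep_ratio)
    return coeffs[:kept]              # transmit only the low frequencies

def decompress(coeffs, n):
    full = np.zeros(n)
    full[: len(coeffs)] = coeffs
    return dct_matrix(n).T @ full     # inverse transform (orthonormal basis)

delta = np.sin(np.linspace(0, 3, 64))           # a smooth parameter update
coeffs = compress(delta, keep_ratio=0.25)       # 16 of 64 coefficients
recon = decompress(coeffs, len(delta))
print(len(coeffs), float(np.max(np.abs(recon - delta))))
```

Because the DCT is linear, compressed updates from different clients can be aggregated directly in frequency space, which is the property the paper exploits to avoid repeated transformations during federated aggregation.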

[AI-77] Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study ECML KDD2024

链接: https://arxiv.org/abs/2409.05215
作者: Emmanouil Panagiotou,Arjun Roy,Eirini Ntoutsi
关键词-EN: Machine Learning, data-driven nature, group imbalances, class and group, group
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at the ECML PKDD 2024, 4th Workshop on Bias and Fairness in AI

点击查看摘要

Abstract:Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness. Although class and group imbalances commonly coincide in real-world tabular datasets, limited methods address this scenario. While most methods use oversampling techniques, like interpolation, to mitigate imbalances, recent advancements in synthetic tabular data generation offer promise but have not been adequately explored for this purpose. To this end, this paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models for synthetic tabular data generation and various sampling strategies. Experimental results on four datasets, demonstrate the effectiveness of generative models for bias mitigation, creating opportunities for further exploration in this direction.
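The interpolation-style oversampling that the paper compares generative models against can be sketched as follows: new minority samples are drawn on segments between random pairs of existing minority samples. This is a simplified SMOTE-like scheme, not any specific model from the paper's benchmark.

```python
import numpy as np

def interpolate_oversample(X_min, n_new, rng):
    """Generate n_new synthetic points by convex interpolation between
    random pairs of minority-class samples."""
    idx_a = rng.integers(0, len(X_min), size=n_new)
    idx_b = rng.integers(0, len(X_min), size=n_new)
    u = rng.random((n_new, 1))        # interpolation weights in [0, 1]
    return X_min[idx_a] + u * (X_min[idx_b] - X_min[idx_a])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 3))      # minority-class samples
X_new = interpolate_oversample(X_min, n_new=40, rng=rng)
print(X_new.shape)  # (40, 3)
```

Note the limitation the paper highlights: interpolation only ever produces points inside the convex hull of existing minority samples, whereas generative models can in principle model the full minority (and protected-group) distribution.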

[AI-78] ICML Topological Deep Learning Challenge 2024: Beyond the Graph Domain ICML2024

链接: https://arxiv.org/abs/2409.05211
作者: Guillermo Bernárdez,Lev Telyatnikov,Marco Montagna,Federica Baccini,Mathilde Papillon,Miquel Ferriol-Galmés,Mustafa Hajij,Theodore Papamarkou,Maria Sofia Bucarelli,Olga Zaghen,Johan Mathe,Audun Myers,Scott Mahan,Hansen Lillemark,Sharvaree Vadgama,Erik Bekkers,Tim Doster,Tegan Emerson,Henry Kvinge,Katrina Agate,Nesreen K Ahmed,Pengfei Bai,Michael Banf,Claudio Battiloro,Maxim Beketov,Paul Bogdan,Martin Carrasco,Andrea Cavallo,Yun Young Choi,George Dasoulas,Matouš Elphick,Giordan Escalona,Dominik Filipiak,Halley Fritze,Thomas Gebhart,Manel Gil-Sorribes,Salvish Goomanee,Victor Guallar,Liliya Imasheva,Andrei Irimia,Hongwei Jin,Graham Johnson,Nikos Kanakaris,Boshko Koloski,Veljko Kovač,Manuel Lecha,Minho Lee,Pierrick Leroy,Theodore Long,German Magai,Alvaro Martinez,Marissa Masden,Sebastian Mežnar,Bertran Miquel-Oliver,Alexis Molina,Alexander Nikitin,Marco Nurisso,Matt Piekenbrock,Yu Qin,Patryk Rygiel,Alessandro Salatiello,Max Schattauer,Pavel Snopov,Julian Suk,Valentina Sánchez,Mauricio Tec,Francesco Vaccarino,Jonas Verhellen,Frederic Wantiez,Alexander Weers,Patrik Zajec,Blaž Škrlj,Nina Miolane
关键词-EN: Geometry-grounded Representation Learning, ICML Topological Deep, Topological Deep Learning, Deep Learning Challenge, ELLIS Workshop
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Proceedings of the Geometry-grounded Representation Learning and Generative Modeling Workshop (GRaM) at ICML 2024

点击查看摘要

Abstract:This paper describes the 2nd edition of the ICML Topological Deep Learning Challenge that was hosted within the ICML 2024 ELLIS Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM). The challenge focused on the problem of representing data in different discrete topological domains in order to bridge the gap between Topological Deep Learning (TDL) and other types of structured datasets (e.g. point clouds, graphs). Specifically, participants were asked to design and implement topological liftings, i.e. mappings between different data structures and topological domains --like hypergraphs, or simplicial/cell/combinatorial complexes. The challenge received 52 submissions satisfying all the requirements. This paper introduces the main scope of the challenge, and summarizes the main results and findings.
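A minimal example of a topological lifting in the challenge's sense is mapping a graph to its clique complex, promoting every triangle to a 2-cell. Actual submissions targeted richer domains (hypergraphs, simplicial/cell/combinatorial complexes); this sketch shows only the simplest instance.

```python
from itertools import combinations

def clique_complex_2(nodes, edges):
    """Lift a graph to its 2-dimensional clique complex: every 3-clique
    (triangle) becomes a 2-cell."""
    edge_set = {frozenset(e) for e in edges}
    triangles = [
        frozenset(t)
        for t in combinations(nodes, 3)
        if all(frozenset(p) in edge_set for p in combinations(t, 2))
    ]
    return {"0-cells": set(nodes), "1-cells": edge_set, "2-cells": triangles}

nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # exactly one triangle: {0, 1, 2}
complex_ = clique_complex_2(nodes, edges)
print(complex_["2-cells"])  # [frozenset({0, 1, 2})]
```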

[AI-79] Influence-based Attributions can be Manipulated

链接: https://arxiv.org/abs/2409.05208
作者: Chhavi Yadav,Ruihan Wu,Kamalika Chaudhuri
关键词-EN: Influence Functions, valuation and fairness, standard tool, tool for attributing, attributing predictions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be systematically tampered with by an adversary. We show that this is indeed possible and provide efficient attacks with backward-friendly implementations. Our work raises questions on the reliability of influence-based attributions under adversarial circumstances.
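The influence-function estimate this paper attacks is the standard Koh & Liang-style quantity: the effect of upweighting training point z on the loss at z_test is approximately -∇L(z_test)ᵀ H⁻¹ ∇L(z). A sketch for ridge-regularized linear regression, where everything is closed-form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Fit: theta = argmin (1/n)||X theta - y||^2 + lam ||theta||^2
theta = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
H = 2 * X.T @ X / n + 2 * lam * np.eye(d)   # Hessian of the training objective

def grad_loss(x, t):
    """Gradient of the (unregularized) squared loss at a single point."""
    return 2 * (x @ theta - t) * x

def influence(x_train, y_train, x_test, y_test):
    """Approximate effect of upweighting (x_train, y_train) on test loss."""
    return -grad_loss(x_test, y_test) @ np.linalg.solve(H, grad_loss(x_train, y_train))

# Sanity check: upweighting a training point identical to the test point
# must reduce its loss, so the influence is non-positive (H is PSD).
print(influence(X[0], y[0], X[0], y[0]) <= 0)  # True
```

The attacks in the paper perturb the model or data so that these attribution scores are manipulated while predictions stay plausible; the formula itself is the standard one.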

[AI-80] SEF: A Method for Computing Prediction Intervals by Shifting the Error Function in Neural Networks

链接: https://arxiv.org/abs/2409.05206
作者: E. V. Aretos,D. G. Sotiropoulos
关键词-EN: Neural Networks, neural network predictions, today era, scientific fields, Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: The paper has been accepted at the 2024 International Conference on Computer and Applications (ICCA24), Cairo, Egypt, December 17-19, 2024. this https URL

点击查看摘要

Abstract:In today’s era, Neural Networks (NN) are applied in various scientific fields such as robotics, medicine, engineering, etc. However, the predictions of neural networks themselves contain a degree of uncertainty that must always be taken into account before any decision is made. This is why many researchers have focused on developing different ways to quantify the uncertainty of neural network predictions. Some of these methods are based on generating prediction intervals (PI) via neural networks for the requested target values. The SEF (Shifting the Error Function) method presented in this paper is a new method that belongs to this category of methods. The proposed approach involves training a single neural network three times, thus generating an estimate along with the corresponding upper and lower bounds for a given problem. A pivotal aspect of the method is the calculation of a parameter from the initial network’s estimates, which is then integrated into the loss functions of the other two networks. This innovative process effectively produces PIs, resulting in a robust and efficient technique for uncertainty quantification. To evaluate the effectiveness of our method, a comparison in terms of successful PI generation between the SEF, PI3NN and PIVEN methods was made using two synthetic datasets.
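As a very coarse stand-in for the SEF idea, one can derive a shift parameter from an initial model's residuals and use it to form upper/lower bounds. SEF itself folds the shift into the loss functions of two further networks; the sketch below instead applies a residual-quantile shift directly to a fitted linear model, purely to illustrate the "initial estimate plus shifted bounds" structure, and is not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.5 * x + 0.3 * rng.normal(size=200)

# Initial estimate: least-squares line.
a, b = np.polyfit(x, y, 1)
pred = a * x + b

# Shift parameter computed from the initial estimate's residuals
# (empirical 90th percentile of the absolute residuals).
res = np.sort(np.abs(y - pred))
shift = res[int(np.ceil(0.9 * len(res))) - 1]

lower, upper = pred - shift, pred + shift
coverage = np.mean((y >= lower) & (y <= upper))
print(round(float(coverage), 2))  # at least 0.90 by construction
```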

[AI-81] A Survey on Mixup Augmentations and Beyond

链接: https://arxiv.org/abs/2409.05202
作者: Xin Jin,Hongyu Zhu,Siyuan Li,Zedong Wang,Zicheng Liu,Chang Yu,Huafeng Qin,Stan Z. Li
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, achieved thrilling breakthroughs, garnered increasing attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint V1 with 27 pages main text. Online project at this https URL

点击查看摘要

Abstract:As Deep Neural Networks have achieved thrilling breakthroughs in the past decade, data augmentations have garnered increasing attention as regularization techniques when massive labeled data are unavailable. Among existing augmentations, Mixup and relevant data-mixing methods that convexly combine selected samples and the corresponding labels are widely adopted because they yield high performances by generating data-dependent virtual data while easily migrating to various domains. This survey presents a comprehensive review of foundational mixup methods and their applications. We first elaborate on the training pipeline with mixup augmentations as a unified framework containing modules. A reformulated framework could contain various mixup methods and give intuitive operational procedures. Then, we systematically investigate the applications of mixup augmentations on vision downstream tasks, various data modalities, and some analysis & theorems of mixup. Meanwhile, we conclude the current status and limitations of mixup research and point out further work for effective and efficient mixup augmentations. This survey can provide researchers with the current state of the art in mixup methods and provide some insights and guidance roles in the mixup arena. An online project with this survey is available at this https URL.
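The foundational Mixup rule the survey builds on is small enough to state directly: a virtual training pair is a convex combination of two real pairs, with the mixing weight drawn from a Beta(α, α) distribution.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Classic Mixup: convexly combine two samples and their (one-hot) labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2     # mix inputs
    y = lam * y1 + (1 - lam) * y2     # mix labels with the same weight
    return x, y, lam

x1, y1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x2, y2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])
x, y, lam = mixup(x1, y1, x2, y2, rng=np.random.default_rng(0))
print(x, y)  # both equal [lam, 1 - lam]
```

The survey's unified framework varies exactly these modules: how the pair is selected, where the mixing happens (input, feature, or label space), and how the weight is generated.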

[AI-82] Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?

链接: https://arxiv.org/abs/2409.05197
作者: Neeladri Bhuiya,Viktor Schlegel,Stefan Winkler
关键词-EN: Large Language Models, possessing scientific knowledge, Large Language, multi-hop reasoning, ranging from reading
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 16 pages, 3 figures

点击查看摘要

Abstract:State-of-the-art Large Language Models (LLMs) are accredited with an increasing number of different capabilities, ranging from reading comprehension, over advanced mathematical and reasoning skills to possessing scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given the concerns with the presence of simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we set out to investigate, whether LLMs are prone to exploiting such simplifying cues. We find evidence that they indeed circumvent the requirement to perform multi-hop reasoning, but they do so in more subtle ways than what was reported about their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark, by generating seemingly plausible multi-hop reasoning chains, which ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs, and find that their performance to perform multi-hop reasoning is affected, as indicated by up to 45% relative decrease in F1 score when presented with such seemingly plausible alternatives. We conduct a deeper analysis and find evidence that while LLMs tend to ignore misleading lexical cues, misleading reasoning paths indeed present a significant challenge.

[AI-83] Insights from Benchmarking Frontier Language Models on Web App Code Generation

链接: https://arxiv.org/abs/2409.05177
作者: Yi Cui
关键词-EN: frontier large language, paper presents insights, test suite designed, generate web application, large language models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models possess similar underlying knowledge, their performance is differentiated by the frequency of mistakes they make. By analyzing lines of code (LOC) and failure distributions, we find that writing correct code is more complex than generating incorrect code. Furthermore, prompt engineering shows limited efficacy in reducing errors beyond specific cases. These findings suggest that further advancements in coding LLM should emphasize on model reliability and mistake minimization.

[AI-84] OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs

链接: https://arxiv.org/abs/2409.05152
作者: Jintian Zhang,Cheng Peng,Mengshu Sun,Xiang Chen,Lei Liang,Zhiqiang Zhang,Jun Zhou,Huajun Chen,Ningyu Zhang
关键词-EN: Large Language Models, Language Models, Large Language, advancements in Large, directly handling retrieval
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Work in progress; code is available at this https URL

点击查看摘要

Abstract:Despite the recent advancements in Large Language Models (LLMs), which have significantly enhanced the generative capabilities for various NLP tasks, LLMs still face limitations in directly handling retrieval tasks. However, many practical applications demand the seamless integration of both retrieval and generation. This paper introduces a novel and efficient One-pass Generation and retrieval framework (OneGen), designed to improve LLMs’ performance on tasks that require both generation and retrieval. The proposed framework bridges the traditionally separate training approaches for generation and retrieval by incorporating retrieval tokens generated autoregressively. This enables a single LLM to handle both tasks simultaneously in a unified forward pass. We conduct experiments on two distinct types of composite tasks, RAG and Entity Linking, to validate the pluggability, effectiveness, and efficiency of OneGen in training and inference. Furthermore, our results show that integrating generation and retrieval within the same context preserves the generative capabilities of LLMs while improving retrieval performance. To the best of our knowledge, OneGen is the first to enable LLMs to conduct vector retrieval during the generation.

[AI-85] EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction

链接: https://arxiv.org/abs/2409.05105
作者: Lei Sheng,Shuai-Shuai Xu
关键词-EN: Chinese Spelling Correction, correct spelling errors, Spelling Correction, Chinese sentences caused, Chinese Spelling
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 18 pages, 2 figures

点击查看摘要

Abstract:Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. While current CSC models integrate pinyin or glyph features and have shown significant progress,they still face challenges when dealing with sentences containing multiple typos and are susceptible to overcorrection in real-world scenarios. In contrast to existing model-centric approaches, we propose two data augmentation methods to address these limitations. Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos. Subsequently, we employ different training processes to select the optimal model. Experimental evaluations on the SIGHAN benchmarks demonstrate the superiority of our approach over most existing models, achieving state-of-the-art performance on the SIGHAN15 test set.
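The paper's first augmentation, splitting long sentences into shorter ones so each training example carries fewer typos, can be sketched with a punctuation-based splitter. The punctuation set and length threshold below are our own illustrative choices (the example sentence deliberately contains CSC-style typos such as 天汽/公圆).

```python
import re

def split_long_sentence(sentence, max_len=10):
    """Split a long sentence into clauses at Chinese/ASCII punctuation,
    keeping each delimiter attached to its clause."""
    if len(sentence) <= max_len:
        return [sentence]
    parts = re.split(r"(?<=[,;,;、])", sentence)
    return [p for p in parts if p]

sent = "今天天汽很好,我们去公圆散步,然后回家吃饭"
print(split_long_sentence(sent))
```

Joining the pieces recovers the original sentence, so the character-level correction labels can be carried over to each shorter example unchanged.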

[AI-86] Adaptive k-nearest neighbor classifier based on the local estimation of the shape operator

链接: https://arxiv.org/abs/2409.05084
作者: Alexandre Luís Magalhães Levada,Frank Nielsen,Michel Ferreira Cardia Haddad
关键词-EN: nonparametric classification, local Gaussian curvature, nearest neighbor, local, popular methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:The k-nearest neighbor (k-NN) algorithm is one of the most popular methods for nonparametric classification. However, a relevant limitation concerns the definition of the number of neighbors k. This parameter exerts a direct impact on several properties of the classifier, such as the bias-variance tradeoff, smoothness of decision boundaries, robustness to noise, and class imbalance handling. In the present paper, we introduce a new adaptive k-nearest neighbours (kK-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size. The rationale is that points with low curvature could have larger neighborhoods (locally, the tangent space approximates well the underlying data shape), whereas points with high curvature could have smaller neighborhoods (locally, the tangent space is a loose approximation). We estimate the local Gaussian curvature by computing an approximation to the local shape operator in terms of the local covariance matrix as well as the local Hessian matrix. Results on many real-world datasets indicate that the new kK-NN algorithm yields superior balanced accuracy compared to the established k-NN method and also another adaptive k-NN algorithm. This is particularly evident when the number of samples in the training data is limited, suggesting that the kK-NN is capable of learning more discriminant functions with less data in many relevant cases.
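The core intuition, flat neighborhoods (low curvature) can afford a larger k than sharply curved ones, can be illustrated with a cheap surrogate for the paper's shape-operator estimate: the smallest-eigenvalue share of the local covariance matrix, which is 0 for a perfectly flat 2-D neighborhood. Both the flatness proxy and the linear k-schedule below are our own simplifications, not the paper's estimator.

```python
import numpy as np

def flatness_score(neighborhood):
    """Curvature proxy: smallest eigenvalue's share of the local covariance
    (in [0, 0.5] for 2-D data; 0 means the neighborhood is perfectly flat)."""
    eig = np.linalg.eigvalsh(np.cov(neighborhood.T))
    return eig[0] / eig.sum()

def adaptive_k(score, k_min=3, k_max=15):
    # High curvature (large score) -> small k; flat (score near 0) -> large k.
    return int(round(k_max - (k_max - k_min) * score / 0.5))

line = np.column_stack([np.linspace(0, 1, 9), np.zeros(9)])   # flat neighborhood
t = np.linspace(0, np.pi, 9)
arc = np.column_stack([np.cos(t), np.sin(t)])                 # curved neighborhood
print(adaptive_k(flatness_score(line)), adaptive_k(flatness_score(arc)))  # 15 10
```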

[AI-87] PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions

链接: https://arxiv.org/abs/2409.05076
作者: Yudong Zhang,Ruobing Xie,Jiansheng Chen,Xingwu Sun,Yu Wang
关键词-EN: Large Vision-Language Models, powerful multimodal capabilities, Large Vision-Language, Vision-Language Models, multimodal capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ACM Multimedia 2024 BNI track (Oral)

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated their powerful multimodal capabilities. However, they also face serious safety problems, as adversaries can induce robustness issues in LVLMs through the use of well-designed adversarial examples. Therefore, LVLMs are in urgent need of detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns for clean images when presented with probe questions. We propose an unconventional method named PIP, which utilizes the attention patterns of one randomly selected irrelevant probe question (e.g., “Is there a clock?”) to distinguish adversarial examples from clean examples. Regardless of the image to be tested and its corresponding question, PIP only needs to perform one additional inference of the image to be tested and the probe question, and then achieves successful detection of adversarial examples. Even under black-box attacks and open dataset scenarios, our PIP, coupled with a simple SVM, still achieves more than 98% recall and a precision of over 90%. Our PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding and introspection within LVLMs. The code is available at this https URL.
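PIP's detection idea can be sketched as follows: clean images produce regular attention patterns for an irrelevant probe question, so a large deviation from the typical clean pattern flags a suspected adversarial example. The paper pairs the attention features with an SVM; the toy below substitutes a simple cosine-similarity threshold, and all "attention" vectors are synthetic stand-ins.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Reference pattern: mean probe-question attention over synthetic clean images.
rng = np.random.default_rng(0)
clean_patterns = rng.normal(loc=1.0, scale=0.05, size=(20, 8))
reference = clean_patterns.mean(axis=0)

def is_adversarial(attention, threshold=0.9):
    """Flag inputs whose probe-question attention deviates from the clean norm."""
    return cosine(attention, reference) < threshold

clean = np.full(8, 1.0)                                       # regular pattern
adv = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)     # irregular pattern
print(is_adversarial(clean), is_adversarial(adv))  # False True
```

The appeal of the method is cost: only one extra forward pass (image plus probe question) is needed per tested input, regardless of the actual question being asked.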

[AI-88] Dynamic Demand Management for Parcel Lockers

链接: https://arxiv.org/abs/2409.05061
作者: Daniela Sailer,Robert Klein,Claudius Steinhardt
关键词-EN: parcel delivery landscape, cost-efficient last mile, sustainable and cost-efficient, gained a firm, firm foothold
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In pursuit of a more sustainable and cost-efficient last mile, parcel lockers have gained a firm foothold in the parcel delivery landscape. To fully exploit their potential and simultaneously ensure customer satisfaction, successful management of the locker’s limited capacity is crucial. This is challenging as future delivery requests and pickup times are stochastic from the provider’s perspective. In response, we propose to dynamically control whether the locker is presented as an available delivery option to each incoming customer with the goal of maximizing the number of served requests weighted by their priority. Additionally, we take different compartment sizes into account, which entails a second type of decision as parcels scheduled for delivery must be allocated. We formalize the problem as an infinite-horizon sequential decision problem and find that exact methods are intractable due to the curses of dimensionality. In light of this, we develop a solution framework that orchestrates multiple algorithmic techniques rooted in Sequential Decision Analytics and Reinforcement Learning, namely cost function approximation and an offline trained parametric value function approximation together with a truncated online rollout. Our innovative approach to combine these techniques enables us to address the strong interrelations between the two decision types. As a general methodological contribution, we enhance the training of our value function approximation with a modified version of experience replay that enforces structure in the value function. Our computational study shows that our method outperforms a myopic benchmark by 13.7% and an industry-inspired policy by 12.6%.

[AI-89] Deep Generic Representations for Domain-Generalized Anomalous Sound Detection

链接: https://arxiv.org/abs/2409.05035
作者: Phurich Saengthong,Takahiro Shinozaki
关键词-EN: anomalous sound detection, reliable anomalous sound, Developing a reliable, system requires robustness, sound detection
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Developing a reliable anomalous sound detection (ASD) system requires robustness to noise, adaptation to domain shifts, and effective performance with limited training data. Current leading methods rely on extensive labeled data for each target machine type to train feature extractors using Outlier-Exposure (OE) techniques, yet their performance on the target domain remains sub-optimal. In this paper, we present GenRep, which utilizes generic feature representations from a robust, large-scale pre-trained feature extractor combined with kNN for domain-generalized ASD, without the need for fine-tuning. GenRep incorporates MemMixup, a simple approach for augmenting the target memory bank using nearest source samples, paired with a domain normalization technique to address the imbalance between source and target domains. GenRep outperforms the best OE-based approach without the need for labeled data, achieving an Official Score of 73.79% on the DCASE2023T2 Eval set, and demonstrates robustness under limited data scenarios. The code is available open-source.
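The kNN scoring and MemMixup augmentation can be sketched as follows, assuming embeddings from the pre-trained extractor are given as NumPy arrays; the interpolation weight `lam` and the neighbor count `k` are hypothetical choices, and the paper's domain-normalization step is omitted.

```python
import numpy as np

def memmixup(target_bank, source_bank, lam=0.5):
    """MemMixup-style augmentation (sketch): pair each target embedding with
    its nearest source embedding and interpolate, growing the target bank."""
    mixed = []
    for t in target_bank:
        s = source_bank[np.argmin(np.linalg.norm(source_bank - t, axis=1))]
        mixed.append(lam * t + (1 - lam) * s)
    return np.vstack([target_bank, np.array(mixed)])

def knn_anomaly_score(x, memory_bank, k=2):
    """kNN anomaly score (sketch): mean distance of an embedding to its
    k nearest neighbors in the (augmented) memory bank of normal sounds."""
    d = np.sort(np.linalg.norm(memory_bank - x, axis=1))[:k]
    return float(d.mean())
```

An embedding close to the memory bank scores low (normal); a far-away embedding scores high (anomalous).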

[AI-90] A Survey on Diffusion Models for Recommender Systems

链接: https://arxiv.org/abs/2409.05033
作者: Jianghao Lin,Jiaqi Liu,Jiachen Zhu,Yunjia Xi,Chengkai Liu,Yangtian Zhang,Yong Yu,Weinan Zhang
关键词-EN: inadequate collaborative signals, made significant strides, limited generalization performance, generalization performance caused, diffusion models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Under Review

点击查看摘要

Abstract:While traditional recommendation techniques have made significant strides in the past decades, they still suffer from limited generalization performance caused by factors like inadequate collaborative signals, weak latent representations, and noisy data. In response, diffusion models (DMs) have emerged as promising solutions for recommender systems due to their robust generative capabilities, solid theoretical foundations, and improved training stability. To this end, in this paper, we present the first comprehensive survey on diffusion models for recommendation, and draw a bird’s-eye view from the perspective of the whole pipeline in real-world recommender systems. We systematically categorize existing research works into three primary domains: (1) diffusion for data engineering encoding, focusing on data augmentation and representation enhancement; (2) diffusion as recommender models, employing diffusion models to directly estimate user preferences and rank items; and (3) diffusion for content presentation, utilizing diffusion models to generate personalized content such as fashion and advertisement creatives. Our taxonomy highlights the unique strengths of diffusion models in capturing complex data distributions and generating high-quality, diverse samples that closely align with user preferences. We also summarize the core characteristics of the adapting diffusion models for recommendation, and further identify key areas for future exploration, which helps establish a roadmap for researchers and practitioners seeking to advance recommender systems through the innovative application of diffusion models. To further facilitate the research community of recommender systems based on diffusion models, we actively maintain a GitHub repository for papers and other related resources in this rising direction this https URL.

[AI-91] Sequential Recommendation via Adaptive Robust Attention with Multi-dimensional Embeddings

链接: https://arxiv.org/abs/2409.05022
作者: Linsey Pang,Amir Hossein Raffiee,Wei Liu,Keld Lundgaard
关键词-EN: Abstract, Sequential recommendation, Sequential, achieved, significant accuracy boost
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequential recommendation models have achieved state-of-the-art performance using self-attention mechanism. It has since been found that moving beyond only using item ID and positional embeddings leads to a significant accuracy boost when predicting the next item. In recent literature, it was reported that a multi-dimensional kernel embedding with temporal contextual kernels to capture users’ diverse behavioral patterns results in a substantial performance improvement. In this study, we further improve the sequential recommender model’s robustness and generalization by introducing a mix-attention mechanism with a layer-wise noise injection (LNI) regularization. We refer to our proposed model as adaptive robust sequential recommendation framework (ADRRec), and demonstrate through extensive experiments that our model outperforms existing self-attention architectures.
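Layer-wise noise injection of the kind the abstract mentions can be illustrated with a single dense layer; this is a generic sketch, not ADRRec's architecture — the noise scale `sigma` and the ReLU nonlinearity are assumptions.

```python
import numpy as np

def layer_with_noise(x, W, sigma=0.1, training=True, rng=None):
    """Layer-wise noise injection (sketch): add Gaussian noise to a layer's
    pre-activation during training only, as a regularizer akin to the LNI
    scheme the paper pairs with its mix-attention mechanism."""
    rng = rng or np.random.default_rng(0)
    h = x @ W
    if training:
        h = h + rng.normal(0.0, sigma, h.shape)  # injected only at train time
    return np.maximum(h, 0.0)                    # ReLU
```

At inference (`training=False`) the layer is deterministic, so predictions are unaffected by the regularizer.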

[AI-92] Audio-Guided Fusion Techniques for Multimodal Emotion Analysis

链接: https://arxiv.org/abs/2409.05007
作者: Pujin Shi,Fei Gao
关键词-EN: semi-supervised learning track, text feature extractors, semi-supervised learning, tasks, we fine-tuned video, sentiment classification tasks, we
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, in order to enhance the performance of the feature extractor on sentiment classification tasks, we fine-tuned video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large, showing superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning by using high-confidence unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. Therefore, we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.

[AI-93] A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement

链接: https://arxiv.org/abs/2409.05001
作者: Huan Zhang,Wei Cheng,Yuhan Wu,Wei Hu
关键词-EN: Large language models, achieved impressive performance, Large language, language models, achieved impressive
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted in the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive performance on code generation. Although prior studies enhanced LLMs with prompting techniques and code refinement, they still struggle with complex programming problems due to rigid solution plans. In this paper, we draw on pair programming practices to propose PairCoder, a novel LLM-based framework for code generation. PairCoder incorporates two collaborative LLM agents, namely a Navigator agent for high-level planning and a Driver agent for specific implementation. The Navigator is responsible for proposing promising solution plans, selecting the current optimal plan, and directing the next iteration round based on execution feedback. The Driver follows the guidance of Navigator to undertake initial code generation, code testing, and refinement. This interleaved and iterative workflow involves multi-plan exploration and feedback-based refinement, which mimics the collaboration of pair programmers. We evaluate PairCoder with both open-source and closed-source LLMs on various code generation benchmarks. Extensive experimental results demonstrate the superior accuracy of PairCoder, achieving relative pass@1 improvements of 12.00%-162.43% compared to prompting LLMs directly.
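The Navigator/Driver interplay can be reduced to a toy loop: the Navigator selects among candidate plans, the Driver implements and tests, and execution feedback decides whether to iterate. Everything here (the plan list, `implement`, `run_tests`) stands in for what are LLM calls in PairCoder itself.

```python
def pair_coder_loop(plans, implement, run_tests, max_rounds=3):
    """Toy PairCoder-style loop (sketch).
    plans: candidate solution plans (Navigator's pool);
    implement(plan) -> candidate code (Driver);
    run_tests(code) -> (passed, feedback) (Driver's testing step)."""
    history = []
    for round_idx in range(max_rounds):
        # Navigator: pick the next most promising unexplored plan
        plan = plans[min(round_idx, len(plans) - 1)]
        code = implement(plan)                # Driver: initial generation
        passed, feedback = run_tests(code)    # Driver: testing
        history.append((plan, passed, feedback))
        if passed:                            # feedback-driven stop/iterate
            return code, history
    return None, history
```

A usage example with two plans — a buggy one and a correct one — shows the loop abandoning the first plan after failing feedback.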

[AI-94] Enhancing Convolutional Neural Networks with Higher-Order Numerical Difference Methods

链接: https://arxiv.org/abs/2409.04977
作者: Qi Wang,Zijun Gao,Mingxiu Sui,Taiyuan Mei,Xiaohan Cheng,Iris Li
关键词-EN: Convolutional Neural Networks, Convolutional Neural, deep learning technology, practical applications, real-world problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the rise of deep learning technology in practical applications, Convolutional Neural Networks (CNNs) have been able to assist humans in solving many real-world problems. To enhance the performance of CNNs, numerous network architectures have been explored. Some of these architectures are designed based on the accumulated experience of researchers over time, while others are designed through neural architecture search methods. The improvements made to CNNs by the aforementioned methods are quite significant, but most of the improvement methods are limited in reality by model size and environmental constraints, making it difficult to fully realize the improved performance. In recent years, research has found that many CNN structures can be explained by the discretization of ordinary differential equations. This implies that we can design theoretically supported deep network structures using higher-order numerical difference methods. It should be noted that most of the previous CNN model structures are based on low-order numerical methods. Therefore, considering that the accuracy of linear multi-step numerical difference methods is higher than that of the forward Euler method, this paper proposes a stacking scheme based on the linear multi-step method. This scheme enhances the performance of ResNet without increasing the model size and compares it with the Runge-Kutta scheme. The experimental results show that the performance of the stacking scheme proposed in this paper is superior to existing stacking schemes (ResNet and HO-ResNet), and it has the capability to be extended to other types of neural networks.
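The contrast between forward-Euler stacking (a plain residual block) and a linear multi-step scheme can be made concrete with the two-step Adams-Bashforth formula; whether the paper uses exactly this formula is an assumption — it is one standard instance of the linear multi-step family.

```python
def euler_stack(f, x0, n_layers):
    """Plain ResNet-style stacking: x_{n+1} = x_n + f(x_n) (forward Euler)."""
    x = x0
    for _ in range(n_layers):
        x = x + f(x)
    return x

def adams_bashforth2_stack(f, x0, n_layers):
    """Linear multi-step (2-step Adams-Bashforth) stacking:
    x_{n+1} = x_n + (3/2) f(x_n) - (1/2) f(x_{n-1}).
    The extra term reuses the previous layer's output, so no parameters
    are added relative to the Euler scheme."""
    x_prev, x = x0, x0 + f(x0)        # bootstrap the first step with Euler
    for _ in range(n_layers - 1):
        x_prev, x = x, x + 1.5 * f(x) - 0.5 * f(x_prev)
    return x
```

On the test ODE x' = -x with step 0.1, the two-step scheme tracks the exact solution more closely than Euler after the same number of "layers", which is the intuition behind higher-order stacking.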

[AI-95] HYDRA: Hybrid Data Multiplexing and Run-time Layer Configurable DNN Accelerator

链接: https://arxiv.org/abs/2409.04976
作者: Sonu Kumar,Komal Gupta,Gopal Raut,Mukul Lokhande,Santosh Kumar Vishvakarma
关键词-EN: Deep neural networks, Deep neural, executing efficient computation, neural networks, offer plenty
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) offer plenty of challenges in executing efficient computation at edge nodes, primarily due to the huge hardware resource demands. The article proposes HYDRA, hybrid data multiplexing and runtime layer configurable DNN accelerators, to overcome these drawbacks. The work proposes a layer-multiplexed approach, which further reuses a single activation function within the execution of a single layer with improved Fused-Multiply-Accumulate (FMA). The proposed approach works in iterative mode to reuse the same hardware and execute different layers in a configurable fashion. The proposed architectures achieve over 90% reduction in power consumption and improved resource utilization over state-of-the-art works, at 35.21 TOPS/W. The proposed architecture reduces by (N-1)x the area overhead required for bandwidth, activation function (AF), and layer architecture. This work shows that the HYDRA architecture supports optimal DNN computations while improving performance on resource-constrained edge devices.

[AI-96] Soft Actor-Critic with Beta Policy via Implicit Reparameterization Gradients

链接: https://arxiv.org/abs/2409.04971
作者: Luca Della Libera
关键词-EN: poor sample efficiency, sample efficiency remains, deep reinforcement learning, Recent advances, achieved impressive results
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Recent advances in deep reinforcement learning have achieved impressive results in a wide range of complex tasks, but poor sample efficiency remains a major obstacle to real-world deployment. Soft actor-critic (SAC) mitigates this problem by combining stochastic policy optimization and off-policy learning, but its applicability is restricted to distributions whose gradients can be computed through the reparameterization trick. This limitation excludes several important examples such as the beta distribution, which was shown to improve the convergence rate of actor-critic algorithms in high-dimensional continuous control problems thanks to its bounded support. To address this issue, we investigate the use of implicit reparameterization, a powerful technique that extends the class of reparameterizable distributions. In particular, we use implicit reparameterization gradients to train SAC with the beta policy on simulated robot locomotion environments and compare its performance with common baselines. Experimental results show that the beta policy is a viable alternative, as it outperforms the normal policy and is on par with the squashed normal policy, which is the go-to choice for SAC. The code is available at this https URL.

[AI-97] Evaluation of Google Translate for Mandarin Chinese translation using sentiment and semantic analysis

链接: https://arxiv.org/abs/2409.04964
作者: Xuechun Wang,Rodney Beard,Rohitash Chandra
关键词-EN: significant global impact, making communication easier, Google Translate, global impact, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine translation using large language models (LLMs) is having a significant global impact, making communication easier. Mandarin Chinese is the official language used for communication by the government, education institutes, and media in China. In this study, we provide an automated assessment of machine translation models with human experts using sentiment and semantic analysis. To demonstrate our framework, we select the classic early twentieth-century novel ‘The True Story of Ah Q’ along with selected Mandarin Chinese to English translations. We also use Google Translate to render the given text into English and then conduct a chapter-wise sentiment analysis and semantic analysis to compare the extracted sentiments across the different translations. We utilise LLMs for semantic and sentiment analysis. Our results indicate that the precision of Google Translate differs both in terms of semantic and sentiment analysis when compared to human expert translations. We find that Google Translate is unable to translate some specific words or phrases in Chinese, such as Chinese traditional allusions. The mistranslations are due to its lack of contextual significance and historical knowledge of China. Thus, this framework provides new insights about machine translation for Mandarin Chinese. Future work can explore other languages or types of texts with this framework.

[AI-98] DDNet: Deformable Convolution and Dense FPN for Surface Defect Detection in Recycled Books

链接: https://arxiv.org/abs/2409.04958
作者: Jun Yu,WenJian Wang
关键词-EN: second-hand goods market, worth largely dependent, reused textbooks, hold significant, goods market
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recycled and recirculated books, such as ancient texts and reused textbooks, hold significant value in the second-hand goods market, with their worth largely dependent on surface preservation. However, accurately assessing surface defects is challenging due to the wide variations in shape, size, and the often imprecise detection of defects. To address these issues, we propose DDNet, an innovative detection model designed to enhance defect localization and classification. DDNet introduces a surface defect feature extraction module based on a deformable convolution operator (DC) and a densely connected FPN module (DFPN). The DC module dynamically adjusts the convolution grid to better align with object contours, capturing subtle shape variations and improving boundary delineation and prediction accuracy. Meanwhile, DFPN leverages dense skip connections to enhance feature fusion, constructing a hierarchical structure that generates multi-resolution, high-fidelity feature maps, thus effectively detecting defects of various sizes. In addition to the model, we present a comprehensive dataset specifically curated for surface defect detection in recycled and recirculated books. This dataset encompasses a diverse range of defect types, shapes, and sizes, making it ideal for evaluating the robustness and effectiveness of defect detection models. Through extensive evaluations, DDNet achieves precise localization and classification of surface defects, recording a mAP value of 46.7% on our proprietary dataset - an improvement of 14.2% over the baseline model - demonstrating its superior detection capabilities.

[AI-99] Evaluating Neural Networks Architectures for Spring Reverb Modelling

链接: https://arxiv.org/abs/2409.04953
作者: Francesco Papaleo,Xavier Lizarraga-Seijas,Frederic Font
关键词-EN: Virtual Analogue Modelling, Virtual Analogue, digital signal processing, signal processing techniques, spatial audio perception
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
*备注: 8 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Reverberation is a key element in spatial audio perception, historically achieved with the use of analogue devices, such as plate and spring reverb, and in the last decades with digital signal processing techniques that have allowed different approaches for Virtual Analogue Modelling (VAM). The electromechanical functioning of the spring reverb makes it a nonlinear system that is difficult to fully emulate in the digital domain with white-box modelling techniques. In this study, we compare five different neural network architectures, including convolutional and recurrent models, to assess their effectiveness in replicating the characteristics of this audio effect. The evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz. This paper specifically focuses on neural audio architectures that offer parametric control, aiming to advance the boundaries of current black-box modelling techniques in the domain of spring reverberation.

[AI-100] Attention-Based Efficient Breath Sound Removal in Studio Audio Recordings

链接: https://arxiv.org/abs/2409.04949
作者: Nidula Elgiriyewithana,N. D.Kodikara
关键词-EN: attention U-Net architecture, non-speech vocal sounds, specifically breath sounds, vocal recordings, non-speech vocal
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In this research, we present an innovative, parameter-efficient model that utilizes the attention U-Net architecture for the automatic detection and eradication of non-speech vocal sounds, specifically breath sounds, in vocal recordings. This task is of paramount importance in the field of sound engineering, despite being relatively under-explored. The conventional manual process for detecting and eliminating these sounds requires significant expertise and is extremely time-intensive. Existing automated detection and removal methods often fall short in terms of efficiency and precision. Our proposed model addresses these limitations by offering a streamlined process and superior accuracy, achieved through the application of advanced deep learning techniques. A unique dataset, derived from Device and Produced Speech (DAPS), was employed for this purpose. The training phase of the model emphasizes a log spectrogram and integrates an early stopping mechanism to prevent overfitting. Our model not only conserves precious time for sound engineers but also enhances the quality and consistency of audio production. This constitutes a significant breakthrough, as evidenced by its comparative efficiency, necessitating only 1.9M parameters and a training duration of 3.2 hours - markedly less than the top-performing models in this domain. The model is capable of generating identical outputs as previous models with drastically improved precision, making it an optimal choice.

[AI-101] Maximizing Relation Extraction Potential: A Data-Centric Study to Unveil Challenges and Opportunities

链接: https://arxiv.org/abs/2409.04934
作者: Anushka Swarup,Avanti Bhandarkar,Olivia P. Dizon-Paradis,Ronald Wilson,Damon L. Woodard
关键词-EN: Natural Language Processing, Processing task aiming, Language Processing task, Processing task, Natural Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Relation extraction is a Natural Language Processing task aiming to extract relationships from textual data. It is a critical step for information extraction. Due to its wide-scale applicability, research in relation extraction has rapidly scaled to using highly advanced neural networks. Despite their computational superiority, modern relation extractors fail to handle complicated extraction scenarios. However, a comprehensive performance analysis of the state-of-the-art relation extractors that compile these challenges has been missing from the literature, and this paper aims to bridge this gap. The goal has been to investigate the possible data-centric characteristics that impede neural relation extraction. Based on extensive experiments conducted using 15 state-of-the-art relation extraction algorithms ranging from recurrent architectures to large language models and seven large-scale datasets, this research suggests that modern relation extractors are not robust to complex data and relation characteristics. It emphasizes pivotal issues, such as contextual ambiguity, correlating relations, long-tail data, and fine-grained relation distributions. In addition, it sets a marker for future directions to alleviate these issues, thereby proving to be a critical resource for novice and advanced researchers. Efficient handling of the challenges described can have significant implications for the field of information extraction, which is a critical part of popular systems such as search engines and chatbots. Data and relevant code can be found at this https URL.

[AI-102] A Delta-evaluation function for column permutation problems

链接: https://arxiv.org/abs/2409.04926
作者: Júnior R. Lima,Viníicius Gandra M. Santos,Marco Antonio M. Carvalho
关键词-EN: sparse binary matrix, column permutation problem, permutation problem defined, consecutive ones property, introduced for solving
类目: Artificial Intelligence (cs.AI); Combinatorics (math.CO); Optimization and Control (math.OC)
*备注: technical report

点击查看摘要

Abstract:In this study, a new \Delta-evaluation method is introduced for solving a column permutation problem defined on a sparse binary matrix with the consecutive ones property. This problem models various \mathcal{NP}-hard problems in graph theory and industrial manufacturing contexts. The computational experiments compare the processing time of the \Delta-evaluation method with two other methods used in well-known local search procedures. The study considers a comprehensive set of instances of well-known problems, such as Gate Matrix Layout and Minimization of Open Stacks. The proposed evaluation method is generally competitive and particularly useful for large and dense instances. It can be easily integrated into local search and metaheuristic algorithms to improve solutions without significantly increasing processing time.
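The principle of \Delta-evaluation — recompute only what a move touches — can be sketched on a toy column-permutation objective (the per-row span of the 1-entries, a Minimization-of-Open-Stacks-style surrogate; the paper's actual objective may differ).

```python
def row_span(matrix_row, perm):
    """Cost of one row under a column permutation: distance between the
    first and last position holding a 1."""
    pos = [perm.index(c) for c, v in enumerate(matrix_row) if v]
    return (max(pos) - min(pos)) if pos else 0

def full_eval(M, perm):
    """Full O(rows x cols) evaluation of the objective."""
    return sum(row_span(row, perm) for row in M)

def delta_eval(M, perm, i, j):
    """Delta-evaluation (sketch): cost change of swapping the columns at
    positions i and j, recomputing only the rows that touch either column."""
    ci, cj = perm[i], perm[j]
    affected = [r for r in M if r[ci] or r[cj]]
    before = sum(row_span(r, perm) for r in affected)
    new_perm = perm[:]
    new_perm[i], new_perm[j] = new_perm[j], new_perm[i]
    after = sum(row_span(r, new_perm) for r in affected)
    return after - before
```

Inside a local search, `delta_eval` replaces two full evaluations per candidate swap, which is where the processing-time advantage measured in the paper comes from.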

[AI-103] MoistNet: Machine Vision-based Deep Learning Models for Wood Chip Moisture Content Measurement

链接: https://arxiv.org/abs/2409.04920
作者: Abdur Rahman,Jason Street,James Wooten,Mohammad Marufuzzaman,Veera G. Gude,Randy Buchanan,Haifeng Wang
关键词-EN: numerous forest-reliant industries, Quick and reliable, moisture content, chip moisture content, wood chip moisture
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quick and reliable measurement of wood chip moisture content is an everlasting problem for numerous forest-reliant industries such as biofuel, pulp and paper, and bio-refineries. Moisture content is a critical attribute of wood chips due to its direct relationship with the final product quality. Conventional techniques for determining moisture content, such as oven-drying, possess some drawbacks in terms of their time-consuming nature, potential sample damage, and lack of real-time feasibility. Furthermore, alternative techniques, including NIR spectroscopy, electrical capacitance, X-rays, and microwaves, have demonstrated potential; nevertheless, they are still constrained by issues related to portability, precision, and the expense of the required equipment. Hence, there is a need for a moisture content determination method that is instant, portable, non-destructive, inexpensive, and precise. This study explores the use of deep learning and machine vision to predict moisture content classes from RGB images of wood chips. A large-scale image dataset comprising 1,600 RGB images of wood chips has been collected and annotated with ground truth labels, utilizing the results of the oven-drying technique. Two high-performing neural networks, MoistNetLite and MoistNetMax, have been developed leveraging Neural Architecture Search (NAS) and hyperparameter optimization. The developed models are evaluated and compared with state-of-the-art deep learning models. Results demonstrate that MoistNetLite achieves 87% accuracy with minimal computational overhead, while MoistNetMax exhibits exceptional precision with a 91% accuracy in wood chip moisture content class prediction. With improved accuracy and faster prediction speed, our proposed MoistNet models hold great promise for the wood chip processing industry.

[AI-104] Activation Function Optimization Scheme for Image Classification

链接: https://arxiv.org/abs/2409.04915
作者: Abdur Rahman,Lu He,Haifeng Wang
关键词-EN: activation functions, Activation, functions, significant impact, Error Linear Unit
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Activation function has a significant impact on the dynamics, convergence, and performance of deep neural networks. The search for a consistent and high-performing activation function has always been a pursuit during deep learning model development. Existing state-of-the-art activation functions are manually designed with human expertise, except for Swish, which was developed using a reinforcement learning-based search strategy. In this study, we propose an evolutionary approach for optimizing activation functions specifically for image classification tasks, aiming to discover functions that outperform current state-of-the-art options. Through this optimization framework, we obtain a series of high-performing activation functions denoted as Exponential Error Linear Unit (EELU). The developed activation functions are evaluated for image classification tasks from two perspectives: (1) five state-of-the-art neural network architectures, such as ResNet50, AlexNet, VGG16, MobileNet, and Compact Convolutional Transformer, which cover computationally heavy to light neural networks, and (2) eight standard datasets, including CIFAR10, Imagenette, MNIST, Fashion MNIST, Beans, Colorectal Histology, CottonWeedID15, and TinyImageNet, which cover typical machine vision benchmarks, agricultural image applications, and medical image applications. Finally, we statistically investigate the generalization of the resultant activation functions developed through the optimization scheme. With a Friedman test, we conclude that the optimization scheme is able to generate activation functions that outperform the existing standard ones in 92.8% of the 28 different cases studied, and -x \cdot \mathrm{erf}(e^{-x}) is found to be the best activation function for image classification generated by the optimization scheme.
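The winning activation can be implemented directly from its formula; the shape commentary in the comments is derived from the formula itself, not from the paper.

```python
import math

def eelu(x):
    """Best-found activation from the paper's search: -x * erf(e^{-x}).
    For large negative x, erf(e^{-x}) -> 1, so eelu(x) -> -x (grows linearly);
    for large positive x, erf(e^{-x}) ~ 2 e^{-x} / sqrt(pi), so the output
    decays toward 0 from below."""
    return -x * math.erf(math.exp(-x))
```

A vectorized version for arrays would replace `math` with the equivalent NumPy/SciPy calls; the scalar form above is enough to check the formula.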

[AI-105] Reinforcement Learning-Based Adaptive Load Balancing for Dynamic Cloud Environments

链接: https://arxiv.org/abs/2409.04896
作者: Kavish Chawla
关键词-EN: Efficient load balancing, cloud computing environments, prevent server overload, Efficient load, load balancing
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: Length: 6 pages (including references) Figures: 3 figures Submission Type: Conference paper Keywords: Reinforcement Learning, Load Balancing, Cloud Computing, Adaptive Algorithms, AI-driven Load Management

点击查看摘要

Abstract:Efficient load balancing is crucial in cloud computing environments to ensure optimal resource utilization, minimize response times, and prevent server overload. Traditional load balancing algorithms, such as round-robin or least connections, are often static and unable to adapt to the dynamic and fluctuating nature of cloud workloads. In this paper, we propose a novel adaptive load balancing framework using Reinforcement Learning (RL) to address these challenges. The RL-based approach continuously learns and improves the distribution of tasks by observing real-time system performance and making decisions based on traffic patterns and resource availability. Our framework is designed to dynamically reallocate tasks to minimize latency and ensure balanced resource usage across servers. Experimental results show that the proposed RL-based load balancer outperforms traditional algorithms in terms of response time, resource utilization, and adaptability to changing workloads. These findings highlight the potential of AI-driven solutions for enhancing the efficiency and scalability of cloud infrastructures.
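A toy Q-learning balancer illustrates the learning loop the abstract describes — observe system state, assign a task, receive a latency-based reward, update the policy. The state and reward design here are deliberate simplifications of a real cloud environment, not the paper's formulation.

```python
import random

def train_rl_balancer(n_servers=2, episodes=2000, alpha=0.2, gamma=0.9, seed=0):
    """Toy Q-learning load balancer (sketch): state = index of the server
    with the lighter current load, action = server to assign the next task
    to, reward = 1 when the task lands on the lighter server (lower latency)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_servers) for a in range(n_servers)}
    for _ in range(episodes):
        state = rng.randrange(n_servers)
        # epsilon-greedy action selection (epsilon = 0.1)
        if rng.random() < 0.1:
            action = rng.randrange(n_servers)
        else:
            action = max(range(n_servers), key=lambda a: Q[(state, a)])
        reward = 1.0 if action == state else 0.0
        next_state = rng.randrange(n_servers)
        best_next = max(Q[(next_state, a)] for a in range(n_servers))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return Q

def choose_server(Q, state, n_servers=2):
    """Greedy dispatch with the learned Q-table."""
    return max(range(n_servers), key=lambda a: Q[(state, a)])
```

After training, the greedy policy routes each task to the currently lighter server — the adaptive behavior a static round-robin rule cannot learn.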

[AI-106] Defeasible Reasoning on Concepts

链接: https://arxiv.org/abs/2409.04887
作者: Yiwen Ding,Krishna Manoorkar,Ni Wayan Switrayni,Ruoding Wang
关键词-EN: KLM framework, developing defeasible reasoning, concepts in KLM, steps toward developing, developing defeasible
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:In this paper, we take first steps toward developing defeasible reasoning on concepts in KLM framework. We define generalizations of cumulative reasoning system C and cumulative reasoning system with loop CL to conceptual setting. We also generalize cumulative models, cumulative ordered models, and preferential models to conceptual setting and show the soundness and completeness results for these models.

[AI-107] Learning to Open and Traverse Doors with a Legged Manipulator

链接: https://arxiv.org/abs/2409.04882
作者: Mike Zhang,Yuntao Ma,Takahiro Miki,Marco Hutter
关键词-EN: significant practical interest, giving robots greater, robots greater access, human-centric spaces, longstanding challenge
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Using doors is a longstanding challenge in robotics and is of significant practical interest in giving robots greater access to human-centric spaces. The task is challenging due to the need for online adaptation to varying door properties and precise control in manipulating the door panel and navigating through the confined doorway. To address this, we propose a learning-based controller for a legged manipulator to open and traverse through doors. The controller is trained using a teacher-student approach in simulation to learn robust task behaviors as well as estimate crucial door properties during the interaction. Unlike previous works, our approach is a single control policy that can handle both push and pull doors through learned behaviour which infers the opening direction during deployment without prior knowledge. The policy was deployed on the ANYmal legged robot with an arm and achieved a success rate of 95.0% in repeated trials conducted in an experimental setting. Additional experiments validate the policy’s effectiveness and robustness to various doors and disturbances. A video overview of the method and experiments can be found at this http URL.

[AI-108] Towards identifying Source credibility on Information Leakage in Digital Gadget Market

链接: https://arxiv.org/abs/2409.04880
作者: Neha Kumaru,Garvit Gupta,Shreyas Mongia,Shubham Singh,Ponnurangam Kumaraguru,Arun Balaji Buduru
关键词-EN: Social media, Social media includes, constant rise, share content, digital gadget market
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The use of Social media to share content is on a constant rise. One of the capsize effects of information sharing on Social media is the spread of sensitive information in the public domain. With the digital gadget market becoming highly competitive and ever-evolving, an increasing number of sensitive posts leaking information on devices is observed on social media. Many web-blogs on the digital gadget market have mushroomed recently, making the problem of information leaks all-pervasive. Credible leaks on the specifics of an upcoming device can cause a lot of financial damage to the respective organization. Hence, it is crucial to assess the credibility of the platforms that continuously post about smartphone or digital gadget leaks. In this work, we analyze the headlines of leak web-blog posts and their corresponding official press releases. We first collect 54,495 leak and press-release headlines for different smartphones. We train our custom NER model to capture evolving smartphone names with an accuracy of 82.14% on manually annotated results. We further propose a credibility score metric for the web-blogs, based on the number of falsified and authentic smartphone leak posts.
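The abstract only says the credibility score is "based on the number of falsified and authentic smartphone leak posts"; one natural formulation, shown below purely as an assumption (the paper's exact metric may differ), is a Laplace-smoothed fraction of confirmed-authentic posts:

```python
def credibility_score(authentic, falsified, prior=1.0):
    """Smoothed fraction of a blog's leak posts later confirmed authentic.

    The `prior` pseudo-count keeps blogs with little history near 0.5
    instead of jumping to 0 or 1 on a single post. Illustrative only.
    """
    if authentic < 0 or falsified < 0:
        raise ValueError("counts must be non-negative")
    return (authentic + prior) / (authentic + falsified + 2 * prior)

# Ranking hypothetical blogs by their confirmed leak history:
blogs = {"blog_a": (45, 5), "blog_b": (3, 27), "blog_c": (0, 0)}
ranked = sorted(blogs, key=lambda b: credibility_score(*blogs[b]), reverse=True)
```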

[AI-109] FedModule: A Modular Federated Learning Framework

链接: https://arxiv.org/abs/2409.04849
作者: Chuyi Chen,Zhe Zhang,Yanchao Zhao
关键词-EN: smart cities, widely adopted, Federated learning, complex experimental scenarios, experimental scenarios
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has been widely adopted across various applications, such as healthcare, finance, and smart cities. However, as experimental scenarios become more complex, existing FL frameworks and benchmarks have struggled to keep pace. This paper introduces FedModule, a flexible and extensible FL experimental framework that has been open-sourced to support diverse FL paradigms and provide comprehensive benchmarks for complex experimental scenarios. FedModule adheres to the “one code, all scenarios” principle and employs a modular design that breaks the FL process into individual components, allowing for the seamless integration of different FL paradigms. The framework supports synchronous, asynchronous, and personalized federated learning, with over 20 implemented algorithms. Experiments conducted on public datasets demonstrate the flexibility and extensibility of FedModule. The framework offers multiple execution modes-including linear, threaded, process-based, and distributed-enabling users to tailor their setups to various experimental needs. Additionally, FedModule provides extensive logging and testing capabilities, which facilitate detailed performance analysis of FL algorithms. Comparative evaluations against existing FL toolkits, such as TensorFlow Federated, PySyft, Flower, and FLGo, highlight FedModule’s superior scalability, flexibility, and comprehensive benchmark support. By addressing the limitations of current FL frameworks, FedModule marks a significant advancement in FL experimentation, providing researchers and practitioners with a robust tool for developing and evaluating FL algorithms across a wide range of scenarios.
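As a concrete reference point for the synchronous paradigm FedModule supports, the core aggregation step of federated averaging can be written in a few lines. This is the textbook FedAvg rule, not FedModule's actual API:

```python
def fed_avg(client_weights, client_sizes):
    """Aggregate client models by dataset-size-weighted averaging.

    client_weights: one flat parameter list per client.
    client_sizes: number of local samples per client (the weights).
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

Asynchronous and personalized variants change when and how this step runs (stale updates, per-client heads), which is exactly the kind of variation a modular framework factors into swappable components.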

[AI-110] Sample- and Oracle-Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions

链接: https://arxiv.org/abs/2409.04840
作者: Zakaria Mhammedi
关键词-EN: feasible reinforcement learning, Markov Decision Processes, Designing sample-efficient, computationally feasible reinforcement, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are \emphexponential in the horizon.

[AI-111] Reducing Events to Augment Log-based Anomaly Detection Models: An Empirical Study

链接: https://arxiv.org/abs/2409.04834
作者: Lingzhe Zhang,Tong Jia,Kangjin Wang,Mengxi Jia,Yang Yong,Ying Li
关键词-EN: grow increasingly intricate, systems grow increasingly, anomaly detection, increasingly intricate, essential and challenging
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted By ESEM’24

点击查看摘要

Abstract:As software systems grow increasingly intricate, the precise detection of anomalies has become both essential and challenging. Current log-based anomaly detection methods depend heavily on vast amounts of log data, leading to inefficient inference and potential misguidance by noise logs. However, the quantitative effects of log reduction on the effectiveness of anomaly detection remain unexplored. Therefore, we first conduct a comprehensive study on six distinct models spanning three datasets. Through the study, the impact of log quantity and its effectiveness in representing anomalies is quantified, uncovering three distinctive log event types that influence model performance differently. Drawing from these insights, we propose LogCleaner: an efficient methodology for the automatic reduction of log events in the context of anomaly detection. Serving as middleware between software systems and models, LogCleaner continuously updates and filters anti-events and duplicative-events in the raw generated logs. Experimental outcomes highlight LogCleaner’s capability to reduce over 70% of log events in anomaly detection, accelerating the model’s inference speed by approximately 300%, and universally improving the performance of models for anomaly detection.
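The two filters named in the abstract can be illustrated with a deliberately simplified heuristic: drop event types that never co-occur with an anomalous window ("anti-events") and event types whose occurrence pattern exactly duplicates an already-kept type ("duplicative-events"). The real middleware works online and uses more nuanced criteria; this is only a sketch of the idea:

```python
def reduce_log_events(windows):
    """Reduce log event types before anomaly detection.

    `windows` is a list of (set_of_event_types, is_anomalous) pairs.
    Returns the event types worth keeping, in sorted order.
    """
    event_types = sorted({e for evs, _ in windows for e in evs})
    # Occurrence pattern of each event type across all windows.
    pattern = {e: tuple(e in evs for evs, _ in windows) for e in event_types}
    keep, seen_patterns = [], set()
    for e in event_types:
        if not any(e in evs for evs, anom in windows if anom):
            continue  # anti-event: never co-occurs with an anomaly
        if pattern[e] in seen_patterns:
            continue  # duplicative-event: adds no new signal
        seen_patterns.add(pattern[e])
        keep.append(e)
    return keep

windows = [
    ({"boot", "heartbeat"}, False),
    ({"boot", "io_error", "retry"}, True),
    ({"heartbeat"}, False),
    ({"io_error", "retry"}, True),
]
kept = reduce_log_events(windows)
```

Here `heartbeat` is filtered as an anti-event and `retry` as a duplicate of `io_error`, halving the event vocabulary the downstream model must process.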

[AI-112] Achieving Peak Performance for Large Language Models: A Systematic Review

链接: https://arxiv.org/abs/2409.04833
作者: Zhyar Rzgar K Rostam,Sándor Szénási,Gábor Kertész
关键词-EN: achieved remarkable success, natural language processing, achieved remarkable, remarkable success, success in natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 34 pages, 7 figures, 8 tables. Journal Article: IEEE Access

点击查看摘要

Abstract:In recent years, large language models (LLMs) have achieved remarkable success in natural language processing (NLP). LLMs require an extreme amount of parameters to attain high performance. As models grow into the trillion-parameter range, computational and memory costs increase significantly. This makes it difficult for many researchers to access the resources needed to train or apply these models. Optimizing LLM performance involves two main approaches: fine-tuning pre-trained models for specific tasks to achieve state-of-the-art performance, and reducing costs or improving training time while maintaining similar performance. This paper presents a systematic literature review (SLR) following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. We reviewed 65 publications out of 983 from 2017 to December 2023, retrieved from 5 databases. The study presents methods to optimize and accelerate LLMs while achieving cutting-edge results without sacrificing accuracy. We begin with an overview of the development of language modeling, followed by a detailed explanation of commonly used frameworks and libraries, and a taxonomy for improving and speeding up LLMs based on three classes: LLM training, LLM inference, and system serving. We then delve into recent optimization and acceleration strategies such as training optimization, hardware optimization, scalability and reliability, accompanied by the taxonomy and categorization of these strategies. Finally, we provide an in-depth comparison of each class and strategy, with two case studies on optimizing model training and enhancing inference efficiency. These case studies showcase practical approaches to address LLM resource limitations while maintaining performance.

[AI-113] Reward-Directed Score-Based Diffusion Models via q-Learning

链接: https://arxiv.org/abs/2409.04832
作者: Xuefeng Gao,Jiale Zha,Xun Yu Zhou
关键词-EN: maximize reward functions, generated distributions close, unknown target data, training continuous-time score-based, target data distributions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose a new reinforcement learning (RL) formulation for training continuous-time score-based diffusion models for generative AI to generate samples that maximize reward functions while keeping the generated distributions close to the unknown target data distributions. Different from most existing studies, our formulation does not involve any pretrained model for the unknown score functions of the noise-perturbed data distributions. We present an entropy-regularized continuous-time RL problem and show that the optimal stochastic policy has a Gaussian distribution with a known covariance matrix. Based on this result, we parameterize the mean of Gaussian policies and develop an actor-critic type (little) q-learning algorithm to solve the RL problem. A key ingredient in our algorithm design is to obtain noisy observations from the unknown score function via a ratio estimator. Numerically, we show the effectiveness of our approach by comparing its performance with two state-of-the-art RL methods that fine-tune pretrained models. Finally, we discuss extensions of our RL formulation to probability flow ODE implementation of diffusion models and to conditional diffusion models.

[AI-114] MILE: A Mutation Testing Framework of In-Context Learning Systems

链接: https://arxiv.org/abs/2409.04831
作者: Zeming Wei,Yihao Zhang,Meng Sun
关键词-EN: achieved notable success, large language models, achieved notable, notable success, applications of large
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. Such mysterious ability of LLMs has attracted great research interests in understanding, formatting, and improving the in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity against the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at this https URL.
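The structure of such a mutation-testing setup is easy to sketch: mutation operators perturb the in-context demonstrations, and a mutation score counts how many mutants "kill" (change) the model's prediction. The operators and the stand-in `predict` below are illustrative assumptions, not the framework's actual interface:

```python
def shuffle_demos(demos, seed=0):
    """Mutation operator: permute demonstration order, probing the known
    order-sensitivity of in-context learning."""
    import random
    mutated = list(demos)
    random.Random(seed).shuffle(mutated)
    return mutated

def rotate_labels(demos):
    """Mutation operator: pair each input with the next example's output,
    corrupting the input-output mapping."""
    outputs = [y for _, y in demos]
    return [(x, y) for (x, _), y in zip(demos, outputs[1:] + outputs[:1])]

def mutation_score(predict, demos, query, operators):
    """Fraction of mutants whose prediction differs from the unmutated
    prompt's. `predict(demos, query)` stands in for an LLM call."""
    baseline = predict(demos, query)
    killed = sum(1 for op in operators if predict(op(demos), query) != baseline)
    return killed / len(operators)

demos = [("2+2", "4"), ("3+3", "6"), ("5+5", "10")]
toy_predict = lambda d, q: d[0][1]  # toy model: echoes the first demo's label
score = mutation_score(toy_predict, demos, "4+4", [rotate_labels])
```

A high mutation score indicates the test demonstrations actually constrain the model's behavior; a low score suggests the test suite is too weak to detect corrupted prompts.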

[AI-115] NASH: Neural Architecture and Accelerator Search for Multiplication-Reduced Hybrid Models

链接: https://arxiv.org/abs/2409.04829
作者: Yang Xu,Huihong Shi,Zhongfeng Wang
关键词-EN: significant computational cost, deep neural networks, uparrow, edge devices, Search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The significant computational cost of multiplications hinders the deployment of deep neural networks (DNNs) on edge devices. While multiplication-free models offer enhanced hardware efficiency, they typically sacrifice accuracy. As a solution, multiplication-reduced hybrid models have emerged to combine the benefits of both approaches. Particularly, prior works, i.e., NASA and NASA-F, leverage Neural Architecture Search (NAS) to construct such hybrid models, enhancing hardware efficiency while maintaining accuracy. However, they either entail costly retraining or encounter gradient conflicts, limiting both search efficiency and accuracy. Additionally, they overlook the acceleration opportunity introduced by accelerator search, yielding sub-optimal hardware performance. To overcome these limitations, we propose NASH, a Neural architecture and Accelerator Search framework for multiplication-reduced Hybrid models. Specifically, as for NAS, we propose a tailored zero-shot metric to pre-identify promising hybrid models before training, enhancing search efficiency while alleviating gradient conflicts. Regarding accelerator search, we innovatively introduce coarse-to-fine search to streamline the search process. Furthermore, we seamlessly integrate these two levels of searches to unveil NASH, obtaining the optimal model and accelerator pairing. Experiments validate our effectiveness, e.g., when compared with the state-of-the-art multiplication-based system, we can achieve ↑2.14× throughput and ↑2.01× FPS with ↑0.25% accuracy on CIFAR-100, and ↑1.40× throughput and ↑1.19× FPS with ↑0.56% accuracy on Tiny-ImageNet. Codes are available at this https URL.

[AI-116] POINTS: Improving Your Vision-language Model with Affordable Strategies

链接: https://arxiv.org/abs/2409.04828
作者: Yuan Liu,Zhongyin Zhao,Ziyuan Zhuang,Le Tian,Xiao Zhou,Jie Zhou
关键词-EN: made significant strides, optical character recognition, significant strides, excelling in tasks, geometric problem-solving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
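The perplexity-based filtering step in contribution (2) is simple to state: score every pre-training sample with a language model's perplexity and keep only the lowest-perplexity fraction. The paper uses an LLM for the scoring; to keep this sketch self-contained, a smoothed unigram model stands in for it:

```python
import math

def perplexity(text, unigram_probs, vocab_size):
    """Unigram perplexity with a uniform fallback for unseen words.
    Lower perplexity = the model finds the text more predictable."""
    tokens = text.split()
    log_prob = sum(math.log(unigram_probs.get(t, 1.0 / vocab_size)) for t in tokens)
    return math.exp(-log_prob / len(tokens))

def filter_lowest_perplexity(samples, unigram_probs, vocab_size, keep_fraction=0.5):
    """Keep the keep_fraction of samples with the lowest perplexity."""
    ranked = sorted(samples, key=lambda s: perplexity(s, unigram_probs, vocab_size))
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

probs = {"the": 0.5, "cat": 0.25, "sat": 0.25}
kept = filter_lowest_perplexity(["zebra zebra", "the cat sat"], probs, vocab_size=1000)
```

Replacing the unigram scorer with a pretrained LLM turns this into the curation procedure the abstract describes: a 1M-sample subset selected by predictability rather than by ad-hoc dataset accretion.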

[AI-117] Exploring Straightforward Conversational Red-Teaming

链接: https://arxiv.org/abs/2409.04822
作者: George Kour,Naama Zwerdling,Marcel Zalmanovici,Ateret Anaby-Tavor,Ora Nova Fandina,Eitan Farchi
关键词-EN: Large language models, business dialogue systems, Large language, ethical risks, business dialogue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in business dialogue systems but they pose security and ethical risks. Multi-turn conversations, where context influences the model’s behavior, can be exploited to produce undesired responses. In this paper, we examine the effectiveness of utilizing off-the-shelf LLMs in straightforward red-teaming approaches, where an attacker LLM aims to elicit undesired output from a target LLM, comparing both single-turn and conversational red-teaming tactics. Our experiments offer insights into various usage strategies that significantly affect their performance as red teamers. They suggest that off-the-shelf models can act as effective red teamers and even adjust their attack strategy based on past attempts, although their effectiveness decreases with greater alignment.

[AI-118] FreeAugment: Data Augmentation Search Across All Degrees of Freedom ECCV2024

链接: https://arxiv.org/abs/2409.04820
作者: Tom Bekor,Niv Nayman,Lihi Zelnik-Manor
关键词-EN: automatic data augmentation, data augmentation search, Data augmentation, neural networks, integral part
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Data augmentation has become an integral part of deep learning, as it is known to improve the generalization capabilities of neural networks. Since the most effective set of image transformations differs between tasks and domains, automatic data augmentation search aims to alleviate the extreme burden of manually finding the optimal image transformations. However, current methods are not able to jointly optimize all degrees of freedom: (1) the number of transformations to be applied, their (2) types, (3) order, and (4) magnitudes. Many existing methods risk picking the same transformation more than once, limit the search to two transformations only, or search for the number of transformations exhaustively or iteratively in a myopic manner. Our approach, FreeAugment, is the first to achieve global optimization of all four degrees of freedom simultaneously, using a fully differentiable method. It efficiently learns the number of transformations and a probability distribution over their permutations, inherently refraining from redundant repetition while sampling. Our experiments demonstrate that this joint learning of all degrees of freedom significantly improves performance, achieving state-of-the-art results on various natural image benchmarks and beyond across other domains. Project page at this https URL

[AI-119] Top-GAP: Integrating Size Priors in CNNs for more Interpretability, Robustness and Bias Mitigation ECCV2024

链接: https://arxiv.org/abs/2409.04819
作者: Lars Nieradzik,Henrike Stephani,Janis Keuper
关键词-EN: convolutional neural networks, paper introduces Top-GAP, Effective Receptive Field, paper introduces, regularization technique
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: eXCV Workshop at ECCV 2024

点击查看摘要

Abstract:This paper introduces Top-GAP, a novel regularization technique that enhances the explainability and robustness of convolutional neural networks. By constraining the spatial size of the learned feature representation, our method forces the network to focus on the most salient image regions, effectively reducing background influence. Using adversarial attacks and the Effective Receptive Field, we show that Top-GAP directs more attention towards object pixels rather than the background. This leads to enhanced interpretability and robustness. We achieve over 50% robust accuracy on CIFAR-10 under PGD with ε = 8/255 and 20 iterations while maintaining the original clean accuracy. Furthermore, we see increases of up to 5% accuracy against distribution shifts. Our approach also yields more precise object localization, as evidenced by up to 25% improvement in Intersection over Union (IOU) compared to methods like GradCAM and Recipro-CAM.
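One way to read "constraining the spatial size of the learned feature representation" is a top-k variant of global average pooling: average only the k largest spatial activations per channel, so the network cannot spread evidence over the background. The pooling below is that reading, offered as a plausible sketch rather than the paper's exact operator:

```python
def top_gap(feature_map, k):
    """Top-k global average pooling.

    feature_map: list of channels, each a 2-D list of activations.
    Returns one pooled value per channel, averaging only the k largest
    spatial activations (k acts as the size prior).
    """
    pooled = []
    for channel in feature_map:
        flat = sorted((v for row in channel for v in row), reverse=True)
        pooled.append(sum(flat[:k]) / k)
    return pooled

# Two 2x2 channels; with k=2 only the strongest two locations contribute.
fm = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]
pooled = top_gap(fm, k=2)
```

Standard GAP would dilute the single strong activation in the second channel (average 2.0); top-k pooling preserves it, which is the intuition behind penalizing spatially diffuse representations.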

[AI-120] Generalized Learning of Coefficients in Spectral Graph Convolutional Networks

链接: https://arxiv.org/abs/2409.04813
作者: Mustafa Coşkun,Ananth Grama,Mehmet Koyutürk
关键词-EN: Spectral Graph Convolutional, Graph Convolutional Networks, Convolutional Networks, Spectral Graph, Graph Convolutional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spectral Graph Convolutional Networks (GCNs) have gained popularity in graph machine learning applications due, in part, to their flexibility in specification of network propagation rules. These propagation rules are often constructed as polynomial filters whose coefficients are learned using label information during training. In contrast to learned polynomial filters, explicit filter functions are useful in capturing relationships between network topology and distribution of labels across the network. A number of algorithms incorporating either approach have been proposed; however the relationship between filter functions and polynomial approximations is not fully resolved. This is largely due to the ill-conditioned nature of the linear systems that must be solved to derive polynomial approximations of filter functions. To address this challenge, we propose a novel Arnoldi orthonormalization-based algorithm, along with a unifying approach, called G-Arnoldi-GCN that can efficiently and effectively approximate a given filter function with a polynomial. We evaluate G-Arnoldi-GCN in the context of multi-class node classification across ten datasets with diverse topological characteristics. Our experiments show that G-Arnoldi-GCN consistently outperforms state-of-the-art methods when suitable filter functions are employed. Overall, G-Arnoldi-GCN opens important new directions in graph machine learning by enabling the explicit design and application of diverse filter functions. Code link: https://anonymous.4open.science/r/GArnoldi-GCN-F7E2/README.md
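For context on what a "polynomial filter" does operationally: once coefficients are fixed, spectral GCNs apply sum_k c_k A^k x using only repeated matrix-vector products, never an eigendecomposition. The sketch below shows that application step; the paper's Arnoldi-based fitting of the coefficients to a target filter function is beyond this fragment:

```python
def apply_poly_filter(adj_norm, x, coeffs):
    """Apply the polynomial filter sum_k coeffs[k] * A^k x.

    adj_norm: dense row-major (normalized) adjacency/propagation matrix.
    x: node signal. Cost is len(coeffs)-1 matrix-vector products.
    """
    n = len(x)
    out = [coeffs[0] * xi for xi in x]  # k = 0 term: identity
    vec = list(x)
    for c in coeffs[1:]:
        # One propagation step: vec <- A @ vec
        vec = [sum(adj_norm[i][j] * vec[j] for j in range(n)) for i in range(n)]
        out = [o + c * v for o, v in zip(out, vec)]
    return out
```

With coefficients learned from labels this is the usual spectral GCN layer; with coefficients fitted to an explicit filter function (the paper's goal), the same routine applies a designed filter such as a low-pass or heat kernel.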

[AI-121] HULLMI: Human vs LLM identification with explainability

链接: https://arxiv.org/abs/2409.04808
作者: Prathamesh Dinesh Joshi,Sahil Pocker,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
关键词-EN: producing human-like responses, modern NLP detectors, industrial pursuits dedicated, modern NLP, NLP detectors
类目: Artificial Intelligence (cs.AI)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:As LLMs become increasingly proficient at producing human-like responses, there has been a rise of academic and industrial pursuits dedicated to flagging a given piece of text as “human” or “AI”. Most of these pursuits involve modern NLP detectors like T5-Sentinel and RoBERTa-Sentinel, without paying too much attention to issues of interpretability and explainability of these models. In our study, we provide a comprehensive analysis that shows that traditional ML models (Naive Bayes, MLP, Random Forests, XGBoost) perform as well as modern NLP detectors, in human vs AI text detection. We achieve this by implementing a robust testing procedure on diverse datasets, including curated corpora and real-world samples. Subsequently, by employing the explainable AI technique LIME, we uncover parts of the input that contribute most to the prediction of each model, providing insights into the detection process. Our study contributes to the growing need for developing production-level LLM detection tools, which can leverage a wide range of traditional as well as modern NLP detectors we propose. Finally, the LIME techniques we demonstrate also have the potential to equip these detection tools with interpretability analysis features, making them more reliable and trustworthy in various domains like education, healthcare, and media.
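To make the "traditional ML models" concrete, here is a minimal multinomial Naive Bayes text classifier with add-one smoothing, one of the model families the study evaluates. The tiny training corpus is invented for illustration and obviously far simpler than the paper's datasets:

```python
import math
from collections import Counter

class NaiveBayesDetector:
    """Multinomial Naive Bayes over bag-of-words features."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        def log_posterior(c):
            total = sum(self.word_counts[c].values())
            # log prior + sum of add-one-smoothed log likelihoods
            score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            for word in text.lower().split():
                score += math.log(
                    (self.word_counts[c][word] + 1) / (total + len(self.vocab))
                )
            return score
        return max(self.classes, key=log_posterior)

clf = NaiveBayesDetector().fit(
    ["i um think so", "um yeah maybe", "let us delve deeper", "we delve into topics"],
    ["human", "human", "ai", "ai"],
)
```

Because the model's evidence is per-word log-likelihood ratios, it pairs naturally with LIME-style explanations: the words driving a prediction can be read directly off the learned counts.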

[AI-122] Phrase-Level Adversarial Training for Mitigating Bias in Neural Network-based Automatic Essay Scoring

链接: https://arxiv.org/abs/2409.04795
作者: Haddad Philip,Tsegaye Misikir Tashu
关键词-EN: Automatic Essay Scoring, Automatic Essay, educational purposes, candidates for educational, AES
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic Essay Scoring (AES) is widely used to evaluate candidates for educational purposes. However, due to the lack of representative data, most existing AES systems are not robust, and their scoring predictions are biased towards the most represented data samples. In this study, we propose a model-agnostic phrase-level method to generate an adversarial essay set to address the biases and robustness of AES models. Specifically, we construct an attack test set comprising samples from the original test set and adversarially generated samples using our proposed method. To evaluate the effectiveness of the attack strategy and data augmentation, we conducted a comprehensive analysis utilizing various neural network scoring models. Experimental results show that the proposed approach significantly improves AES model performance in the presence of adversarial examples and scenarios without such attacks.
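The attack-set construction the abstract describes (original test samples plus adversarially generated ones) can be sketched with a phrase-substitution operator: swap phrases for score-neutral paraphrases and check whether the AES model's score moves. The substitution table here is hypothetical; the paper's generation method is model-agnostic but more sophisticated than simple string replacement:

```python
def phrase_substitution_attack(essay, substitutions):
    """Produce an adversarial essay by replacing phrases with
    score-neutral paraphrases; a robust scorer should rate both
    versions the same."""
    adversarial = essay
    for phrase, paraphrase in substitutions.items():
        adversarial = adversarial.replace(phrase, paraphrase)
    return adversarial

def build_attack_set(essays, substitutions):
    """Attack test set = originals plus their adversarial counterparts."""
    return essays + [phrase_substitution_attack(e, substitutions) for e in essays]

subs = {"in conclusion": "to sum up"}
attack_set = build_attack_set(["in conclusion, art matters"], subs)
```

Measuring the score gap between each original/adversarial pair quantifies the bias and robustness issues the paper targets, and the adversarial copies double as augmentation data for retraining.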

[AI-123] Action is the primary key: a categorical framework for episode description and logical reasoning

链接: https://arxiv.org/abs/2409.04793
作者: Yoshiki Fukada
关键词-EN: logical reasoning, research presents, presents a computational, describing and recognizing, computational framework
类目: Artificial Intelligence (cs.AI)
*备注: 26 pages, 18 figures, 4 tables

点击查看摘要

Abstract:This research presents a computational framework for describing and recognizing episodes and for logical reasoning. This framework, named cognitive-logs, consists of a set of relational and graph databases. Cognitive-logs record knowledge, particularly in episodes that consist of “actions” represented by verbs in natural languages and “participants” who perform the actions. These objects are connected by arrows (morphisms) that link each action to its participant and link cause to effect. Operations based on category theory enable comparisons between episodes and deductive inferences, including abstractions of stories. One of the goals of this study is to develop a database-driven artificial intelligence. This artificial intelligence thinks like a human but possesses the accuracy and rigour of a machine. The vast capacities of databases (up to petabyte scales in current technologies) enable the artificial intelligence to store a greater volume of knowledge than neural-network-based artificial intelligences. Cognitive-logs serve as a model of human cognition and are designed with reference to cognitive linguistics. Cognitive-logs also have the potential to model various human mind activities.
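The episode structure described here (verb-actions, participants, and cause-to-effect arrows) maps naturally onto a small graph store. The sketch below illustrates that data model and one simple deduction (transitively following cause arrows); it is a toy stand-in for the paper's relational/graph databases and categorical operations:

```python
class EpisodeGraph:
    """Tiny cognitive-log stand-in: actions are nodes, morphisms link
    each action to its participant and causes to their effects."""

    def __init__(self):
        self.actions = {}    # action_id -> verb
        self.performs = {}   # action_id -> participant (morphism target)
        self.causes = []     # (cause_action_id, effect_action_id) arrows

    def add_action(self, aid, verb, participant):
        self.actions[aid] = verb
        self.performs[aid] = participant

    def add_cause(self, cause, effect):
        self.causes.append((cause, effect))

    def effects_of(self, aid):
        """Deduce all downstream effects: transitive closure of cause arrows."""
        found, frontier = set(), [aid]
        while frontier:
            cur = frontier.pop()
            for c, e in self.causes:
                if c == cur and e not in found:
                    found.add(e)
                    frontier.append(e)
        return found

g = EpisodeGraph()
g.add_action("a1", "push", "alice")
g.add_action("a2", "fall", "vase")
g.add_action("a3", "break", "vase")
g.add_cause("a1", "a2")
g.add_cause("a2", "a3")
```

Comparing two episodes then reduces to comparing their graphs, which is the kind of operation the categorical machinery in the paper formalizes.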

[AI-124] Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn

链接: https://arxiv.org/abs/2409.04792
作者: Hongyao Tang,Glen Berseth
关键词-EN: Deep neural networks, provide Reinforcement Learning, networks provide Reinforcement, large-scale decision-making problems, address large-scale decision-making
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep neural networks provide Reinforcement Learning (RL) powerful function approximators to address large-scale decision-making problems. However, these approximators introduce challenges due to the non-stationary nature of RL training. One source of the challenges in RL is that output predictions can churn, leading to uncontrolled changes after each batch update for states not included in the batch. Although such a churn phenomenon exists in each step of network training, how churn occurs and impacts RL remains under-explored. In this work, we start by characterizing churn in a view of Generalized Policy Iteration with function approximation, and we discover a chain effect of churn that leads to a cycle where the churns in value estimation and policy improvement compound and bias the learning dynamics throughout the iteration. Further, we concretize the study and focus on the learning issues caused by the chain effect in different settings, including greedy action deviation in value-based methods, trust region violation in proximal policy optimization, and dual bias of policy value in actor-critic methods. We then propose a method to reduce the chain effect across different settings, called Churn Approximated ReductIoN (CHAIN), which can be easily plugged into most existing DRL algorithms. Our experiments demonstrate the effectiveness of our method in both reducing churn and improving learning performance across online and offline, value-based and policy-based RL settings, as well as a scaling setting.
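The churn phenomenon at the heart of this paper has a simple observable: after one batch update, how many states (including states not in the batch) change their greedy action? A minimal measurement of that quantity, using tabular Q-values as a stand-in for the network's predictions, looks like this:

```python
def policy_churn(q_before, q_after):
    """Fraction of states whose greedy action changes after an update.

    q_before, q_after: dict mapping state -> list of action values.
    This is the kind of quantity CHAIN aims to keep small; the paper's
    definition over function approximators is analogous.
    """
    changed = sum(
        1 for s in q_before
        if max(range(len(q_before[s])), key=q_before[s].__getitem__)
        != max(range(len(q_after[s])), key=q_after[s].__getitem__)
    )
    return changed / len(q_before)
```

Tracking this statistic across training steps exposes the chain effect the authors describe: churned value estimates shift greedy actions, which in turn bias the next round of value targets.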

[AI-125] Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models

Link: https://arxiv.org/abs/2409.04787
Authors: Sonam Gupta,Yatin Nandwani,Asaf Yehudai,Mayank Mishra,Gaurav Pandey,Dinesh Raghu,Sachindra Joshi
Keywords-EN: Fine-tuning Large Language, Large Language Models, Large Language, Fine-tuning Large, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 14 pages, 8 figures

Click to view abstract

Abstract:Fine-tuning Large Language Models (LLMs) on specific datasets is a common practice to improve performance on target tasks. However, this performance gain often leads to overfitting, where the model becomes too specialized in either the task or the characteristics of the training data, resulting in a loss of generalization. This paper introduces Selective Self-Rehearsal (SSR), a fine-tuning approach that achieves performance comparable to the standard supervised fine-tuning (SFT) while improving generalization. SSR leverages the fact that there can be multiple valid responses to a query. By utilizing the model’s correct responses, SSR reduces model specialization during the fine-tuning stage. SSR first identifies the correct model responses from the training set by deploying an appropriate LLM as a judge. Then, it fine-tunes the model using the correct model responses and the gold response for the remaining samples. The effectiveness of SSR is demonstrated through experiments on the task of identifying unanswerable queries across various datasets. The results show that standard SFT can lead to an average performance drop of up to 16.7% on multiple benchmarks, such as MMLU and TruthfulQA. In contrast, SSR results in close to 2% drop on average, indicating better generalization capabilities compared to standard SFT.
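
The data-selection step of SSR can be sketched in a few lines. Everything here is a stand-in: `judge` replaces the LLM-as-a-judge call, and the toy queries and answers are invented for illustration.

```python
def build_ssr_dataset(queries, gold, model_answers, judge):
    """Selective Self-Rehearsal data selection (sketch): keep the model's
    own response when the judge accepts it, otherwise use the gold one."""
    dataset = []
    for q, g, m in zip(queries, gold, model_answers):
        dataset.append((q, m if judge(q, m) else g))
    return dataset

# Stand-in for an LLM judge: accept answers mentioning the query's last word
judge = lambda q, a: q.split()[-1] in a

queries = ["capital of France", "capital of Peru"]
gold = ["Paris", "Lima"]
model_answers = ["Paris, in France", "Bogota"]   # the second answer is wrong

data = build_ssr_dataset(queries, gold, model_answers, judge)
print(data)   # keeps the model's first answer, falls back to gold for the second
```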

[AI-126] Leveraging LLMs Graphs and Object Hierarchies for Task Planning in Large-Scale Environments

Link: https://arxiv.org/abs/2409.04775
Authors: Rodrigo Pérez-Dattari,Zhaoting Li,Robert Babuška,Jens Kober,Cosimo Della Santina
Keywords-EN: Planning methods struggle, solving task-level problems, methods struggle, struggle with computational, computational intractability
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments: 8 pages, 6 figures

Click to view abstract

Abstract:Planning methods struggle with computational intractability in solving task-level problems in large-scale environments. This work explores leveraging the commonsense knowledge encoded in LLMs to empower planning techniques to deal with these complex scenarios. We achieve this by efficiently using LLMs to prune irrelevant components from the planning problem’s state space, substantially simplifying its complexity. We demonstrate the efficacy of this system through extensive experiments within a household simulation environment, alongside real-world validation using a 7-DoF manipulator (video this https URL).
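
A minimal sketch of the pruning idea, with a keyword-overlap function standing in for the LLM relevance query (the object names and task are invented):

```python
def prune_state_space(objects, task, relevance_fn, keep_threshold=0.5):
    """Drop objects the (stub) LLM deems irrelevant to the task, shrinking
    the planning problem's state space before the planner runs."""
    return [o for o in objects if relevance_fn(task, o) >= keep_threshold]

# Stand-in for an LLM call: score 1 if the object is mentioned in the task
def relevance_fn(task, obj):
    return 1.0 if obj in task else 0.0

objects = ["mug", "sofa", "sink", "television", "sponge"]
task = "wash the mug in the sink"
print(prune_state_space(objects, task, relevance_fn))   # → ['mug', 'sink']
```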

[AI-127] Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models

Link: https://arxiv.org/abs/2409.04774
Authors: Junfeng Tian,Da Zheng,Yang Cheng,Rui Wang,Colin Zhang,Debing Zhang
Keywords-EN: Large language models, Large language, prioritized expanding, Large, context window
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Large language models (LLM) have prioritized expanding the context window from which models can incorporate more information. However, training models to handle long contexts presents significant challenges. These include the scarcity of high-quality natural long-context data, the potential for performance degradation on short-context tasks, and the reduced training efficiency associated with attention mechanisms. In this paper, we introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase, designed to efficiently enable LLMs to gain long-context capabilities without the need to modify the existing data mixture. In particular, we chunk the documents, shuffle the chunks, and create a complex and knotted structure of long texts; LLMs are then trained to untie these knots and identify relevant segments within seemingly chaotic token sequences. This approach greatly improves the model’s performance by accurately attending to relevant information in long context and the training efficiency is also largely increased. We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length, significantly outperforming other long context strategies. The trained models will be open-sourced for further research.
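
The chunk-shuffle-untie construction can be sketched as below. This is an interpretation of the described augmentation, not the released code; the chunk size and target encoding are arbitrary choices.

```python
import random

def untie_the_knots_sample(docs, chunk_size=4, seed=0):
    """Chunk each document, shuffle all chunks into one 'knotted' text, and
    keep the original ordering as the target the model must recover."""
    chunks = []
    for doc_id, doc in enumerate(docs):
        for pos in range(0, len(doc), chunk_size):
            chunks.append((doc_id, pos, doc[pos:pos + chunk_size]))
    shuffled = chunks[:]
    random.Random(seed).shuffle(shuffled)
    knotted_text = "".join(text for _, _, text in shuffled)
    # target: position of each shuffled chunk in the original order
    target_order = [chunks.index(c) for c in shuffled]
    return knotted_text, target_order

knotted, order = untie_the_knots_sample(["abcdefgh", "ABCDEFGH"])
print(knotted, order)   # a permuted text plus the permutation needed to undo it
```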

[AI-128] LMGT: Optimizing Exploration-Exploitation Balance in Reinforcement Learning through Language Model Guided Trade-offs

Link: https://arxiv.org/abs/2409.04744
Authors: Yongxin Deng,Xihe Qiu,Xiaoyu Tan,Wei Chu,Yinghui Xu
Keywords-EN: environmental transition model, agent expected reward, Reinforcement Learning, necessitates a careful, uncertainty inherent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:The uncertainty inherent in the environmental transition model of Reinforcement Learning (RL) necessitates a careful balance between exploration and exploitation to optimize the use of computational resources for accurately estimating an agent’s expected reward. Achieving balance in control systems is particularly challenging in scenarios with sparse rewards. However, given the extensive prior knowledge available for many environments, it is redundant to begin learning from scratch in such settings. To address this, we introduce Language Model Guided Trade-offs (i.e., LMGT), a novel, sample-efficient framework that leverages the comprehensive prior knowledge embedded in Large Language Models (LLMs) and their adeptness at processing non-standard data forms, such as wiki tutorials. LMGT proficiently manages the exploration-exploitation trade-off by employing reward shifts guided by LLMs, which direct agents’ exploration endeavors, thereby improving sample efficiency. We have thoroughly tested LMGT across various RL tasks and deployed it in industrial-grade RL recommendation systems, where it consistently outperforms baseline methods. The results indicate that our framework can significantly reduce the time cost required during the training phase in RL.
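
The reward-shift mechanism reduces to a thin wrapper around the environment step; `llm_shift` below stands in for a query to the language model, and the toy line-world is invented:

```python
def lmgt_step(env_step, llm_shift, state, action):
    """LMGT-style reward shaping (sketch): add an LLM-suggested shift to the
    environment reward to steer the agent's exploration."""
    next_state, reward, done = env_step(state, action)
    return next_state, reward + llm_shift(state, action), done

# Toy environment: walk along a line, goal at position 3
def env_step(state, action):
    next_state = state + action
    return next_state, 1.0 if next_state == 3 else 0.0, next_state == 3

# Stand-in prior knowledge: the "LLM" nudges the agent to move right
llm_shift = lambda state, action: 0.1 if action > 0 else -0.1

print(lmgt_step(env_step, llm_shift, 2, 1))   # → (3, 1.1, True)
```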

[AI-129] Up-sampling-only and Adaptive Mesh-based GNN for Simulating Physical Systems

Link: https://arxiv.org/abs/2409.04740
Authors: Fu Lin,Jiasheng Shi,Shijie Luo,Qinpei Zhao,Weixiong Rao,Lei Chen
Keywords-EN: Partial Differential Equations, Finite Element Method, Differential Equations, Element Method, Partial Differential
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*Comments:

Click to view abstract

Abstract:Traditional simulation of complex mechanical systems relies on numerical solvers of Partial Differential Equations (PDEs), e.g., using the Finite Element Method (FEM). The FEM solvers frequently suffer from intensive computation cost and high running time. Recent graph neural network (GNN)-based simulation models can improve running time meanwhile with acceptable accuracy. Unfortunately, it is hard to tailor GNNs to complex mechanical systems, owing to such disadvantages as ineffective representation and inefficient message propagation (MP). To tackle these issues, in this paper, with the proposed Up-sampling-only and Adaptive MP techniques, we develop a novel hierarchical Mesh Graph Network, namely UA-MGN, for efficient and effective mechanical simulation. Evaluation on two synthetic and one real datasets demonstrates the superiority of the UA-MGN. For example, on the Beam dataset, compared to the state-of-the-art MS-MGN, UA-MGN leads to 40.99% lower errors but using only 43.48% fewer network parameters and 4.49% fewer floating point operations (FLOPs).

[AI-130] VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery

Link: https://arxiv.org/abs/2409.04732
Authors: Mohammadmahdi Honarmand,Muhammad Abdullah Jamal,Omid Mohareri
Keywords-EN: pre-training framework designed, framework designed specifically, laparoscopic surgery, designed specifically, specifically for robotic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:We introduce VidLPRO, a novel video-language (VL) pre-training framework designed specifically for robotic and laparoscopic surgery. While existing surgical VL models primarily rely on contrastive learning, we propose a more comprehensive approach to capture the intricate temporal dynamics and align video with language. VidLPRO integrates video-text contrastive learning, video-text matching, and masked language modeling objectives to learn rich VL representations. To support this framework, we present GenSurg+, a carefully curated dataset derived from GenSurgery, comprising 17k surgical video clips paired with captions generated by GPT-4 using transcripts extracted by the Whisper model. This dataset addresses the need for large-scale, high-quality VL data in the surgical domain. Extensive experiments on benchmark datasets, including Cholec80 and AutoLaparo, demonstrate the efficacy of our approach. VidLPRO achieves state-of-the-art performance in zero-shot surgical phase recognition, significantly outperforming existing surgical VL models such as SurgVLP and HecVL. Our model demonstrates improvements of up to 21.5% in accuracy and 15.7% in F1 score, setting a new benchmark in the field. Notably, VidLPRO exhibits robust performance even with single-frame inference, while effectively scaling with increased temporal context. Ablation studies reveal the impact of frame sampling strategies on model performance and computational efficiency. These results underscore VidLPRO’s potential as a foundation model for surgical video understanding.
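
Of the three training objectives, the video-text contrastive term is the most standard. Below is a numpy sketch of a symmetric InfoNCE loss — the generic formulation, not VidLPRO's exact implementation; the temperature value and random embeddings are arbitrary:

```python
import numpy as np

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: matching
    video/text pairs sit on the diagonal of the similarity matrix."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    labels = np.arange(len(v))

    def ce(l):   # row-wise softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return (ce(logits) + ce(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
aligned = video_text_contrastive_loss(emb, emb)         # perfectly paired batch
shuffled = video_text_contrastive_loss(emb, emb[::-1])  # mismatched pairs
print(aligned, shuffled)   # the aligned batch gets the (much) lower loss
```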

[AI-131] A Comprehensive Survey on Evidential Deep Learning and Its Applications

Link: https://arxiv.org/abs/2409.04720
Authors: Junyu Gao,Mengyuan Chen,Liangyu Xiang,Changsheng Xu
Keywords-EN: Reliable uncertainty estimation, deep learning algorithms, uncertainty estimation, medical diagnosis, Reliable uncertainty
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Reliable uncertainty estimation has become a crucial requirement for the industrial deployment of deep learning algorithms, particularly in high-risk applications such as autonomous driving and medical diagnosis. However, mainstream uncertainty estimation methods, based on deep ensembling or Bayesian neural networks, generally impose substantial computational overhead. To address this challenge, a novel paradigm called Evidential Deep Learning (EDL) has emerged, providing reliable uncertainty estimation with minimal additional computation in a single forward pass. This survey provides a comprehensive overview of the current research on EDL, designed to offer readers a broad introduction to the field without assuming prior knowledge. Specifically, we first delve into the theoretical foundation of EDL, the subjective logic theory, and discuss its distinctions from other uncertainty estimation frameworks. We further present existing theoretical advancements in EDL from four perspectives: reformulating the evidence collection process, improving uncertainty estimation via OOD samples, delving into various training strategies, and evidential regression networks. Thereafter, we elaborate on its extensive applications across various machine learning paradigms and downstream tasks. In the end, an outlook on future directions for better performances and broader adoption of EDL is provided, highlighting potential research avenues.
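
The single-forward-pass uncertainty at the core of EDL comes from subjective logic: a network outputs non-negative evidence per class, which parameterizes a Dirichlet distribution. A small sketch of those standard formulas (the evidence vectors are invented examples):

```python
import numpy as np

def edl_quantities(evidence):
    """Subjective-logic quantities used in Evidential Deep Learning:
    alpha_k = e_k + 1, belief b_k = e_k / S, uncertainty u = K / S,
    where S = sum(alpha). Beliefs and uncertainty sum to one."""
    evidence = np.asarray(evidence, dtype=float)
    K = len(evidence)
    alpha = evidence + 1.0
    S = alpha.sum()
    return evidence / S, K / S, alpha / S   # belief, uncertainty, expected probs

b, u, p = edl_quantities([9.0, 1.0, 0.0])    # strong evidence for class 0
print(u)                                     # 3 / 13 ≈ 0.231
b2, u2, _ = edl_quantities([0.0, 0.0, 0.0])  # no evidence at all
print(u2)                                    # 1.0: total uncertainty
```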

[AI-132] Algorithmic Scenario Generation as Quality Diversity Optimization

Link: https://arxiv.org/abs/2409.04711
Authors: Stefanos Nikolaidis
Keywords-EN: increasing complexity, complexity of robots, robots and autonomous, autonomous agents, agents that interact
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:The increasing complexity of robots and autonomous agents that interact with people highlights the critical need for approaches that systematically test them before deployment. This review paper presents a general framework for solving this problem, describes the insights that we have gained from working on each component of the framework, and shows how integrating these components leads to the discovery of a diverse range of realistic and challenging scenarios that reveal previously unknown failures in deployed robotic systems interacting with people.

[AI-133] Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Link: https://arxiv.org/abs/2409.04707
Authors: Yuhan Ma,Dan Sun,Erdi Gao,Ningjing Sang,Iris Li,Guanming Huang
Keywords-EN: optimal system performance, pivotal scientific instrument, achieving optimal system, Optimization theory serves, Optimization theory
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Optimization theory serves as a pivotal scientific instrument for achieving optimal system performance, with its origins in economic applications to identify the best investment strategies for maximizing benefits. Over the centuries, from the geometric inquiries of ancient Greece to the calculus contributions by Newton and Leibniz, optimization theory has significantly advanced. The persistent work of scientists like Lagrange, Cauchy, and von Neumann has fortified its progress. The modern era has seen an unprecedented expansion of optimization theory applications, particularly with the growth of computer science, enabling more sophisticated computational practices and widespread utilization across engineering, decision analysis, and operations research. This paper delves into the profound relationship between optimization theory and deep learning, highlighting the omnipresence of optimization problems in the latter. We explore the gradient descent algorithm and its variants, which are the cornerstone of optimizing neural networks. The chapter introduces an enhancement to the SGD optimizer, drawing inspiration from numerical optimization methods, aiming to enhance interpretability and accuracy. Our experiments on diverse deep learning tasks substantiate the improved algorithm’s efficacy. The paper concludes by emphasizing the continuous development of optimization theory and its expanding role in solving intricate problems, enhancing computational capabilities, and informing better policy decisions.

[AI-134] A Multi-scenario Attention-based Generative Model for Personalized Blood Pressure Time Series Forecasting

Link: https://arxiv.org/abs/2409.04704
Authors: Cheng Wan,Chenjie Xie,Longfei Liu,Dan Wu,Ye Li
Keywords-EN: Continuous blood pressure, critical care settings, blood pressure, monitoring is essential, essential for timely
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 5 pages, 2 figures

Click to view abstract

Abstract:Continuous blood pressure (BP) monitoring is essential for timely diagnosis and intervention in critical care settings. However, BP varies significantly across individuals, this inter-patient variability motivates the development of personalized models tailored to each patient’s physiology. In this work, we propose a personalized BP forecasting model mainly using electrocardiogram (ECG) and photoplethysmogram (PPG) signals. This time-series model incorporates 2D representation learning to capture complex physiological relationships. Experiments are conducted on datasets collected from three diverse scenarios with BP measurements from 60 subjects total. Results demonstrate that the model achieves accurate and robust BP forecasts across scenarios within the Association for the Advancement of Medical Instrumentation (AAMI) standard criteria. This reliable early detection of abnormal fluctuations in BP is crucial for at-risk patients undergoing surgery or intensive care. The proposed model provides a valuable addition for continuous BP tracking to reduce mortality and improve prognosis.

[AI-135] MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

Link: https://arxiv.org/abs/2409.04693
Authors: Ruiting Dai,Yuqiao Tan,Lisi Mo,Tao He,Ke Qin,Shuang Liang
Keywords-EN: garnered considerable attention, prompt learning, garnered considerable, considerable attention, Adaptive Prompt Learning
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Recently, prompt learning has garnered considerable attention for its success in various Vision-Language (VL) tasks. However, existing prompt-based models are primarily focused on studying prompt generation and prompt strategies with complete modality settings, which does not accurately reflect real-world scenarios where partial modality information may be missing. In this paper, we present the first comprehensive investigation into prompt learning behavior when modalities are incomplete, revealing the high sensitivity of prompt-based models to missing modalities. To this end, we propose a novel Multi-step Adaptive Prompt Learning (MuAP) framework, aiming to generate multimodal prompts and perform multi-step prompt tuning, which adaptively learns knowledge by iteratively aligning modalities. Specifically, we generate multimodal prompts for each modality and devise prompt strategies to integrate them into the Transformer model. Subsequently, we sequentially perform prompt tuning from single-stage and alignment-stage, allowing each modality-prompt to be autonomously and adaptively learned, thereby mitigating the imbalance issue caused by only textual prompts that are learnable in previous works. Extensive experiments demonstrate the effectiveness of our MuAP and this model achieves significant improvements compared to the state-of-the-art on all benchmark datasets.

[AI-136] Solving Stochastic Orienteering Problems with Chance Constraints Using a GNN Powered Monte Carlo Tree Search

Link: https://arxiv.org/abs/2409.04653
Authors: Marcos Abel Zuzuárregui,Stefano Carpin
Keywords-EN: Monte Carlo Tree, Carlo Tree Search, graph neural network, Monte Carlo, Carlo Tree
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments: 8 pages, 6 figures

Click to view abstract

Abstract:Leveraging the power of a graph neural network (GNN) with message passing, we present a Monte Carlo Tree Search (MCTS) method to solve stochastic orienteering problems with chance constraints. While adhering to an assigned travel budget the algorithm seeks to maximize collected reward while incurring stochastic travel costs. In this context, the acceptable probability of exceeding the assigned budget is expressed as a chance constraint. Our MCTS solution is an online and anytime algorithm alternating planning and execution that determines the next vertex to visit by continuously monitoring the remaining travel budget. The novelty of our work is that the rollout phase in the MCTS framework is implemented using a message passing GNN, predicting both the utility and failure probability of each available action. This allows to enormously expedite the search process. Our experimental evaluation shows that with the proposed method and architecture we manage to efficiently solve complex problem instances while incurring in moderate losses in terms of collected reward. Moreover, we demonstrate how the approach is capable of generalizing beyond the characteristics of the training dataset. The paper’s website, open-source code, and supplementary documentation can be found at this http URL.
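
The chance-constrained action choice at the heart of the planner can be sketched as follows; `predict` stands in for the message-passing GNN that estimates utility and failure probability, and the toy vertices and cost model are invented:

```python
def select_action(actions, predict, budget_left, max_failure_prob=0.1):
    """Keep only actions whose predicted probability of exceeding the travel
    budget satisfies the chance constraint, then maximize predicted utility."""
    feasible = [a for a in actions
                if predict(a, budget_left)[1] <= max_failure_prob]
    if not feasible:
        return None
    return max(feasible, key=lambda a: predict(a, budget_left)[0])

# Stand-in for the GNN: reward is known; risk grows as cost approaches budget
predict = lambda a, budget: (a["reward"],
                             max(0.0, (a["cost"] - budget) / budget + 0.2))

actions = [{"reward": 5, "cost": 12}, {"reward": 3, "cost": 2}]
# the high-reward vertex violates the chance constraint and is filtered out
print(select_action(actions, predict, budget_left=10))
```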

[AI-137] Stacked Universal Successor Feature Approximators for Safety in Reinforcement Learning

Link: https://arxiv.org/abs/2409.04641
Authors: Ian Cannon,Washington Garcia,Thomas Gresavage,Joseph Saurine,Ian Leong,Jared Culbertson
Keywords-EN: complex objective structures, involve complex objective, problems often involve, involve complex, structures that resist
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 13 pages

Click to view abstract

Abstract:Real-world problems often involve complex objective structures that resist distillation into reinforcement learning environments with a single objective. Operation costs must be balanced with multi-dimensional task performance and end-states’ effects on future availability, all while ensuring safety for other agents in the environment and the reinforcement learning agent itself. System redundancy through secondary backup controllers has proven to be an effective method to ensure safety in real-world applications where the risk of violating constraints is extremely high. In this work, we investigate the utility of a stacked, continuous-control variation of universal successor feature approximation (USFA) adapted for soft actor-critic (SAC) and coupled with a suite of secondary safety controllers, which we call stacked USFA for safety (SUSFAS). Our method improves performance on secondary objectives compared to SAC baselines using an intervening secondary controller such as a runtime assurance (RTA) controller.

[AI-138] Decentralized Learning in General-sum Markov Games

Link: https://arxiv.org/abs/2409.04613
Authors: Chinmay Maheshwari,Manxi Wu,Shankar Sastry
Keywords-EN: approximate Nash equilibria, decentralized learning algorithms, Nash equilibria, Markov game framework, uncertain societal-scale systems
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY); Optimization and Control (math.OC)
*Comments: 16 pages, 1 figure

Click to view abstract

Abstract:The Markov game framework is widely used to model interactions among agents with heterogeneous utilities in dynamic and uncertain societal-scale systems. In these systems, agents typically operate in a decentralized manner due to privacy and scalability concerns, often acting without any information about other agents. The design and analysis of decentralized learning algorithms that provably converge to rational outcomes remain elusive, especially beyond Markov zero-sum games and Markov potential games, which do not adequately capture the nature of many real-world interactions that is neither fully competitive nor fully cooperative. This paper investigates the design of decentralized learning algorithms for general-sum Markov games, aiming to provide provable guarantees of convergence to approximate Nash equilibria in the long run. Our approach builds on constructing a Markov Near-Potential Function (MNPF) to address the intractability of designing algorithms that converge to exact Nash equilibria. We demonstrate that MNPFs play a central role in ensuring the convergence of an actor-critic-based decentralized learning algorithm to approximate Nash equilibria. By leveraging a two-timescale approach, where Q-function estimates are updated faster than policy updates, we show that the system converges to a level set of the MNPF over the set of approximate Nash equilibria. This convergence result is further strengthened if the set of Nash equilibria is assumed to be finite. Our findings provide a new perspective on the analysis and design of decentralized learning algorithms in multi-agent systems.

[AI-139] Detection of False Data Injection Attacks (FDIA) on Power Dynamical Systems With a State Prediction Method

Link: https://arxiv.org/abs/2409.04609
Authors: Abhijeet Sahu,Truc Nguyen,Kejun Chen,Xiangyu Zhang,Malik Hassanaly
Keywords-EN: false data injection, growing cyber-security concern, data injection attacks, FDIA, FDIA detection method
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Under review

Click to view abstract

Abstract:With the deeper penetration of inverter-based resources in power systems, false data injection attacks (FDIA) are a growing cyber-security concern. They have the potential to disrupt the system’s stability like frequency stability, thereby leading to catastrophic failures. Therefore, an FDIA detection method would be valuable to protect power systems. FDIAs typically induce a discrepancy between the desired and the effective behavior of the power system dynamics. A suitable detection method can leverage power dynamics predictions to identify whether such a discrepancy was induced by an FDIA. This work investigates the efficacy of temporal and spatio-temporal state prediction models, such as Long Short-Term Memory (LSTM) and a combination of Graph Neural Networks (GNN) with LSTM, for predicting frequency dynamics in the absence of an FDIA but with noisy measurements, and thereby identify FDIA events. For demonstration purposes, the IEEE 39 New England Kron-reduced model simulated with a swing equation is considered. It is shown that the proposed state prediction models can be used as a building block for developing an effective FDIA detection method that can maintain high detection accuracy across various attack and deployment settings. It is also shown how the FDIA detection should be deployed to limit its exposure to detection inaccuracies and mitigate its computational burden.
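
The detection principle — flag measurements that deviate too far from a state prediction — can be shown with a residual z-score test. This is a generic sketch, not the paper's LSTM/GNN predictors; the toy frequency signal and attack magnitude are invented:

```python
import numpy as np

def detect_fdia(measured, predicted, threshold=3.0):
    """Flag samples whose prediction residual is an outlier (z-score test)."""
    residual = measured - predicted
    z = (residual - residual.mean()) / residual.std()
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
true_freq = 60.0 + 0.01 * np.sin(np.linspace(0, 6, 200))   # Hz, toy dynamics
predicted = true_freq                             # assume an accurate state predictor
measured = true_freq + rng.normal(0, 0.002, 200)  # noisy measurements
measured[120] += 0.05                             # injected false data at one sample
flags = detect_fdia(measured, predicted)
print(np.flatnonzero(flags))                      # → [120]
```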

[AI-140] The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

Link: https://arxiv.org/abs/2409.04600
Authors: Dmitry Scherbakov,Nina Hubig,Vinita Jansari,Alexander Bakumenko,Leslie A. Lenert
Keywords-EN: Large Language Models, Large Language, usage of Large, Language Models, study aims
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
*Comments: 18 main pages with 5 figures and 1 table, references, followed by supplementary materials

Click to view abstract

Abstract:Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. Screening and extraction process took place in Covidence with the help of LLM add-on which uses OpenAI gpt-4o model. ChatGPT was used to clean extracted data and generate code for figures in this manuscript, ChatGPT and this http URL were used in drafting all components of the manuscript, except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLM emerged as the most dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used LLM during their creation. Most citations focused on automation of a particular stage of review, such as Searching for publications (n=60, 34.9%), and Data extraction (n=54, 31.4%). When comparing pooled performance of GPT-based and BERT-based models, the former were better in data extraction with mean precision 83.0% (SD=10.4), and recall 86.0% (SD=9.8), while being slightly less accurate in title and abstract screening stage (mean accuracy=77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results looked promising, and we anticipate that LLMs will change in the near future the way the scientific reviews are conducted.

[AI-141] CubicML: Automated ML for Distributed ML Systems Co-design with ML Prediction of Performance

Link: https://arxiv.org/abs/2409.04585
Authors: Wei Wen,Quanyu Zhu,Weiwei Chu,Wen-Yen Chen,Jiyan Yang
Keywords-EN: Scaling up deep, deep learning models, deep learning, machine learning, proven effective
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:Scaling up deep learning models has been proven effective to improve intelligence of machine learning (ML) models, especially for industry recommendation models and large language models. The co-design of distributed ML systems and algorithms (to maximize training performance) plays a pivotal role for its success. As it scales, the number of co-design hyper-parameters grows rapidly which brings challenges to feasibly find the optimal setup for system performance maximization. In this paper, we propose CubicML which uses ML to automatically optimize training performance of distributed ML systems. In CubicML, we use a ML model as a proxy to predict the training performance for search efficiency and performance modeling flexibility. We proved that CubicML can effectively optimize training speed of in-house ads recommendation models and large language models at Meta.
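
The proxy-model idea can be shown in miniature: fit a cheap predictor on configurations whose training speed was already measured, then rank untried candidates with it. The k-nearest-neighbor proxy and the toy speed landscape below are illustrative choices, not Meta's system:

```python
import numpy as np

def surrogate_search(tried_configs, tried_speed, candidates, k=3):
    """Rank untried co-design configs with an ML proxy (here: the mean
    measured speed of the k nearest benchmarked configs) and return the
    most promising candidate."""
    def predict(c):
        nearest = np.argsort(np.linalg.norm(tried_configs - c, axis=1))[:k]
        return tried_speed[nearest].mean()
    return candidates[int(np.argmax([predict(c) for c in candidates]))]

rng = np.random.default_rng(0)
speed = lambda c: -((c - 0.5) ** 2).sum()   # toy truth: speed peaks at (0.5, 0.5)
tried = rng.uniform(0, 1, size=(30, 2))     # already-benchmarked configs
best = surrogate_search(tried, np.array([speed(c) for c in tried]),
                        candidates=rng.uniform(0, 1, size=(200, 2)))
print(best)                                 # a candidate near (0.5, 0.5)
```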

[AI-142] ActionFlow: Equivariant Accurate and Efficient Policies with Spatially Symmetric Flow Matching

Link: https://arxiv.org/abs/2409.04576
Authors: Niklas Funk,Julen Urain,Joao Carvalho,Vignesh Prasad,Georgia Chalvatzaki,Jan Peters
Keywords-EN: critical aspect, Spatial, Flow Matching, ActionFlow, Spatial understanding
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Spatial understanding is a critical aspect of most robotic tasks, particularly when generalization is important. Despite the impressive results of deep generative models in complex manipulation tasks, the absence of a representation that encodes intricate spatial relationships between observations and actions often limits spatial generalization, necessitating large amounts of demonstrations. To tackle this problem, we introduce a novel policy class, ActionFlow. ActionFlow integrates spatial symmetry inductive biases while generating expressive action sequences. On the representation level, ActionFlow introduces an SE(3) Invariant Transformer architecture, which enables informed spatial reasoning based on the relative SE(3) poses between observations and actions. For action generation, ActionFlow leverages Flow Matching, a state-of-the-art deep generative model known for generating high-quality samples with fast inference - an essential property for feedback control. In combination, ActionFlow policies exhibit strong spatial and locality biases and SE(3)-equivariant action generation. Our experiments demonstrate the effectiveness of ActionFlow and its two main components on several simulated and real-world robotic manipulation tasks and confirm that we can obtain equivariant, accurate, and efficient policies with spatially symmetric flow matching. Project website: this https URL
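
The flow-matching objective ActionFlow builds on can be written in a few lines: sample a point on the straight path between noise and data, and regress the model onto the constant target velocity. A generic numpy sketch of that loss, not the paper's SE(3)-equivariant architecture:

```python
import numpy as np

def flow_matching_loss(v_model, x0, x1, t):
    """Conditional flow matching: on x_t = (1 - t) * x0 + t * x1 the target
    velocity is x1 - x0; the model v_model(x_t, t) is regressed onto it."""
    t = t.reshape(-1, 1)
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    return ((v_model(xt, t) - target) ** 2).mean()

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 2))        # noise samples
x1 = rng.normal(size=(16, 2)) + 3.0  # toy "action" samples
t = rng.uniform(size=16)

perfect = flow_matching_loss(lambda xt, t: x1 - x0, x0, x1, t)        # oracle model
bad = flow_matching_loss(lambda xt, t: np.zeros_like(xt), x0, x1, t)  # uninformed model
print(perfect, bad)   # 0.0 for the oracle, positive otherwise
```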

[AI-143] Neurosymbolic Methods for Dynamic Knowledge Graphs

Link: https://arxiv.org/abs/2409.04572
Authors: Mehwish Alam,Genet Asefa Gesese,Pierre-Henri Paris
Keywords-EN: Knowledge graphs, tools and applications, structured format, rich resources, resources in structured
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Knowledge graphs (KGs) have recently been used for many tools and applications, making them rich resources in structured format. However, in the real world, KGs grow due to the additions of new knowledge in the form of entities and relations, making these KGs dynamic. This chapter formally defines several types of dynamic KGs and summarizes how these KGs can be represented. Additionally, many neurosymbolic methods have been proposed for learning representations over static KGs for several tasks such as KG completion and entity alignment. This chapter further focuses on neurosymbolic methods for dynamic KGs with or without temporal information. More specifically, it provides an insight into neurosymbolic methods for dynamic (temporal or non-temporal) KG completion and entity alignment tasks. It further discusses the challenges of current approaches and provides some future directions.

[AI-144] Thinking Outside the BBox: Unconstrained Generative Object Compositing

Link: https://arxiv.org/abs/2409.04559
Authors: Gemma Canet Tarrés,Zhe Lin,Zhifei Zhang,Jianming Zhang,Yizhi Song,Dan Ruta,Andrew Gilbert,John Collomosse,Soo Ye Kim
Keywords-EN: involves multiple non-trivial, multiple non-trivial sub-tasks, lighting harmonization, geometry adjustment, image involves multiple
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Compositing an object into an image involves multiple non-trivial sub-tasks such as object placement and scaling, color/lighting harmonization, viewpoint/geometry adjustment, and shadow/reflection generation. Recent generative image compositing methods leverage diffusion models to handle multiple sub-tasks at once. However, existing models face limitations due to their reliance on masking the original object during training, which constrains their generation to the input mask. Furthermore, obtaining an accurate input mask specifying the location and scale of the object in a new image can be highly challenging. To overcome such limitations, we define a novel problem of unconstrained generative object compositing, i.e., the generation is not bounded by the mask, and train a diffusion-based model on a synthesized paired dataset. Our first-of-its-kind model is able to generate object effects such as shadows and reflections that go beyond the mask, enhancing image realism. Additionally, if an empty mask is provided, our model automatically places the object in diverse natural locations and scales, accelerating the compositing workflow. Our model outperforms existing object placement and compositing models in various quality metrics and user studies.

[AI-145] Learning to Solve Combinatorial Optimization under Positive Linear Constraints via Non-Autoregressive Neural Networks

链接: https://arxiv.org/abs/2409.04495
作者: Runzhong Wang,Yang Li,Junchi Yan,Xiaokang Yang
关键词-EN: Combinatorial optimization, applied mathematics, computer science, intersection of computer, Combinatorial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: English version of the same paper published on Scientia Sinica Informationis

点击查看摘要

Abstract:Combinatorial optimization (CO) is a fundamental problem at the intersection of computer science, applied mathematics, and related fields. The inherent hardness of CO problems makes solving them exactly a challenge, making deep-neural-network-based solvers a research frontier. In this paper, we design a family of non-autoregressive neural networks to solve CO problems under positive linear constraints with the following merits. First, the positive linear constraint covers a wide range of CO problems, indicating that our approach breaks the generality bottleneck of existing non-autoregressive networks. Second, compared to existing autoregressive neural network solvers, our non-autoregressive networks have the advantages of higher efficiency and preserving permutation invariance. Third, our offline unsupervised learning has lower demand on high-quality labels, getting rid of the demand for optimal labels in supervised learning. Fourth, our online differentiable search method significantly improves the generalizability of our neural network solver to unseen problems. We validate the effectiveness of this framework in solving representative CO problems including facility location, max-set covering, and the traveling salesman problem. Our non-autoregressive neural solvers are competitive with and can even be superior to state-of-the-art solvers such as SCIP and Gurobi, especially when both efficiency and efficacy are considered. Code is available at this https URL

[AI-146] Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

链接: https://arxiv.org/abs/2409.04478
作者: Maheep Chaudhary,Atticus Geiger
关键词-EN: high-dimensional sparse autoencoders, train high-dimensional sparse, sparse autoencoders, popular new method, method in mechanistic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: this https URL
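
The feature-patching setup described above — intervening on a binary-masked subset of SAE features to change one attribute while leaving another alone — can be sketched in a few lines. This is a toy illustration with a random, untrained SAE and made-up dimensions, not the RAVEL benchmark or any of the evaluated open-source SAEs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: hidden size of the probed model, SAE dictionary size.
d_model, d_sae = 16, 64

# A randomly initialized (untrained) SAE: encoder, decoder, encoder bias.
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)

def sae_encode(x):
    # ReLU yields the sparse, non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    return f @ W_dec

def patch_features(x_base, x_source, mask):
    """Swap the masked SAE features of x_base with those of x_source,
    then decode back into the model's activation space."""
    f_base, f_src = sae_encode(x_base), sae_encode(x_source)
    f_patched = np.where(mask, f_src, f_base)
    return sae_decode(f_patched)

x_base = rng.normal(size=d_model)  # e.g. activation on a "Paris" prompt
x_src = rng.normal(size=d_model)   # e.g. activation on a "Tokyo" prompt
mask = np.zeros(d_sae, dtype=bool)
mask[:8] = True                    # pretend these 8 features mediate "country"

x_patched = patch_features(x_base, x_src, mask)
print(x_patched.shape)  # (16,)
```

In the paper the mask itself is learned so that patching changes the country prediction without changing the continent; here it is fixed by hand.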

[AI-147] Revolutionizing Database QA with Large Language Models: Comprehensive Benchmark and Evaluation

链接: https://arxiv.org/abs/2409.04475
作者: Yihang Zheng,Bo Li,Zhenghao Lin,Yi Luo,Xuanhe Zhou,Chen Lin,Jinsong Su,Guoliang Li,Shifu Li
关键词-EN: Large Language Models, Language Models, Large Language, database, development of Large
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:The development of Large Language Models (LLMs) has revolutionized QA across various industries, including the database domain. However, there is still a lack of a comprehensive benchmark to evaluate the capabilities of different LLMs and their modular components in database QA. To this end, we introduce DQA, the first comprehensive database QA benchmark. DQA features an innovative LLM-based method for automating the generation, cleaning, and rewriting of database QA, resulting in over 240,000 QA pairs in English and Chinese. These QA pairs cover nearly all aspects of database knowledge, including database manuals, database blogs, and database tools. This inclusion allows for additional assessment of LLMs’ Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database QA task. Furthermore, we propose a comprehensive LLM-based database QA testbed on DQA. This testbed is highly modular and scalable, with both basic and advanced components like Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). Besides, DQA provides a complete evaluation pipeline, featuring diverse metrics and a standardized evaluation process to ensure comprehensiveness, accuracy, and fairness. We use DQA to evaluate the database QA capabilities under the proposed testbed comprehensively. The evaluation reveals findings like (i) the strengths and limitations of nine different LLM-based QA bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). We hope our benchmark and findings will better guide the future development of LLM-based database QA research.

[AI-148] Learning in Order! A Sequential Strategy to Learn Invariant Features for Multimodal Sentiment Analysis

链接: https://arxiv.org/abs/2409.04473
作者: Xianbing Zhao,Lizhen Qu,Tao Feng,Jianfei Cai,Buzhou Tang
关键词-EN: simple sequential learning, multimodal sentiment analysis, simple sequential, sequential learning strategy, sentiment analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained either in a single source domain or multiple source domains using our learning strategy. This strategy starts with learning domain invariant features from text, followed by learning sparse domain-agnostic features from videos, assisted by the selected features learned in text. Our experimental results demonstrate that our model achieves significantly better performance than the state-of-the-art approaches on average in both single-source and multi-source settings. Our feature selection procedure favors the features that are independent to each other and are strongly correlated with their polarity labels. To facilitate research on this topic, the source code of this work will be publicly available upon acceptance.

[AI-149] Intensional FOL: Many-Sorted Extension

链接: https://arxiv.org/abs/2409.04469
作者: Zoran Majkic
关键词-EN: sorted attributes, IFOL, intensional concepts, list of sorted, many-sorted IFOL
类目: Artificial Intelligence (cs.AI)
*备注: 21 pages

点击查看摘要

Abstract:The concepts used in IFOL have associated to them a list of sorted attributes, and the sorts are the intensional concepts as well. The requirement to extend the unsorted IFOL (Intensional FOL) to many-sorted IFOL is mainly based on the fact that a natural language is implicitly many-sorted and that we intend to use IFOL to support applications that use natural languages. Thus, the proposed version of many-sorted IFOL is just the completion of this conceptual feature of the IFOL.

[AI-150] Here's Charlie! Realising the Semantic Web vision of Agents in the age of LLMs

链接: https://arxiv.org/abs/2409.04465
作者: Jesse Wright
关键词-EN: entrust semi-autonomous AI-driven, semi-autonomous AI-driven agents, legal entities, semi-autonomous Web agents, carry out online
类目: Artificial Intelligence (cs.AI)
*备注: The 23rd International Semantic Web Conference, November 11–15, 2024, Hanover, MD - Posters and Demos track

点击查看摘要

Abstract:This paper presents our research towards a near-term future in which legal entities, such as individuals and organisations can entrust semi-autonomous AI-driven agents to carry out online interactions on their behalf. The author’s research concerns the development of semi-autonomous Web agents, which consult users if and only if the system does not have sufficient context or confidence to proceed working autonomously. This creates a user-agent dialogue that allows the user to teach the agent about the information sources they trust, their data-sharing preferences, and their decision-making preferences. Ultimately, this enables the user to maximise control over their data and decisions while retaining the convenience of using agents, including those driven by LLMs. In view of developing near-term solutions, the research seeks to answer the question: “How do we build a trustworthy and reliable network of semi-autonomous agents which represent individuals and organisations on the Web?”. After identifying key requirements, the paper presents a demo for a sample use case of a generic personal assistant. This is implemented using (Notation3) rules to enforce safety guarantees around belief, data sharing and data usage and LLMs to allow natural language interaction with users and serendipitous dialogues between software agents.

[AI-151] Leveraging Large Language Models for Solving Rare MIP Challenges

链接: https://arxiv.org/abs/2409.04464
作者: Teng Wang,Wing-Yin Yu,Ruifeng She,Wenhan Yang,Taijie Chen,Jianping Zhang
关键词-EN: Mixed Integer Programming, Mixed Integer, Integer Programming, tight time constraints, areas requiring mathematical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Mixed Integer Programming (MIP) has been extensively applied in areas requiring mathematical solvers to address complex instances within tight time constraints. However, as the problem scale increases, the complexity of model formulation and finding feasible solutions escalates significantly. In contrast, the model-building cost for end-to-end models, such as large language models (LLMs), remains largely unaffected by problem scale due to their pattern recognition capabilities. While LLMs, like GPT-4, without fine-tuning, can handle some traditional medium-scale MIP problems, they struggle with uncommon or highly specialized MIP scenarios. Fine-tuning LLMs can yield some feasible solutions for medium-scale MIP instances, but these models typically fail to explore diverse solutions when constrained by a low and constant temperature, limiting their performance. In this paper, we propose and evaluate a recursively dynamic temperature method integrated with a chain-of-thought approach. Our findings show that starting with a high temperature and gradually lowering it leads to better feasible solutions compared to other dynamic temperature strategies. Additionally, by comparing results generated by the LLM with those from Gurobi, we demonstrate that the LLM can produce solutions that complement traditional solvers by accelerating the pruning process and improving overall efficiency.
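
The high-to-low temperature schedule the authors describe can be sketched as follows. The schedule parameters and the toy stand-in for the LLM solver are assumptions for illustration, not the paper's actual values or prompting setup:

```python
import random

def anneal_temperatures(t_start=1.2, t_min=0.2, decay=0.8, steps=6):
    """Start hot for diverse exploration, cool toward near-greedy decoding."""
    t = t_start
    schedule = []
    for _ in range(steps):
        schedule.append(round(t, 4))
        t = max(t_min, t * decay)
    return schedule

def solve_with_annealing(try_solve, seed=0):
    """Query a solver at decreasing temperatures; keep the first feasible answer.
    `try_solve(temperature)` stands in for an LLM call and is an assumption here."""
    random.seed(seed)
    for temp in anneal_temperatures():
        candidate = try_solve(temp)
        if candidate is not None:  # a real feasibility check would go here
            return temp, candidate
    return None, None

# Toy stand-in: the "LLM" only sometimes returns a feasible assignment.
def toy_solver(temp):
    return {"x": 1, "y": 2} if random.random() < 0.5 else None

print(anneal_temperatures())  # [1.2, 0.96, 0.768, 0.6144, 0.4915, 0.3932]
print(solve_with_annealing(toy_solver))  # first feasible solution + its temperature
```

The paper pairs this schedule with chain-of-thought prompting; that part is omitted here.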

[AI-152] Process Trace Querying using Knowledge Graphs and Notation3

链接: https://arxiv.org/abs/2409.04452
作者: William Van Woensel
关键词-EN: log exploration step, process mining, identifying event patterns, Resource Description Framework, step allows making
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In process mining, a log exploration step allows making sense of the event traces; e.g., identifying event patterns and illogical traces, and gaining insight into their variability. To support expressive log exploration, the event log can be converted into a Knowledge Graph (KG), which can then be queried using general-purpose languages. We explore the creation of semantic KG using the Resource Description Framework (RDF) as a data model, combined with the general-purpose Notation3 (N3) rule language for querying. We show how typical trace querying constraints, inspired by the state of the art, can be implemented in N3. We convert case- and object-centric event logs into a trace-based semantic KG; OCEL2 logs are hereby “flattened” into traces based on object paths through the KG. This solution offers (a) expressivity, as queries can instantiate constraints in multiple ways and arbitrarily constrain attributes and relations (e.g., actors, resources); (b) flexibility, as OCEL2 event logs can be serialized as traces in arbitrary ways based on the KG; and (c) extensibility, as others can extend our library by leveraging the same implementation patterns.
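
As a rough illustration of the directly-follows style of trace constraint, here is a pure-Python stand-in for the KG; the paper itself builds an RDF graph and queries it with Notation3 rules, which this sketch does not reproduce:

```python
# Minimal in-memory "knowledge graph": a set of (subject, predicate, object)
# triples encoding one trace of three events. All identifiers are made up.
triples = {
    ("e1", "activity", "register"),
    ("e2", "activity", "review"),
    ("e3", "activity", "approve"),
    ("e1", "follows", "e2"),
    ("e2", "follows", "e3"),
    ("e1", "inCase", "case1"),
    ("e2", "inCase", "case1"),
    ("e3", "inCase", "case1"),
}

def directly_follows(triples):
    """Join the `follows` edges with each event's activity label — the kind of
    constraint an N3 rule would express declaratively over the RDF graph."""
    activity = {s: o for (s, p, o) in triples if p == "activity"}
    return sorted((activity[a], activity[b])
                  for (a, p, b) in triples if p == "follows")

print(directly_follows(triples))  # [('register', 'review'), ('review', 'approve')]
```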

[AI-153] Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples ACM-MM

链接: https://arxiv.org/abs/2409.04447
作者: Qi Fan,Yutong Li,Yi Xin,Xinyu Cheng,Guanglai Gao,Miao Ma
关键词-EN: Multimodal Emotion Recognition, Emotion Recognition challenge, Multimodal Emotion, Emotion Recognition, focuses on recognizing
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted by ACM MM Workshop 2024

点击查看摘要

Abstract:The Multimodal Emotion Recognition challenge MER2024 focuses on recognizing emotions using audio, language, and visual signals. In this paper, we present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI), which tackles the issue of limited annotated data in emotion recognition. Firstly, to address the class imbalance, we adopt an oversampling strategy. Secondly, we propose a modality representation combinatorial contrastive learning (MR-CCL) framework on the trimodal input data to establish robust initial models. Thirdly, we explore a self-training approach to expand the training set. Finally, we enhance prediction robustness through a multi-classifier weighted soft voting strategy. Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard. Our project is available at this https URL.
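
The final weighted soft voting step can be sketched directly; the class probabilities and weights below are made-up numbers, not the challenge submission's values:

```python
import numpy as np

# Per-classifier class-probability outputs for one sample (3 emotion classes),
# and per-classifier weights, e.g. proportional to each model's validation
# F-score (all numbers here are illustrative).
probs = np.array([
    [0.6, 0.3, 0.1],   # classifier A
    [0.5, 0.4, 0.1],   # classifier B
    [0.2, 0.7, 0.1],   # classifier C
])
weights = np.array([0.9, 0.8, 0.6])

# Weighted average of probability vectors, renormalized by the weight sum.
fused = (weights[:, None] * probs).sum(axis=0) / weights.sum()
predicted = int(fused.argmax())
print(fused.round(3), predicted)
```

Soft voting averages probabilities rather than hard labels, so a confident minority classifier can still sway the prediction.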

[AI-154] Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

链接: https://arxiv.org/abs/2409.01382
作者: Musfiqur Rahman,SayedHassan Khatoonabadi,Ahmad Abdellatif,Emad Shihab
关键词-EN: Large Language Models, Large Language, Claude, generating source code, Language Models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to a journal for potential publication

点击查看摘要

Abstract:Using Large Language Models (LLMs) has gained popularity among software developers for generating source code. However, the use of LLM-generated code can introduce risks of adding suboptimal, defective, and vulnerable code. This makes it necessary to devise methods for the accurate detection of LLM-generated code. Toward this goal, we perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on CodeSearchNet dataset. We divide our analyses into two parts: function-level and class-level. We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity. We then analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is. In the following step, we use the unique characteristics of Claude 3-generated code to build Machine Learning (ML) models and identify which features of the code snippets make them more detectable by ML models. Our results indicate that Claude 3 tends to generate longer functions, but shorter classes than humans, and this characteristic can be used to detect Claude 3-generated code with ML models with 82% and 66% accuracies for function-level and class-level snippets, respectively.
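
Extracting size and complexity features of the kind used in the study can be sketched with the standard library; the metric names and the branch-count approximation of cyclomatic complexity here are illustrative, not the study's exact 22 features:

```python
import ast
import textwrap

def function_metrics(source):
    """Per-function size metrics in the spirit of Code Lines and
    Cyclomatic Complexity (a rough branch-count approximation)."""
    tree = ast.parse(textwrap.dedent(source))
    out = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            body_lines = node.end_lineno - node.lineno + 1
            n_branches = sum(isinstance(n, (ast.If, ast.For, ast.While))
                             for n in ast.walk(node))
            out[node.name] = {"lines": body_lines,
                              "cyclomatic_approx": 1 + n_branches}
    return out

code = """
def short(x):
    return x + 1

def longer(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""
print(function_metrics(code))
```

Feature vectors like these, computed over many snippets, are what the ML detector in the study is trained on.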

[AI-155] An Introduction to Quantum Reinforcement Learning (QRL)

链接: https://arxiv.org/abs/2409.05846
作者: Samuel Yen-Chi Chen
关键词-EN: sparked considerable interest, Recent advancements, Quantum Reinforcement Learning, reinforcement learning, machine learning
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted by The 15th International Conference on ICT Convergence - ICTC 2024

点击查看摘要

Abstract:Recent advancements in quantum computing (QC) and machine learning (ML) have sparked considerable interest in the integration of these two cutting-edge fields. Among the various ML techniques, reinforcement learning (RL) stands out for its ability to address complex sequential decision-making problems. RL has already demonstrated substantial success in the classical ML community. Now, the emerging field of Quantum Reinforcement Learning (QRL) seeks to enhance RL algorithms by incorporating principles from quantum computing. This paper offers an introduction to this exciting area for the broader AI and ML community.

[AI-156] An encoding of argumentation problems using quadratic unconstrained binary optimization

链接: https://arxiv.org/abs/2409.05524
作者: Marco Baioletti,Francesco Santini
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-157] QuantFactor REINFORCE: Mining Steady Formulaic Alpha Factors with Variance-bounded REINFORCE

链接: https://arxiv.org/abs/2409.05144
作者: Junjie Zhao,Chengxi Zhang,Min Qin,Peng Yang
关键词-EN: discover indicative signals, alpha factor mining, Alpha factors, historical financial market, factor mining methods
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:The goal of alpha factor mining is to discover indicative signals of investment opportunities from the historical financial market data of assets. Deep learning based alpha factor mining methods have shown to be powerful but lack interpretability, making them unacceptable in risk-sensitive real markets. Alpha factors in formulaic forms are more interpretable and therefore favored by market participants, but the search space is complex and powerful explorative methods are needed. Recently, a promising framework was proposed for generating formulaic alpha factors using deep reinforcement learning, and it quickly gained attention from both academia and industry. This paper first argues that the originally employed policy training method, i.e., Proximal Policy Optimization (PPO), faces several important issues in the context of alpha factor mining, making it ineffective at exploring the search space of formulas. Herein, a novel reinforcement learning method based on the well-known REINFORCE algorithm is proposed. Given that the underlying state transition function adheres to the Dirac distribution, the Markov Decision Process within this framework exhibits minimal environmental variability, making the REINFORCE algorithm more appropriate than PPO. A new dedicated baseline is designed to theoretically reduce the high variance commonly suffered by REINFORCE. Moreover, the information ratio is introduced as a reward shaping mechanism to encourage the generation of steady alpha factors that can better adapt to changes in market volatility. Experimental evaluations on various real-asset data show that the proposed algorithm can increase the correlation with asset returns by 3.83% and obtain stronger excess returns compared to the latest alpha factor mining methods, matching the theoretical results well.
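
A minimal REINFORCE-with-baseline loop illustrates the variance-reduction idea on a toy three-token "factor generation" bandit. This is not the paper's MDP, dedicated baseline, or information-ratio reward — just a sketch of the textbook mechanism it builds on, with made-up hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step task: pick one of 3 tokens, each with a fixed expected reward.
# REINFORCE should shift probability mass onto the highest-reward token.
true_rewards = np.array([0.1, 0.5, 0.9])
logits = np.zeros(3)
baseline, lr, beta = 0.0, 0.5, 0.9

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + rng.normal(scale=0.1)  # noisy reward
    # Subtracting a moving-average baseline cuts the gradient's variance
    # without changing its expectation.
    advantage = r - baseline
    baseline = beta * baseline + (1 - beta) * r
    grad = -probs
    grad[a] += 1.0                               # d log pi(a) / d logits
    logits += lr * advantage * grad

print(int(np.argmax(softmax(logits))))  # should typically print 2
```

The paper replaces this generic moving-average baseline with a dedicated one tailored to the Dirac-distributed transitions of the factor-generation MDP.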

[AI-158] Nearest Neighbor CCP-Based Molecular Sequence Analysis

链接: https://arxiv.org/abs/2409.04922
作者: Sarwan Ali,Prakash Chourasia,Bipin Koirala,Murray Patterson
关键词-EN:
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-159] Efficient Training of Transformers for Molecule Property Prediction on Small-scale Datasets

链接: https://arxiv.org/abs/2409.04909
作者: Shivesh Prakash
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-160] NapTune: Efficient Model Tuning for Mood Classification using Previous Night's Sleep Measures along with Wearable Time-series

链接: https://arxiv.org/abs/2409.04723
作者: Debaditya Shome,Nasim Montazeri Ghahjaverestan,Ali Etemad
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ICMI 2024

点击查看摘要

[AI-161] Enhancing Quantum Security over Federated Learning via Post-Quantum Cryptography

链接: https://arxiv.org/abs/2409.04637
作者: Pingzhi Li,Tianlong Chen,Junyu Liu
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submission for IEEE 2024 IEEE Workshop on Quantum IntelLigence, Learning Security (QUILLS), this https URL

点击查看摘要

[AI-162] Zero-Shot Whole Slide Image Retrieval in Histopathology Using Embeddings of Foundation Models DATE

链接: https://arxiv.org/abs/2409.04631
作者: Saghir Alfasly,Peyman Nejat,Ghazal Alabtah,Sobhan Hemati,Krishna Rani Kalari,H.R. Tizhoosh
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper will be updated with more results

点击查看摘要

[AI-163] A Short Survey on Set-Based Aggregation Techniques for Single-Vector WSI Representation in Digital Pathology

链接: https://arxiv.org/abs/2409.04615
作者: S. Hemati,Krishna R. Kalari,H.R. Tizhoosh
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-164] Training quantum machine learning model on cloud without uploading the data

链接: https://arxiv.org/abs/2409.04602
作者: Guang Ping He
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 1 figure

点击查看摘要

[AI-165] The role of data embedding in quantum autoencoders for improved anomaly detection

链接: https://arxiv.org/abs/2409.04519
作者: Jack Y. Araz,Michael Spannowsky
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 8 pages, 5 figures, 4 tables

点击查看摘要

[AI-166] The Current and Future Perspectives of Zinc Oxide Nanoparticles in the Treatment of Diabetes Mellitus

链接: https://arxiv.org/abs/2409.04486
作者: Iqra Yousaf
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注: 21 pages, 1 figure. Includes comprehensive review of synthesis methods and biological evaluations of ZnO nanoparticles in diabetes treatment

点击查看摘要

[AI-167] Large Language Models in Drug Discovery and Development: From Disease Mechanisms to Clinical Trials

链接: https://arxiv.org/abs/2409.04481
作者: Yizhen Zheng,Huan Yee Koh,Maddie Yang,Li Li,Lauren T. May,Geoffrey I. Webb,Shirui Pan,George Church
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-168] Pattern based learning and optimisation through pricing for bin packing problem

链接: https://arxiv.org/abs/2409.04456
作者: Huayan Zhang,Ruibin Bai,Tie-Yan Liu,Jiawei Li,Bingchen Lin,Jianfeng Ren
关键词-EN:
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

计算机视觉

[CV-0] Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering

链接: https://arxiv.org/abs/2409.05867
作者: Benjamin Attal,Dor Verbin,Ben Mildenhall,Peter Hedman,Jonathan T. Barron,Matthew O’Toole,Pratul P. Srinivasan
关键词-EN: require sampling multiple, sampling multiple points, volumetric scene representations, reconstruction are largely, require sampling
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Website: this https URL

点击查看摘要

Abstract:State-of-the-art techniques for 3D reconstruction are largely based on volumetric scene representations, which require sampling multiple points to compute the color arriving along a ray. Using these representations for more general inverse rendering – reconstructing geometry, materials, and lighting from observed images – is challenging because recursively path-tracing such volumetric representations is expensive. Recent works alleviate this issue through the use of radiance caches: data structures that store the steady-state, infinite-bounce radiance arriving at any point from any direction. However, these solutions rely on approximations that introduce bias into the renderings and, more importantly, into the gradients used for optimization. We present a method that avoids these approximations while remaining computationally efficient. In particular, we leverage two techniques to reduce variance for unbiased estimators of the rendering equation: (1) an occlusion-aware importance sampler for incoming illumination and (2) a fast cache architecture that can be used as a control variate for the radiance from a high-quality, but more expensive, volumetric cache. We show that by removing these biases our approach improves the generality of radiance cache based inverse rendering, as well as increasing quality in the presence of challenging light transport effects such as specular reflections.
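
The control-variate idea — using a cheap, correlated estimator with a known expectation to reduce the variance of an expensive unbiased one — can be shown on a scalar Monte Carlo integral. In the paper's setting, the fast radiance cache plays the role of g below; this sketch is a generic textbook example, not the rendering pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate I = E[exp(U)] for U ~ Uniform(0, 1); the true value is e - 1.
# Control variate: g(U) = U, with known mean 0.5 and high correlation with f.
n = 10_000
u = rng.uniform(size=n)
f = np.exp(u)   # "expensive" integrand
g = u           # cheap correlated stand-in with known expectation

# Optimal coefficient c* = Cov(f, g) / Var(g).
c = np.cov(f, g)[0, 1] / np.var(g)

plain = f.mean()
cv = (f - c * (g - 0.5)).mean()  # still unbiased: E[g - 0.5] = 0

true_val = np.e - 1
print(f"plain err {abs(plain - true_val):.4f}, cv err {abs(cv - true_val):.4f}")
```

Because E[g] is known exactly, the correction term has zero mean, so the estimator stays unbiased while its variance drops by roughly the squared correlation between f and g.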

[CV-1] Neural MP: A Generalist Neural Motion Planner

链接: https://arxiv.org/abs/2409.05864
作者: Murtaza Dalal,Jiahui Yang,Russell Mendonca,Youssef Khaky,Ruslan Salakhutdinov,Deepak Pathak
关键词-EN: consumes significant amounts, planning generates solutions, computational resources, motion planning, current paradigm
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Website at this http URL . Main paper: 7 pages, 4 figures, 2 tables. Appendix: 9 pages, 5 figures, 6 tables

点击查看摘要

Abstract:The current paradigm for motion planning generates solutions from scratch for every new problem, which consumes significant amounts of time and computational resources. For complex, cluttered scenes, motion planning approaches can often take minutes to produce a solution, while humans are able to accurately and safely reach any goal in seconds by leveraging their prior experience. We seek to do the same by applying data-driven learning at scale to the problem of motion planning. Our approach builds a large number of complex scenes in simulation, collects expert data from a motion planner, then distills it into a reactive generalist policy. We then combine this with lightweight optimization to obtain a safe path for real world deployment. We perform a thorough evaluation of our method on 64 motion planning tasks across four diverse environments with randomized poses, scenes and obstacles, in the real world, demonstrating an improvement of 23%, 17% and 79% motion planning success rate over state of the art sampling, optimization and learning based planning methods. Video results available at this http URL

[CV-2] Promptable Closed-loop Traffic Simulation

链接: https://arxiv.org/abs/2409.05863
作者: Shuhan Tan,Boris Ivanovic,Yuxiao Chen,Boyi Li,Xinshuo Weng,Yulong Cao,Philipp Krähenbühl,Marco Pavone
关键词-EN: autonomous driving development, efficient autonomous driving, cornerstone for safe, safe and efficient, efficient autonomous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted to CoRL 2024. Website available at this https URL

点击查看摘要

Abstract:Simulation stands as a cornerstone for safe and efficient autonomous driving development. At its core a simulation system ought to produce realistic, reactive, and controllable traffic patterns. In this paper, we propose ProSim, a multimodal promptable closed-loop traffic simulation framework. ProSim allows the user to give a complex set of numerical, categorical or textual prompts to instruct each agent’s behavior and intention. ProSim then rolls out a traffic scenario in a closed-loop manner, modeling each agent’s interaction with other traffic participants. Our experiments show that ProSim achieves high prompt controllability given different user prompts, while reaching competitive performance on the Waymo Sim Agents Challenge when no prompt is given. To support research on promptable traffic simulation, we create ProSim-Instruct-520k, a multimodal prompt-scenario paired driving dataset with over 10M text prompts for over 520k real-world driving scenarios. We will release code of ProSim as well as data and labeling tools of ProSim-Instruct-520k at this https URL.

[CV-3] Evaluating Multiview Object Consistency in Humans and Image Models

链接: https://arxiv.org/abs/2409.05862
作者: Tyler Bonnen,Stephanie Fu,Yutong Bai,Thomas O’Connell,Yoni Friedman,Nancy Kanwisher,Joshua B. Tenenbaum,Alexei A. Efros
关键词-EN: introduce a benchmark, benchmark to directly, shape inference task, directly evaluate, evaluate the alignment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: https://tzler.github.io/MOCHI/ Code: this https URL Huggingface dataset: this https URL

点击查看摘要

Abstract:We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated `nonsense’ objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.

[CV-4] LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.05847
作者: Henghui Ding,Lingyi Hong,Chang Liu,Ning Xu,Linjie Yang,Yuchen Fan,Deshui Miao,Yameng Gu,Xin Li,Zhenyu He,Yaowei Wang,Ming-Hsuan Yang,Jinming Chai,Qin Ma,Junpei Zhang,Licheng Jiao,Fang Liu,Xinyu Liu,Jing Zhang,Kexin Zhang,Xu Liu,LingLing Li,Hao Fang,Feiyu Pan,Xiankai Lu,Wei Zhang,Runmin Cong,Tuyen Tran,Bin Cao,Yisi Zhang,Hanyi Wang,Xingjian He,Jing Liu
关键词-EN: Video Object Segmentation, Large-scale Video Object, video segmentation models, current video segmentation, Video Object
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 LSVOS Challenge Report: this https URL

点击查看摘要

Abstract:Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes. In this paper, we introduce the 6th Large-scale Video Object Segmentation (LSVOS) challenge in conjunction with ECCV 2024 workshop. This year’s challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). This year, we replace the classic YouTube-VOS and YouTube-RVOS benchmarks with the latest datasets MOSE, LVOS, and MeViS to assess VOS under more challenging complex environments. This year’s challenge attracted 129 registered teams from more than 20 institutes across over 8 countries. This report includes the challenge and dataset introduction, and the methods used by the top 7 teams in two tracks. More details can be found in our homepage this https URL.

[CV-5] Vision-Driven 2D Supervised Fine-Tuning Framework for Birds Eye View Perception

链接: https://arxiv.org/abs/2409.05834
作者: Lei He,Qiaoyi Wang,Honglin Sun,Qing Xu,Bolin Gao,Shengbo Eben Li,Jianqiang Wang,Keqiang Li
关键词-EN: bird eye view, progressively replacing costly, replacing costly LiDAR-based, Visual bird eye, urban intelligent driving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual bird’s eye view (BEV) perception, due to its excellent perceptual capabilities, is progressively replacing costly LiDAR-based perception systems, especially in the realm of urban intelligent driving. However, this type of perception still relies on LiDAR data to construct ground truth databases, a process that is both cumbersome and time-consuming. Moreover, most mass-produced autonomous driving systems are only equipped with surround camera sensors and lack LiDAR data for precise annotation. To tackle this challenge, we propose a fine-tuning method for BEV perception network based on visual 2D semantic perception, aimed at enhancing the model’s generalization capabilities in new scene data. Considering the maturity and development of 2D perception technologies, our method significantly reduces the dependency on high-cost BEV ground truths and shows promising industrial application prospects. Extensive experiments and comparative analyses conducted on the nuScenes and Waymo public datasets demonstrate the effectiveness of our proposed method.

[CV-6] GASP: Gaussian Splatting for Physic-Based Simulations

链接: https://arxiv.org/abs/2409.05819
作者: Piotr Borycki,Weronika Smolak,Joanna Waczyńska,Marcin Mazur,Sławomir Tadeja,Przemysław Spurek
关键词-EN: real-world applications, Gaussian Splatting, Gaussian, Gaussian components, Physics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Physics simulation is paramount for modeling and utilization of 3D scenes in various real-world applications. However, its integration with state-of-the-art 3D scene rendering techniques such as Gaussian Splatting (GS) remains challenging. Existing models use additional meshing mechanisms, including triangle or tetrahedron meshing, marching cubes, or cage meshes. As an alternative, we can modify the physics-grounded Newtonian dynamics to align with 3D Gaussian components. Current models take the first-order approximation of a deformation map, which locally approximates the dynamics by linear transformations. In contrast, our Gaussian Splatting for Physics-Based Simulations (GASP) model uses such a map (without any modifications) and flat Gaussian distributions, which are parameterized by three points (mesh faces). Subsequently, each 3D point (mesh face node) is treated as a discrete entity within a 3D space. Consequently, the problem of modeling Gaussian components is reduced to working with 3D points. Additionally, the information on mesh faces can be used to incorporate further properties into the physics model, facilitating the use of triangles. The resulting solution can be integrated into any physics engine that can be treated as a black box. As demonstrated in our studies, the proposed model exhibits superior performance on a diverse range of benchmark datasets designed for 3D object rendering.

[CV-7] VFA: Vision Frequency Analysis of Foundation Models and Human

链接: https://arxiv.org/abs/2409.05817
作者: Mohammad-Javad Darvishi-Bayazi,Md Rifat Arefin,Jocelyn Faubert,Irina Rish
关键词-EN: Machine learning models, exhibit robust adaptation, Machine learning, humans exhibit robust, real-world scenarios
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Machine learning models often struggle with distribution shifts in real-world scenarios, whereas humans exhibit robust adaptation. Models that better align with human perception may achieve higher out-of-distribution generalization. In this study, we investigate how various characteristics of large-scale computer vision models influence their alignment with human capabilities and robustness. Our findings indicate that increasing model and data size and incorporating rich semantic information and multiple modalities enhance models’ alignment with human perception and their overall robustness. Our empirical analysis demonstrates a strong correlation between out-of-distribution accuracy and human alignment.

[CV-8] Input Space Mode Connectivity in Deep Neural Networks

链接: https://arxiv.org/abs/2409.05800
作者: Jakub Vrabel,Ori Shem-Ur,Yaron Oz,David Krueger
关键词-EN: deep neural networks, loss landscape mode, landscape mode connectivity, mode connectivity, extend the concept
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We extend the concept of loss landscape mode connectivity to the input space of deep neural networks. Mode connectivity was originally studied within parameter space, where it describes the existence of low-loss paths between different solutions (loss minimizers) obtained through gradient descent. We present theoretical and empirical evidence of its presence in the input space of deep networks, thereby highlighting the broader nature of the phenomenon. We observe that different input images with similar predictions are generally connected, and for trained models, the path tends to be simple, with only a small deviation from being a linear path. Our methodology utilizes real, interpolated, and synthetic inputs created using the input optimization technique for feature visualization. We conjecture that input space mode connectivity in high-dimensional spaces is a geometric effect that takes place even in untrained models and can be explained through percolation theory. We exploit mode connectivity to obtain new insights about adversarial examples and demonstrate its potential for adversarial detection. Additionally, we discuss applications for the interpretability of deep networks.
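
As a minimal numerical illustration of linear input-space connectivity, the sketch below interpolates between two inputs of a toy linear classifier (a stand-in for the deep networks studied in the paper; all data are synthetic) and checks that the prediction never changes along the straight path:

```python
import numpy as np

def predict(model_w, x):
    """Toy linear 'model': predicted class from fixed weights (a stand-in
    for a trained deep network)."""
    return int(np.argmax(model_w @ x))

def path_predictions(model_w, x_a, x_b, steps=11):
    """Predictions along the straight line between two inputs. If the
    class never changes, the inputs are linearly mode-connected for
    this model."""
    return [predict(model_w, (1 - a) * x_a + a * x_b)
            for a in np.linspace(0.0, 1.0, steps)]

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))    # 3 classes, 8-dimensional inputs
x_a = rng.standard_normal(8)
x_b = 1.5 * x_a                    # same prediction by construction

preds = path_predictions(W, x_a, x_b)
print(len(set(preds)) == 1)  # True: the path never crosses a decision boundary
```

For a linear model this is provable (class-score differences are linear in the interpolation coefficient, so their signs cannot flip between same-sign endpoints); the paper's empirical finding is that trained deep networks behave almost this simply, with input-space paths deviating only slightly from linear.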

[CV-9] Leveraging Object Priors for Point Tracking ECCV2024

链接: https://arxiv.org/abs/2409.05786
作者: Bikram Boote,Anh Thai,Wenqi Jia,Ozgur Kara,Stefan Stojanov,James M. Rehg,Sangmin Lee
关键词-EN: Point tracking, fundamental problem, problem in computer, computer vision, vision with numerous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: ECCV 2024 ILR Workshop

点击查看摘要

Abstract:Point tracking is a fundamental problem in computer vision with numerous applications in AR and robotics. A common failure mode in long-term point tracking occurs when the predicted point leaves the object it belongs to and lands on the background or another object. We identify this as the failure to correctly capture objectness properties in learning to track. To address this limitation of prior work, we propose a novel objectness regularization approach that guides points to be aware of object priors by forcing them to stay inside the boundaries of object instances. By capturing objectness cues at training time, we avoid the need to compute object masks during testing. In addition, we leverage contextual attention to enhance the feature representation for capturing objectness at the feature level more effectively. As a result, our approach achieves state-of-the-art performance on three point tracking benchmarks, and we further validate the effectiveness of our components via ablation studies. The source code is available at: this https URL
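
The objectness idea, keeping a tracked point inside its instance mask, can be sketched as a simple penalty term. This is a simplified stand-in for the paper's training-time regularizer; the mask and points below are synthetic:

```python
import numpy as np

def objectness_penalty(point, mask):
    """Zero inside the instance mask; distance to the nearest mask pixel
    outside. A simplified stand-in for an objectness regularizer that
    discourages tracked points from drifting off their object."""
    x, y = point
    if mask[y, x]:
        return 0.0
    ys, xs = np.nonzero(mask)
    return float(np.sqrt((xs - x) ** 2 + (ys - y) ** 2).min())

mask = np.zeros((10, 10), dtype=bool)
mask[2:6, 2:6] = True  # a 4x4 object instance

print(objectness_penalty((3, 3), mask))  # 0.0: point stays on the object
print(objectness_penalty((8, 3), mask))  # 3.0: point drifted onto the background
```

Because the penalty is only needed to shape training, no mask computation is required at test time, matching the abstract's claim.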

[CV-10] Creativity and Visual Communication from Machine to Musician: Sharing a Score through a Robotic Camera

链接: https://arxiv.org/abs/2409.05773
作者: Ross Greer,Laura Fleig,Shlomo Dubnov
关键词-EN: Guided Harmony, interaction by implementing, Guided, Harmony, paper explores
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper explores the integration of visual communication and musical interaction by implementing a robotic camera within a “Guided Harmony” musical game. We aim to examine co-creative behaviors between human musicians and robotic systems. Our research explores existing methodologies like improvisational game pieces and extends these concepts to include robotic participation using a PTZ camera. The robotic system interprets and responds to nonverbal cues from musicians, creating a collaborative and adaptive musical experience. This initial case study underscores the importance of intuitive visual communication channels. We also propose future research directions, including parameters for refining the visual cue toolkit and data collection methods to understand human-machine co-creativity further. Our findings contribute to the broader understanding of machine intelligence in augmenting human creativity, particularly in musical settings.

[CV-11] ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL

链接: https://arxiv.org/abs/2409.05749
作者: Safwen Naimi,Wassim Bouachir,Guillaume-Alexandre Bilodeau
关键词-EN: challenging task hindered, action recognition features, skeleton action recognition, generalizable skeleton action, large amounts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures, 6 tables

点击查看摘要

Abstract:To extract robust and generalizable skeleton action recognition features, large amounts of well-curated data are typically required, which is a challenging task hindered by annotation and computation costs. Therefore, unsupervised representation learning is of prime importance to leverage unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a lightweight convolutional transformer framework, named ReL-SAR, exploiting the complementarity of convolutional and attention layers for jointly modeling spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to ensure more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieved very competitive results on limited-size datasets: MCAD, IXMAS, JHMDB, and NW-UCLA, showing the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code including all implementation parameters is provided at: this https URL
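
The BYOL component named in the abstract relies on a gradient-free target network updated as an exponential moving average (EMA) of the online network. A minimal sketch of that update rule (the momentum value 0.99 is illustrative, not necessarily what ReL-SAR uses):

```python
import numpy as np

def ema_update(target, online, tau=0.99):
    """BYOL-style target update: target <- tau*target + (1-tau)*online.

    The target network receives no gradients; it slowly tracks the
    online network, providing stable regression targets for
    self-supervised learning. tau=0.99 is an illustrative momentum.
    """
    return {k: tau * target[k] + (1.0 - tau) * online[k] for k in target}

online = {"w": np.ones((2, 2))}   # stand-in for online-network parameters
target = {"w": np.zeros((2, 2))}  # target starts apart from the online net

for _ in range(100):
    target = ema_update(target, online)

print(round(float(target["w"][0, 0]), 3))  # slowly approaches 1.0
```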

[CV-12] Robust Loss Functions for Object Grasping under Limited Ground Truth

链接: https://arxiv.org/abs/2409.05742
作者: Yangfan Deng,Mengyao Zhang,Yong Zhao
关键词-EN: crucial technology enabling, technology enabling robots, environment sufficiently, Object grasping, crucial technology
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Object grasping is a crucial technology enabling robots to perceive and interact with the environment sufficiently. However, in practical applications, researchers are faced with missing or noisy ground truth while training the convolutional neural network, which decreases the accuracy of the model. Therefore, different loss functions are proposed to deal with these problems to improve the accuracy of the neural network. For missing ground truth, a new predicted category probability method is defined for unlabeled samples, which works effectively in conjunction with the pseudo-labeling method. Furthermore, for noisy ground truth, a symmetric loss function is introduced to resist the corruption of label noises. The proposed loss functions are powerful, robust, and easy to use. Experimental results based on the typical grasping neural network show that our method can improve performance by 2 to 13 percent.
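
The abstract does not give the exact form of its symmetric loss, so as a hedged illustration, the widely used symmetric cross-entropy (standard CE plus a reverse term) shows the general mechanism for bounding the influence of noisy labels:

```python
import numpy as np

def symmetric_cross_entropy(p, q, alpha=1.0, beta=1.0, eps=1e-4):
    """Symmetric cross-entropy: CE(p, q) plus reverse CE(q, p).

    p: one-hot target distribution, q: predicted distribution. The
    reverse term bounds the gradient contributed by a wrong (noisy)
    label. Generic formulation for illustration; the paper's exact
    loss may differ.
    """
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    ce = -np.sum(p * np.log(q))   # standard cross-entropy
    rce = -np.sum(q * np.log(p))  # reverse cross-entropy
    return alpha * ce + beta * rce

target = np.array([1.0, 0.0, 0.0])   # (possibly noisy) one-hot label
good = np.array([0.9, 0.05, 0.05])   # prediction agreeing with the label
bad = np.array([0.05, 0.9, 0.05])    # prediction contradicting the label

print(symmetric_cross_entropy(target, good) < symmetric_cross_entropy(target, bad))  # True
```

The clipping constant `eps` keeps the reverse term finite on the zero entries of the one-hot label, which is what gives the loss its bounded, noise-robust behavior.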

[CV-13] Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding

链接: https://arxiv.org/abs/2409.05721
作者: Bram Willemsen,Gabriel Skantze
关键词-EN: referring expression generation, produce referring expressions, visually grounded dialogue, expression generation, referring expression
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication at INLG 2024

点击查看摘要

Abstract:We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.
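
The generate-and-rerank step can be sketched generically: candidate REs from the generator are reordered by an external comprehension score. The candidates and scores below are invented purely for illustration:

```python
def rerank(candidates, comprehension_score):
    """Generate-and-rerank: reorder the generator's candidate referring
    expressions by an external comprehension score (highest first)."""
    return sorted(candidates, key=comprehension_score, reverse=True)

# Invented candidates with an invented discriminativeness score: the
# fraction of distractor referents each expression rules out.
discriminativeness = {
    "the mug": 0.3,                     # ambiguous: several mugs in view
    "the red mug": 0.7,
    "the red mug we discussed": 0.95,   # unique and discourse-appropriate
}

ranked = rerank(list(discriminativeness), discriminativeness.get)
print(ranked[0])  # "the red mug we discussed"
```

In the paper the score is discourse-dependent discriminatory power from a comprehension model; here it is a hand-written dictionary standing in for that signal.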

[CV-14] Boosting CNN-based Handwriting Recognition Systems with Learnable Relaxation Labeling

链接: https://arxiv.org/abs/2409.05699
作者: Sara Ferro,Alessandro Torcinovich,Arianna Traviglia,Marcello Pelillo
关键词-EN: long-range contextual dependencies, managing long-range contextual, primary challenge, lies in managing, managing long-range
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 26 pages, 3 figures

点击查看摘要

Abstract:The primary challenge for handwriting recognition systems lies in managing long-range contextual dependencies, an issue that traditional models often struggle with. To mitigate it, attention mechanisms have recently been employed to enhance context-aware labelling, thereby achieving state-of-the-art performance. In the field of pattern recognition and image analysis, however, the use of contextual information in labelling problems has a long history and goes back at least to the early 1970’s. Among the various approaches developed in those years, Relaxation Labelling (RL) processes have played a prominent role and have been the method of choice in the field for more than a decade. Contrary to recent transformer-based architectures, RL processes offer a principled approach to the use of contextual constraints, having a solid theoretic foundation grounded on variational inequality and game theory, as well as effective algorithms with convergence guarantees. In this paper, we propose a novel approach to handwriting recognition that integrates the strengths of two distinct methodologies. In particular, we propose integrating (trainable) RL processes with various well-established neural architectures and we introduce a sparsification technique that accelerates the convergence of the algorithm and enhances the overall system’s performance. Experiments over several benchmark datasets show that RL processes can improve the generalisation ability, even surpassing in some cases transformer-based architectures.
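
A classic (non-trainable) relaxation labelling update of the kind the paper builds on multiplies each label probability by its contextual support and renormalizes. A toy two-object example, illustrative of the RL process rather than the paper's trainable variant:

```python
import numpy as np

def relaxation_step(p, r):
    """One classic relaxation-labelling update.

    p: (n, L) label probabilities per object.
    r: (n, L, n, L) pairwise label compatibilities in [-1, 1].
    Each probability is scaled by (1 + contextual support) and
    renormalized, so mutually compatible labels reinforce each other.
    """
    q = np.einsum('iajb,jb->ia', r, p)  # contextual support per label
    unnorm = p * (1.0 + q)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

# Two objects, two labels; agreement is supported, disagreement penalized.
n, L = 2, 2
r = np.zeros((n, L, n, L))
for i in range(n):
    for j in range(n):
        if i != j:
            r[i, 0, j, 0] = r[i, 1, j, 1] = 0.5
            r[i, 0, j, 1] = r[i, 1, j, 0] = -0.5

p = np.array([[0.6, 0.4], [0.55, 0.45]])  # mild initial preference for label 0
for _ in range(20):
    p = relaxation_step(p, r)

print(np.round(p, 2))  # both objects converge toward label 0
```

The mild initial bias is amplified by the compatibility constraints until both objects agree, which is the context-propagation behavior the paper combines with neural architectures.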

[CV-15] Segmentation by Factorization: Unsupervised Semantic Segmentation for Pathology by Factorizing Foundation Model Features

链接: https://arxiv.org/abs/2409.05697
作者: Jacob Gildenblat,Ofir Hadar
关键词-EN: deep learning models, deep learning, pre-trained deep learning, Segmentation, Genome Atlas Program
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Segmentation by Factorization (F-SEG), an unsupervised segmentation method for pathology that generates segmentation masks from pre-trained deep learning models. F-SEG allows the use of pre-trained deep neural networks, including recently developed pathology foundation models, for semantic segmentation. It achieves this without requiring additional training or finetuning, by factorizing the spatial features extracted by the models into segmentation masks and their associated concept features. We create generic tissue phenotypes for H&E images by training clustering models for multiple numbers of clusters on features extracted from several deep learning models on The Cancer Genome Atlas Program (TCGA), and then show how the clusters can be used for factorizing corresponding segmentation masks using off-the-shelf deep learning models. Our results show that F-SEG provides robust unsupervised segmentation capabilities for H&E pathology images, and that the segmentation quality is greatly improved by utilizing pathology foundation models. We discuss and propose methods for evaluating the performance of unsupervised segmentation in pathology.
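
The core factorization step, clustering per-pixel features into concept masks, can be sketched with a plain k-means loop. The features here are synthetic; the paper clusters foundation-model features extracted from TCGA slides:

```python
import numpy as np

def factorize_to_masks(features, k, iters=10):
    """Cluster per-pixel features into k concept masks.

    A plain k-means loop standing in for the clustering models the
    paper trains on foundation-model features. features: (H, W, C);
    returns an (H, W) integer mask of cluster assignments.
    """
    H, W, C = features.shape
    X = features.reshape(-1, C)
    # Deterministic init: spread initial centers across the pixel list.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(0)
    return assign.reshape(H, W)

# Synthetic "feature map": left and right halves carry distinct features,
# mimicking two tissue phenotypes in a slide.
feat = np.zeros((8, 8, 2))
feat[:, :4] = [1.0, 0.0]
feat[:, 4:] = [0.0, 1.0]
mask = factorize_to_masks(feat, k=2)
print(len(np.unique(mask)))  # 2: one mask per phenotype-like region
```

Because the cluster centers double as concept features, the same factorization yields both the masks and their associated concepts, which is what lets F-SEG skip finetuning.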

[CV-16] LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow ECCV2024

链接: https://arxiv.org/abs/2409.05688
作者: Hongyu Wen,Erich Liang,Jia Deng
关键词-EN: existing algorithms struggle, optical flow, non-Lambertian objects, objects, algorithms struggle
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Achieving 3D understanding of non-Lambertian objects is an important task with many useful applications, but most existing algorithms struggle to deal with such objects. One major obstacle towards progress in this field is the lack of holistic non-Lambertian benchmarks – most benchmarks have low scene and object diversity, and none provide multi-layer 3D annotations for objects occluded by transparent surfaces. In this paper, we introduce LayeredFlow, a real world benchmark containing multi-layer ground truth annotation for optical flow of non-Lambertian objects. Compared to previous benchmarks, our benchmark exhibits greater scene and object diversity, with 150k high quality optical flow and stereo pairs taken over 185 indoor and outdoor scenes and 360 unique objects. Using LayeredFlow as evaluation data, we propose a new task called multi-layer optical flow. To provide training data for this task, we introduce a large-scale densely-annotated synthetic dataset containing 60k images within 30 scenes tailored for non-Lambertian objects. Training on our synthetic dataset enables model to predict multi-layer optical flow, while fine-tuning existing optical flow methods on the dataset notably boosts their performance on non-Lambertian objects without compromising the performance on diffuse objects. Data is available at this https URL.

[CV-17] SX-Stitch: An Efficient VMS-UNet Based Framework for Intraoperative Scoliosis X-Ray Image Stitching

链接: https://arxiv.org/abs/2409.05681
作者: Yi Li,Heting Gao,Mingde He,Jinqian Liang,Jason Gu,Wei Liu
关键词-EN: C-arm X-ray machine, X-ray machine restricts, robust intraoperative X-ray, intraoperative X-ray image, surgeons’ holistic analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In scoliosis surgery, the limited field of view of the C-arm X-ray machine restricts the surgeons’ holistic analysis of spinal structures. This paper presents an end-to-end efficient and robust intraoperative X-ray image stitching method for scoliosis surgery, named SX-Stitch. The method is divided into two stages: segmentation and stitching. In the segmentation stage, we propose a medical image segmentation model named Vision Mamba of Spine-UNet (VMS-UNet), which utilizes the state space Mamba to capture long-distance contextual information while maintaining linear computational complexity, and incorporates the SimAM attention mechanism, significantly improving the segmentation accuracy. In the stitching stage, we simplify the alignment process between images to the minimization of a registration energy function. The total energy function is then optimized to order unordered images, and a hybrid energy function is introduced to optimize the best seam, effectively eliminating parallax artifacts. On the clinical dataset, SX-Stitch demonstrates superiority over SOTA schemes both qualitatively and quantitatively.
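
The idea of ordering unordered images by minimizing a registration energy can be illustrated on toy image strips. The energy and brute-force search below are drastic simplifications of the paper's formulation, which optimizes a global energy rather than enumerating orderings:

```python
import numpy as np
from itertools import permutations

def pair_energy(left, right, overlap=2):
    """Toy registration energy: squared difference over an assumed
    fixed-width overlap between the right edge of one strip and the
    left edge of the next."""
    return float(((left[:, -overlap:] - right[:, :overlap]) ** 2).sum())

def order_by_energy(strips, overlap=2):
    """Pick the ordering with minimal total pairwise energy by brute
    force (fine for a toy example; impractical for real stitching)."""
    def total(order):
        return sum(pair_energy(strips[a], strips[b], overlap)
                   for a, b in zip(order, order[1:]))
    return min(permutations(range(len(strips))), key=total)

full = np.tile(np.arange(12.0), (4, 1))                  # a horizontal ramp "image"
s0, s1, s2 = full[:, 0:6], full[:, 4:10], full[:, 8:12]  # strips with 2-column overlaps
print(order_by_energy([s1, s2, s0]))  # (2, 0, 1): restores left-to-right order
```

Matching overlaps contribute zero energy, so the true left-to-right order is the unique minimizer; the correct arrangement falls out of the energy alone, with no ordering labels.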

[CV-18] AnomalyCD: A benchmark for Earth anomaly change detection with high-resolution and time-series observations

链接: https://arxiv.org/abs/2409.05679
作者: Jingtao Li,Qian Zhu,Xinyu Wang,Hengwei Zhao,Yanfei Zhong
关键词-EN: balanced state, destroyed the stable, resulting in fatalities, destruction of property, Earth anomalies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: remote sensing benchmark

点击查看摘要

Abstract:Various Earth anomalies have destroyed the stable, balanced state, resulting in fatalities and serious destruction of property. With the advantages of large-scale and precise observation, high-resolution remote sensing images have been widely used for anomaly monitoring and localization. Powered by the deep representation, the existing methods have achieved remarkable advances, primarily in classification and change detection techniques. However, labeled samples are difficult to acquire due to the low probability of anomaly occurrence, and the trained models are limited to fixed anomaly categories, which hinders the application for anomalies with few samples or unknown anomalies. In this paper, to tackle this problem, we propose the anomaly change detection (AnomalyCD) technique, which accepts time-series observations and learns to identify anomalous changes by learning from the historical normal change pattern. Compared to the existing techniques, AnomalyCD processes an unfixed number of time steps and can localize the various anomalies in a unified manner, without human supervision. To benchmark AnomalyCD, we constructed a high-resolution dataset with time-series images dedicated to various Earth anomalies (the AnomalyCDD dataset). AnomalyCDD contains high-resolution (from 0.15 to 2.39 m/pixel), time-series (from 3 to 7 time steps), and large-scale images (1927.93 km² in total) collected globally. Furthermore, we developed a zero-shot baseline model (AnomalyCDM), which implements the AnomalyCD technique by extracting a general representation from the segment anything model (SAM) and conducting temporal comparison to distinguish the anomalous changes from normal changes. AnomalyCDM is designed as a two-stage workflow to enhance the efficiency, and has the ability to process the unseen images directly, without retraining for each scene.

[CV-19] Real-Time Human Action Recognition on Embedded Platforms

链接: https://arxiv.org/abs/2409.05662
作者: Ruiqi Wang,Zichen Wang,Peiqi Gao,Mingzhen Li,Jaehwan Jeong,Yihang Xu,Yejin Lee,Lisa Connor,Chenyang Lu
关键词-EN: video-based human action, human action recognition, motion feature extractor, video-based human, Integrated Motion Feature
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With advancements in computer vision and deep learning, video-based human action recognition (HAR) has become practical. However, due to the complexity of the computation pipeline, running HAR on live video streams incurs excessive delays on embedded platforms. This work tackles the real-time performance challenges of HAR with four contributions: 1) an experimental study identifying a standard Optical Flow (OF) extraction technique as the latency bottleneck in a state-of-the-art HAR pipeline, 2) an exploration of the latency-accuracy tradeoff between the standard and deep learning approaches to OF extraction, which highlights the need for a novel, efficient motion feature extractor, 3) the design of Integrated Motion Feature Extractor (IMFE), a novel single-shot neural network architecture for motion feature extraction with drastic improvement in latency, 4) the development of RT-HARE, a real-time HAR system tailored for embedded platforms. Experimental results on an Nvidia Jetson Xavier NX platform demonstrated that RT-HARE realizes real-time HAR at a video frame rate of 30 frames per second while delivering high levels of recognition accuracy.

[CV-20] Replay Consolidation with Label Propagation for Continual Object Detection

链接: https://arxiv.org/abs/2409.05650
作者: Riccardo De Monte,Davide Dalle Pezze,Marina Ceccon,Francesco Pasti,Francesco Paissan,Elisabetta Farella,Gian Antonio Susto,Nicola Bellotto
关键词-EN: highly relevant computer, relevant computer vision, computer vision problem, Object Detection, Continual Learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Object Detection is a highly relevant computer vision problem with many applications such as robotics and autonomous driving. Continual Learning (CL) considers a setting where a model incrementally learns new information while retaining previously acquired knowledge. This is particularly challenging since Deep Learning models tend to catastrophically forget old knowledge while training on new data. In particular, Continual Learning for Object Detection (CLOD) poses additional difficulties compared to CL for Classification. In CLOD, images from previous tasks may contain unknown classes that could reappear labeled in future tasks. These missing annotations cause task interference issues for replay-based approaches. As a result, most works in the literature have focused on distillation-based approaches. However, these approaches are effective only when there is a strong overlap of classes across tasks. To address the issues of current methodologies, we propose a novel technique to solve CLOD called Replay Consolidation with Label Propagation for Object Detection (RCLPOD). Based on the replay method, our solution avoids task interference issues by enhancing the buffer memory samples. Our method is evaluated against existing techniques in CLOD literature, demonstrating its superior performance on established benchmarks like VOC and COCO.
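
One way to read "enhancing the buffer memory samples" is label propagation: enriching a replayed sample's annotations with the current model's confident detections for classes learned after the sample was stored. A sketch of that idea, in which all names and the detector are hypothetical:

```python
def propagate_labels(sample, detector, new_classes, score_thr=0.5):
    """Add the current model's confident detections of newly learned
    classes to a replayed sample's ground truth, so replay no longer
    treats those objects as background. `detector` is any callable
    returning (class_name, score, box) tuples."""
    for cls, score, box in detector(sample["image"]):
        if cls in new_classes and score >= score_thr:
            sample["boxes"].append((cls, box))
    return sample

# Hypothetical buffered sample, annotated only with "car" during task 1.
sample = {"image": "img_001", "boxes": [("car", (10, 10, 50, 40))]}

def mock_detector(image):
    # Stand-in for the current model: one confident and one weak
    # detection of a class introduced in task 2.
    return [("person", 0.9, (60, 12, 80, 48)), ("person", 0.3, (0, 0, 5, 5))]

sample = propagate_labels(sample, mock_detector, new_classes={"person"})
print([c for c, _ in sample["boxes"]])  # ['car', 'person']
```

Filling in the missing annotations is what removes the task-interference problem the abstract attributes to vanilla replay.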

[CV-21] Prototype-Driven Multi-Feature Generation for Visible-Infrared Person Re-identification

链接: https://arxiv.org/abs/2409.05642
作者: Jiarui Li,Zhen Qiu,Yilin Yang,Yuqi Li,Zeyu Dong,Chuanguang Yang
关键词-EN: visible-infrared person re-identification, person re-identification arise, including inter-modal, intra-modal variations, visible-infrared person
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages

点击查看摘要

Abstract:The primary challenges in visible-infrared person re-identification arise from the differences between visible (vis) and infrared (ir) images, including inter-modal and intra-modal variations. These challenges are further complicated by varying viewpoints and irregular movements. Existing methods often rely on horizontal partitioning to align part-level features, which can introduce inaccuracies and have limited effectiveness in reducing modality discrepancies. In this paper, we propose a novel Prototype-Driven Multi-feature generation framework (PDM) aimed at mitigating cross-modal discrepancies by constructing diversified features and mining latent semantically similar features for modal alignment. PDM comprises two key components: Multi-Feature Generation Module (MFGM) and Prototype Learning Module (PLM). The MFGM generates diversity features closely distributed from modality-shared features to represent pedestrians. Additionally, the PLM utilizes learnable prototypes to excavate latent semantic similarities among local features between visible and infrared modalities, thereby facilitating cross-modal instance-level alignment. We introduce the cosine heterogeneity loss to enhance prototype diversity for extracting rich local features. Extensive experiments conducted on the SYSU-MM01 and LLCM datasets demonstrate that our approach achieves state-of-the-art performance. Our codes are available at this https URL.
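
Prototype-based instance-level alignment can be illustrated by assigning local features from both modalities to their most cosine-similar prototype; matching assignments indicate cross-modal agreement. The vectors below are toy values, not the learned PDM prototypes:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_to_prototypes(feats, prototypes):
    """Assign each local feature to its most cosine-similar prototype.

    When corresponding visible and infrared features land on the same
    prototype, they can be aligned at the instance level despite the
    modality gap.
    """
    return [max(range(len(prototypes)), key=lambda k: cosine(f, prototypes[k]))
            for f in feats]

protos = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]     # toy prototypes
vis_feats = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]  # visible-modality features
ir_feats = [np.array([0.8, 0.3]), np.array([0.1, 0.9])]   # infrared-modality features

print(assign_to_prototypes(vis_feats, protos))  # [0, 1]
print(assign_to_prototypes(ir_feats, protos))   # [0, 1]: same alignment
```

The paper's cosine heterogeneity loss additionally pushes the prototypes apart, so that different local patterns map to different prototypes rather than collapsing onto one.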

[CV-22] 3D-SAR Tomography and Machine Learning for High-Resolution Tree Height Estimation

链接: https://arxiv.org/abs/2409.05636
作者: Grace Colverd,Jumpei Takami,Laura Schade,Karol Bot,Joseph A. Gallego-Mejia
关键词-EN: Accurately estimating forest, Synthetic Aperture Radar, Accurately estimating, climate change mitigation, change mitigation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurately estimating forest biomass is crucial for global carbon cycle modelling and climate change mitigation. Tree height, a key factor in biomass calculations, can be measured using Synthetic Aperture Radar (SAR) technology. This study applies machine learning to extract forest height data from two SAR products: Single Look Complex (SLC) images and tomographic cubes, in preparation for the ESA Biomass Satellite mission. We use the TomoSense dataset, containing SAR and LiDAR data from Germany’s Eifel National Park, to develop and evaluate height estimation models. Our approach includes classical methods, deep learning with a 3D U-Net, and Bayesian-optimized techniques. By testing various SAR frequencies and polarimetries, we establish a baseline for future height and biomass modelling. Best-performing models predict forest height to be within 2.82m mean absolute error for canopies around 30m, advancing our ability to measure global carbon stocks and support climate action.

[CV-23] Renormalized Connection for Scale-preferred Object Detection in Satellite Imagery

链接: https://arxiv.org/abs/2409.05624
作者: Fan Zhang,Lingling Li,Licheng Jiao,Xu Liu,Fang Liu,Shuyuan Yang,Biao Hou
关键词-EN: small objects, Knowledge Discovery Network, Satellite imagery, long-range imaging, making the precise
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 pages, 14 figures Journal

点击查看摘要

Abstract:Satellite imagery, due to its long-range imaging, brings with it a variety of scale-preferred tasks, such as the detection of tiny/small objects, making the precise localization and detection of small objects of interest a challenging task. In this article, we design a Knowledge Discovery Network (KDN) to implement the renormalization group theory in terms of efficient feature extraction. Renormalized connection (RC) on the KDN enables “synergistic focusing” of multi-scale features. Based on our observations of KDN, we abstract a class of RCs with different connection strengths, called n21C, and generalize it to FPN-based multi-branch detectors. In a series of FPN experiments on the scale-preferred tasks, we found that the “divide-and-conquer” idea of FPN severely hampers the detector’s learning in the right direction due to the large number of large-scale negative samples and interference from background noise. Moreover, these negative samples cannot be eliminated by the focal loss function. The RCs extend the multi-level feature’s “divide-and-conquer” mechanism of the FPN-based detectors to a wide range of scale-preferred tasks, and enable synergistic effects of multi-level features on the specific learning goal. In addition, interference activations in two aspects are greatly reduced and the detector learns in a more correct direction. Extensive experiments of 17 well-designed detection architectures embedded with n21Cs on five different levels of scale-preferred tasks validate the effectiveness and efficiency of the RCs. Especially the simplest linear form of RC, E421C, performs well in all tasks and it satisfies the scaling property of RGT. We hope that our approach will transfer a large number of well-designed detectors from the computer vision community to the remote sensing community.

[CV-24] G-NeLF: Memory- and Data-Efficient Hybrid Neural Light Field for Novel View Synthesis

链接: https://arxiv.org/abs/2409.05617
作者: Lutao Jiang,Lin Wang
关键词-EN: Neural Light Field, Neural Radiance Field, Light Field, Neural Light, implicit neural representation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Following the burgeoning interest in implicit neural representation, Neural Light Field (NeLF) has been introduced to predict the color of a ray directly. Unlike Neural Radiance Field (NeRF), NeLF does not create a point-wise representation by predicting color and volume density for each point in space. However, the current NeLF methods face a challenge as they need to train a NeRF model first and then synthesize over 10K views to train NeLF for improved performance. Additionally, the rendering quality of NeLF methods is lower compared to NeRF methods. In this paper, we propose G-NeLF, a versatile grid-based NeLF approach that utilizes spatial-aware features to unleash the potential of the neural network’s inference capability, and consequently overcome the difficulties of NeLF training. Specifically, we employ a spatial-aware feature sequence derived from a meticulously crafted grid as the ray’s representation. Drawing from our empirical studies on the adaptability of multi-resolution hash tables, we introduce a novel grid-based ray representation for NeLF that can represent the entire space with a very limited number of parameters. To better utilize the sequence feature, we design a lightweight ray color decoder that simulates the ray propagation process, enabling a more efficient inference of the ray’s color. G-NeLF can be trained without significant storage overhead and, with a model size of only 0.95 MB, surpasses the previous state-of-the-art NeLF. Moreover, compared with grid-based NeRF methods, e.g., Instant-NGP, we only utilize one-tenth of its parameters to achieve higher performance. Our code will be released upon acceptance.

[CV-25] Adapted-MoE: Mixture of Experts with Test-Time Adaption for Anomaly Detection

链接: https://arxiv.org/abs/2409.05611
作者: Tianwu Lei,Silin Chen,Bohan Wang,Zhengkai Jiang,Ningmu Zou
关键词-EN: made remarkable progress, unsupervised anomaly detection, recently made remarkable, anomaly detection methods, detection methods based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Most unsupervised anomaly detection methods based on representations of normal samples to distinguish anomalies have recently made remarkable progress. However, existing methods only learn a single decision boundary for distinguishing the samples within the training dataset, neglecting the variation in feature distribution for normal samples even in the same category in the real world. Furthermore, they do not consider the distribution bias that still exists between the test set and the training set. Therefore, we propose an Adapted-MoE which contains a routing network and a series of expert models to handle multiple distributions of same-category samples by divide and conquer. Specifically, we propose a routing network based on representation learning to route same-category samples into the subclasses feature space. Then, a series of expert models are utilized to learn the representation of various normal samples and construct several independent decision boundaries. We propose the test-time adaption to eliminate the bias between the unseen test sample representation and the feature distribution learned by the expert model. Our experiments are conducted on a dataset that provides multiple subclasses from three categories, namely the Texture AD benchmark. The Adapted-MoE significantly improves the performance of the baseline model, achieving 2.18%-7.20% and 1.57%-16.30% increases in I-AUROC and P-AUROC, which outperforms the current state-of-the-art methods. Our code is available at this https URL.

[CV-26] CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization

链接: https://arxiv.org/abs/2409.05606
作者: Nan Chen,Mengqi Huang,Zhuowei Chen,Yang Zheng,Lei Zhang,Zhendong Mao
关键词-EN: drawn significant interest, contrastive learning, customization has drawn, academia and industry, drawn significant
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images based on unique subjects. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which will misconstrue the specific image’s irrelevant attributes (e.g., view, pose, and background) as the subject intrinsic attributes. This misconstruction leads to both overfitting and underfitting of the subject’s irrelevant and intrinsic attributes, i.e., these attributes are over-represented or under-represented simultaneously, causing a trade-off between similarity and controllability. In this study, we argue that an ideal subject representation can be achieved by a cross-differential perspective, i.e., decoupling subject intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus more on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects have distinguished differences). Specifically, we propose CustomContrast, a novel framework, which includes a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is used to extract intrinsic features of subjects from high-level semantics to low-level appearance through crossmodal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI encoder to capture cross-modal representations. Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.
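The intra-consistency / inter-distinctiveness idea described above is the standard contrastive-learning setup. As a hedged sketch (not the paper's MCL implementation; the vectors and temperature here are purely illustrative), an InfoNCE-style loss over one anchor, one positive, and a set of negatives can be written as:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: pull the anchor toward its
    positive and push it away from negatives. A generic sketch of
    intra-consistency / inter-distinctiveness, not the paper's code."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits = logits - logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

anchor = np.array([1.0, 0.0])
same_subject = np.array([0.9, 0.1])        # positive: feature of the same subject
other_subjects = [np.array([0.0, 1.0])]    # negative: a different subject
loss_good = info_nce(anchor, same_subject, other_subjects)
loss_bad = info_nce(anchor, other_subjects[0], [same_subject])
print(loss_good < loss_bad)  # True: matching same-subject features gives lower loss
```

Minimizing this loss makes same-subject features cluster while different-subject features separate, which is the mechanism the abstract's "intra-consistency" and "inter-distinctiveness" terms refer to.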

[CV-27] SynMorph: Generating Synthetic Face Morphing Dataset with Mated Samples

链接: https://arxiv.org/abs/2409.05595
作者: Haoyu Zhang,Raghavendra Ramachandra,Kiran Raja,Christoph Busch
关键词-EN: synthetic face morphing, morphing attack detection, face morphing dataset, face recognition systems, proposed synthetic face
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Face morphing attack detection (MAD) algorithms have become essential to overcome the vulnerability of face recognition systems. To solve the lack of large-scale and publicly available datasets due to privacy concerns and restrictions, in this work we propose a new method to generate a synthetic face morphing dataset with 2450 identities and more than 100k morphs. The proposed synthetic face morphing dataset is unique for its high-quality samples, different types of morphing algorithms, and the generalization for both single and differential morphing attack detection algorithms. For experiments, we apply face image quality assessment and vulnerability analysis to evaluate the proposed synthetic face morphing dataset from the perspective of biometric sample quality and morphing attack potential on face recognition systems. The results are benchmarked against an existing SOTA synthetic dataset and a representative non-synthetic dataset, and indicate improvement compared with the SOTA. Additionally, we design different protocols and study the applicability of using the proposed synthetic dataset on training morphing attack detection algorithms.
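For intuition, the core of a face morph is a blend of two identities. The sketch below is a deliberately naive pixel-level alpha blend (real morphing pipelines, including those used for such datasets, first warp facial landmarks before blending; the arrays here are dummy placeholders):

```python
import numpy as np

def naive_morph(face_a, face_b, alpha=0.5):
    """Pixel-level alpha blend of two pre-aligned face images.
    The simplest landmark-free morph; actual morphing algorithms
    warp landmark geometry before blending appearance."""
    return ((1.0 - alpha) * face_a + alpha * face_b).astype(face_a.dtype)

# Dummy 4x4 grayscale "faces" for illustration only
face_a = np.zeros((4, 4), dtype=np.float64)
face_b = np.full((4, 4), 100.0)
morph = naive_morph(face_a, face_b, alpha=0.5)
print(morph[0, 0])  # 50.0: halfway between the two source pixels
```

A morph like this carries traits of both contributing identities, which is exactly what makes morphs dangerous for face recognition systems and motivates MAD training data.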

[CV-28] DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification

链接: https://arxiv.org/abs/2409.05587
作者: Junzhou Chen,Zirui Zhang,Jing Yu,Heqiang Huang,Ronghui Zhang,Xuemiao Xu,Bin Sheng,Hong Yan
关键词-EN: Driver distraction remains, Driver distraction, traffic accidents, posing a critical, remains a leading
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Driver distraction remains a leading cause of traffic accidents, posing a critical threat to road safety globally. As intelligent transportation systems evolve, accurate and real-time identification of driver distraction has become essential. However, existing methods struggle to capture both global contextual and fine-grained local features while contending with noisy labels in training datasets. To address these challenges, we propose DSDFormer, a novel framework that integrates the strengths of Transformer and Mamba architectures through a Dual State Domain Attention (DSDA) mechanism, enabling a balance between long-range dependencies and detailed feature extraction for robust driver behavior recognition. Additionally, we introduce Temporal Reasoning Confident Learning (TRCL), an unsupervised approach that refines noisy labels by leveraging spatiotemporal correlations in video sequences. Our model achieves state-of-the-art performance on the AUC-V1, AUC-V2, and 100-Driver datasets and demonstrates real-time processing efficiency on the NVIDIA Jetson AGX Orin platform. Extensive experimental results confirm that DSDFormer and TRCL significantly improve both the accuracy and robustness of driver distraction detection, offering a scalable solution to enhance road safety.

[CV-29] Latent 3D Brain MRI Counterfactual

链接: https://arxiv.org/abs/2409.05585
作者: Wei Peng,Tian Xia,Fabio De Sousa Ribeiro,Tomas Bosschieter,Ehsan Adeli,Qingyu Zhao,Ben Glocker,Kilian M. Pohl
关键词-EN: properly train deep, train deep learning, deep learning models, brain MRI studies, number of samples
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The number of samples in structural brain MRI studies is often too small to properly train deep learning models. Generative models show promise in addressing this issue by effectively learning the data distribution and generating high-fidelity MRI. However, they struggle to produce diverse, high-quality data outside the distribution defined by the training data. One way to address the issue is using causal models developed for 3D volume counterfactuals. However, accurately modeling causality in high-dimensional spaces is a challenge, so these models generally generate 3D brain MRIs of lower quality. To address these challenges, we propose a two-stage method that constructs a Structural Causal Model (SCM) within the latent space. In the first stage, we employ a VQ-VAE to learn a compact embedding of the MRI volume. Subsequently, we integrate our causal model into this latent space and execute a three-step counterfactual procedure using a closed-form Generalized Linear Model (GLM). Our experiments conducted on real-world high-resolution MRI data (1mm) demonstrate that our method can generate high-quality 3D MRI counterfactuals.

[CV-30] LEROjD: Lidar Extended Radar-Only Object Detection ECCV2024

链接: https://arxiv.org/abs/2409.05564
作者: Patrick Palmer,Martin Krüger,Stefan Schütte,Richard Altendorfer,Ganesh Adam,Torsten Bertram
关键词-EN: automated driving, vital for automated, Accurate, object detectors, radar-only object detectors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted for publication at ECCV 2024

点击查看摘要

Abstract:Accurate 3D object detection is vital for automated driving. While lidar sensors are well suited for this task, they are expensive and have limitations in adverse weather conditions. 3+1D imaging radar sensors offer a cost-effective, robust alternative but face challenges due to their low resolution and high measurement noise. Existing 3+1D imaging radar datasets include radar and lidar data, enabling cross-modal model improvements. Although lidar should not be used during inference, it can aid the training of radar-only object detectors. We explore two strategies to transfer knowledge from the lidar to the radar domain and radar-only object detectors: 1. multi-stage training with sequential lidar point cloud thin-out, and 2. cross-modal knowledge distillation. In the multi-stage process, three thin-out methods are examined. Our results show significant performance gains of up to 4.2 percentage points in mean Average Precision with multi-stage training and up to 3.9 percentage points with knowledge distillation by initializing the student with the teacher’s weights. The main benefit of these approaches is their applicability to other 3D object detection networks without altering their architecture, as we show by analyzing it on two different object detectors. Our code is available at this https URL
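The first strategy's "sequential lidar point cloud thin-out" can be illustrated with random subsampling, one plausible thin-out variant (the paper examines three thin-out methods that are not specified in the abstract, so treat this as a hedged sketch with illustrative fractions):

```python
import numpy as np

def thin_out(points, keep_fraction, seed=0):
    """Randomly subsample a lidar point cloud (N x 4: x, y, z, intensity)
    down to keep_fraction of its points. One simple thin-out variant;
    other schemes (e.g. range- or beam-based) are equally valid."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(points) * keep_fraction))
    idx = rng.choice(len(points), size=n_keep, replace=False)
    return points[idx]

cloud = np.random.default_rng(1).normal(size=(1000, 4))

# Multi-stage schedule: progressively withdraw lidar supervision so the
# detector relies more and more on radar alone (fractions are illustrative).
for frac in (1.0, 0.5, 0.1):
    stage_cloud = thin_out(cloud, frac)
    print(frac, len(stage_cloud))
```

The point is that the architecture never changes between stages; only the auxiliary lidar input shrinks, which is why the approach transfers to other 3D detectors.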

[CV-31] Seeing Through the Mask: Rethinking Adversarial Examples for CAPTCHAs

链接: https://arxiv.org/abs/2409.05558
作者: Yahya Jabary,Andreas Plesner,Turlan Kuzhagaliyev,Roger Wattenhofer
关键词-EN: CAPTCHAs rely heavily, rely heavily, Modern CAPTCHAs rely, models, CAPTCHAs rely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Modern CAPTCHAs rely heavily on vision tasks that are supposedly hard for computers but easy for humans. However, advances in image recognition models pose a significant threat to such CAPTCHAs. These models can easily be fooled by generating some well-hidden “random” noise and adding it to the image, or hiding objects in the image. However, these methods are model-specific and thus can not aid CAPTCHAs in fooling all models. We show in this work that by allowing for more significant changes to the images while preserving the semantic information and keeping it solvable by humans, we can fool many state-of-the-art models. Specifically, we demonstrate that by adding masks of various intensities the Accuracy @ 1 (Acc@1) drops by more than 50%-points for all models, and supposedly robust models such as vision transformers see an Acc@1 drop of 80%-points. These masks can therefore effectively fool modern image classifiers, thus showing that machines have not caught up with humans – yet.

[CV-32] Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations ICASSP2025

链接: https://arxiv.org/abs/2409.05552
作者: Xuesong Zhang,Jia Li,Yunbo Xu,Zhenzhen Hu,Richang Hong
关键词-EN: natural language instructions, language instructions remains, embodied agent guided, Autonomous navigation, guided by natural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures, submitted to ICASSP 2025

点击查看摘要

Abstract:Autonomous navigation for an embodied agent guided by natural language instructions remains a formidable challenge in vision-and-language navigation (VLN). Despite remarkable recent progress in learning fine-grained and multifarious visual representations, the tendency to overfit to the training environments leads to unsatisfactory generalization performance. In this work, we present a versatile Multi-Branch Architecture (MBA) aimed at exploring and exploiting diverse visual inputs. Specifically, we introduce three distinct visual variants: ground-truth depth images, visual inputs integrated with incongruent views, and those infused with random noise to enrich the diversity of visual input representation and prevent overfitting to the original RGB observations. To adaptively fuse these varied inputs, the proposed MBA extends a base agent model into a multi-branch variant, where each branch processes a different visual input. Surprisingly, even random noise can further enhance navigation performance in unseen environments. Extensive experiments conducted on three VLN benchmarks (R2R, REVERIE, SOON) demonstrate that our proposed method equals or even surpasses state-of-the-art results. The source code will be publicly available.

[CV-33] Exploring Rich Subjective Quality Information for Image Quality Assessment in the Wild

链接: https://arxiv.org/abs/2409.05540
作者: Xiongkuo Min,Yixuan Gao,Yuqin Cao,Guangtao Zhai,Wenjun Zhang,Huifang Sun,Chang Wen Chen
关键词-EN: opinion scores, rich subjective quality, subjective quality information, image quality, image quality assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Traditional in-the-wild image quality assessment (IQA) models are generally trained with the quality labels of mean opinion score (MOS), while missing the rich subjective quality information contained in the quality ratings, for example, the standard deviation of opinion scores (SOS) or even the distribution of opinion scores (DOS). In this paper, we propose a novel IQA method named RichIQA to explore the rich subjective rating information beyond MOS to predict image quality in the wild. RichIQA is characterized by two key novel designs: (1) a three-stage image quality prediction network which exploits the powerful feature representation capability of the Convolutional vision Transformer (CvT) and mimics the short-term and long-term memory mechanisms of the human brain; (2) a multi-label training strategy in which rich subjective quality information like MOS, SOS and DOS is concurrently used to train the quality prediction network. Powered by these two novel designs, RichIQA is able to predict image quality in terms of a distribution, from which the mean image quality can subsequently be obtained. Extensive experimental results verify that the three-stage network is tailored to predict rich quality information, while the multi-label training strategy can fully exploit the potential within subjective quality ratings and enhance the prediction performance and generalizability of the network. RichIQA outperforms state-of-the-art competitors on multiple large-scale in-the-wild IQA databases with rich subjective rating labels. The code of RichIQA will be made publicly available on GitHub.
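The three subjective-quality labels named above (MOS, SOS, DOS) are simple statistics of the raw opinion scores. A minimal sketch of how each is computed from one image's ratings (the rating values and 5-point scale are illustrative, not from the paper):

```python
import numpy as np

def summarize_ratings(scores, levels=(1, 2, 3, 4, 5)):
    """Summarize raw opinion scores into MOS, SOS, and DOS.

    scores: individual ratings for one image.
    levels: the discrete rating scale used by the study.
    Returns (mos, sos, dos), where dos is the empirical
    distribution of opinion scores over the rating levels.
    """
    scores = np.asarray(scores, dtype=float)
    mos = scores.mean()       # mean opinion score
    sos = scores.std(ddof=0)  # standard deviation of opinion scores
    dos = np.array([(scores == lvl).mean() for lvl in levels])
    return mos, sos, dos

mos, sos, dos = summarize_ratings([4, 5, 4, 3, 4])
print(mos)  # 4.0
print(dos)  # fraction of raters at each level 1..5
```

Two images can share a MOS yet differ sharply in SOS and DOS, which is precisely the information a MOS-only training target throws away.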

[CV-34] HMAFlow: Learning More Accurate Optical Flow via Hierarchical Motion Field Alignment

链接: https://arxiv.org/abs/2409.05531
作者: Dianbo Ma,Kousuke Imamura,Ziyan Gao,Xiangjie Wang,Satoshi Yamane
关键词-EN: Optical flow estimation, long-standing visual task, improve optical flow, Optical flow, Motion Field Alignment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Optical flow estimation is a fundamental and long-standing visual task. In this work, we present a novel method, dubbed HMAFlow, to improve optical flow estimation in challenging scenes, especially those containing small objects. The proposed model mainly consists of two core components: a Hierarchical Motion Field Alignment (HMA) module and a Correlation Self-Attention (CSA) module. In addition, we rebuild 4D cost volumes by employing a Multi-Scale Correlation Search (MCS) layer and replacing average pooling in common cost volumes with a search strategy using multiple search ranges. Experimental results demonstrate that our model achieves the best generalization performance in comparison to other state-of-the-art methods. Specifically, compared with RAFT, our method achieves relative error reductions of 14.2% and 3.4% on the clean pass and final pass of the Sintel online benchmark, respectively. On the KITTI test benchmark, HMAFlow surpasses RAFT and GMA in the Fl-all metric by a relative margin of 6.8% and 7.7%, respectively. To facilitate future research, our code will be made available at this https URL.

[CV-35] An Atmospheric Correction Integrated LULC Segmentation Model for High-Resolution Satellite Imagery

链接: https://arxiv.org/abs/2409.05494
作者: Soham Mukherjee,Yash Dixit,Naman Srivastava,Joel D Joy,Rohan Olikara,Koesha Sinha,Swarup E,Rakshit Ramesh
关键词-EN: deep learning models, land cover, measured Digital Number, fine-scale multispectral imagery, integration of fine-scale
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The integration of fine-scale multispectral imagery with deep learning models has revolutionized land use and land cover (LULC) classification. However, the atmospheric effects present in Top-of-Atmosphere sensor measured Digital Number values must be corrected to retrieve accurate Bottom-of-Atmosphere surface reflectance for reliable analysis. This study employs look-up-table-based radiative transfer simulations to estimate the atmospheric path reflectance and transmittance for atmospherically correcting high-resolution CARTOSAT-3 Multispectral (MX) imagery for several Indian cities. The corrected surface reflectance data were subsequently used in supervised and semi-supervised segmentation models, demonstrating stability in multi-class (buildings, roads, trees and water bodies) LULC segmentation accuracy, particularly in scenarios with sparsely labelled data.
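The Digital-Number-to-surface-reflectance conversion mentioned above can be illustrated with the textbook first-order correction: subtract the atmospheric path radiance from the sensor-measured radiance, then normalize by transmittance and solar irradiance. This is a hedged simplification of the physics, not the paper's look-up-table radiative transfer pipeline, and all numeric values are illustrative:

```python
import math

def boa_reflectance(toa_radiance, path_radiance, transmittance, irradiance, cos_sza):
    """First-order atmospheric correction: Top-of-Atmosphere radiance
    minus the atmospheric path radiance, scaled to Bottom-of-Atmosphere
    surface reflectance. cos_sza is the cosine of the solar zenith angle."""
    return math.pi * (toa_radiance - path_radiance) / (transmittance * irradiance * cos_sza)

# Illustrative band values (radiance in W m^-2 sr^-1 um^-1)
rho = boa_reflectance(toa_radiance=80.0, path_radiance=20.0,
                      transmittance=0.8, irradiance=1500.0, cos_sza=0.9)
print(round(rho, 4))  # 0.1745
```

Feeding reflectance rather than raw Digital Numbers to the segmentation model is what makes the LULC classes comparable across scenes and acquisition dates.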

[CV-36] A Taxonomy of Miscompressions: Preparing Image Forensics for Neural Compression

链接: https://arxiv.org/abs/2409.05490
作者: Nora Hofer,Rainer Böhme
关键词-EN: revolutionize lossy image, Neural compression, potential to revolutionize, revolutionize lossy, Neural
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 6 figures

点击查看摘要

Abstract:Neural compression has the potential to revolutionize lossy image compression. Based on generative models, recent schemes achieve unprecedented compression rates at high perceptual quality but compromise semantic fidelity. Details of decompressed images may appear optically flawless but semantically different from the originals, making compression errors difficult or impossible to detect. We explore the problem space and propose a provisional taxonomy of miscompressions. It defines three types of ‘what happens’ and has a binary ‘high impact’ flag indicating miscompressions that alter symbols. We discuss how the taxonomy can facilitate risk communication and research into mitigations.

[CV-37] PVP-Recon: Progressive View Planning via Warping Consistency for Sparse-View Surface Reconstruction

链接: https://arxiv.org/abs/2409.05474
作者: Sheng Ye,Yuze He,Matthieu Lin,Jenny Sheng,Ruoyu Fan,Yiheng Han,Yubin Hu,Ran Yi,Yu-Hui Wen,Yong-Jin Liu,Wenping Wang
关键词-EN: revolutionized dense multi-view, performance significantly diminishes, Neural implicit representations, dense multi-view surface, implicit representations
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Neural implicit representations have revolutionized dense multi-view surface reconstruction, yet their performance significantly diminishes with sparse input views. A few pioneering works have sought to tackle the challenge of sparse-view reconstruction by leveraging additional geometric priors or multi-scene generalizability. However, they are still hindered by the imperfect choice of input views, using images under empirically determined viewpoints to provide considerable overlap. We propose PVP-Recon, a novel and effective sparse-view surface reconstruction method that progressively plans the next best views to form an optimal set of sparse viewpoints for image capturing. PVP-Recon starts initial surface reconstruction with as few as 3 views and progressively adds new views which are determined based on a novel warping score that reflects the information gain of each newly added view. This progressive view planning process is interleaved with a neural SDF-based reconstruction module that utilizes multi-resolution hash features, enhanced by a progressive training scheme and a directional Hessian loss. Quantitative and qualitative experiments on three benchmark datasets show that our framework achieves high-quality reconstruction with a constrained input budget and outperforms existing baselines.

[CV-38] Proto-OOD: Enhancing OOD Object Detection with Prototype Feature Similarity

链接: https://arxiv.org/abs/2409.05466
作者: Junkun Chen,Jilin Mei,Liang Chen,Fangzhou Zhao,Yu Hu
关键词-EN: object detectors commonly, limited training samples, detectors commonly result, low accuracy, object detectors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14pages

点击查看摘要

Abstract:The limited training samples for object detectors commonly result in low accuracy out-of-distribution (OOD) object detection. We have observed that feature vectors of the same class tend to cluster tightly in feature space, whereas those of different classes are more scattered. This insight motivates us to leverage feature similarity for OOD detection. Drawing on the concept of prototypes prevalent in few-shot learning, we introduce a novel network architecture, Proto-OOD, designed for this purpose. Proto-OOD enhances prototype representativeness through contrastive loss and identifies OOD data by assessing the similarity between input features and prototypes. It employs a negative embedding generator to create negative embeddings, which are then used to train the similarity module. Proto-OOD achieves a significantly lower FPR95 on the MS-COCO dataset and a higher mAP on the Pascal VOC dataset, when utilizing Pascal VOC as the ID dataset and MS-COCO as the OOD dataset. Additionally, we identify limitations in existing evaluation metrics and propose an enhanced evaluation protocol.
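The prototype-similarity idea above has a simple core: build one mean feature vector per class, then flag inputs whose features are far from every prototype. A minimal sketch (toy 2-D features; the actual Proto-OOD also uses a learned similarity module and contrastive prototype refinement):

```python
import numpy as np

def class_prototypes(features, labels):
    """Mean feature vector per class (the 'prototype')."""
    classes = np.unique(labels)
    return classes, np.stack([features[labels == c].mean(axis=0) for c in classes])

def ood_score(x, prototypes):
    """Higher score = more OOD: 1 minus the maximum cosine similarity
    between the input feature and any class prototype."""
    x = x / np.linalg.norm(x)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return 1.0 - float(np.max(p @ x))

# Tightly clustered in-distribution features for two classes
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
_, protos = class_prototypes(feats, labels)

id_sample = np.array([0.95, 0.05])   # near the class-0 prototype
ood_sample = np.array([-1.0, -1.0])  # far from both prototypes
print(ood_score(id_sample, protos) < ood_score(ood_sample, protos))  # True
```

Thresholding this score is what turns the clustering observation in the abstract into an actual OOD detector.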

[CV-39] DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

链接: https://arxiv.org/abs/2409.05463
作者: Wei Wu,Xi Guo,Weixuan Tang,Tingxuan Huang,Chiyu Wang,Dongyue Chen,Chenjing Ding
关键词-EN: Recent advancements, provided promising solutions, synthesizing realistic driving, provided promising, synthesizing realistic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation due to the challenges of integrating 3D information while maintaining spatial-temporal consistency and effectively learning from a unified model. In this paper, we propose an end-to-end framework named DriveScape for multi-view, 3D condition-guided video generation. DriveScape not only streamlines the process by integrating camera data to ensure comprehensive spatial-temporal coverage, but also introduces a Bi-Directional Modulated Transformer module to effectively align 3D road structural information. As a result, our approach enables precise control over video generation, significantly enhancing realism and providing a robust solution for generating multi-view driving videos. Our framework achieves state-of-the-art results on the nuScenes dataset, demonstrating impressive generative quality metrics with an FID score of 8.34 and an FVD score of 76.39, as well as superior performance across various perception tasks. This paves the way for more accurate environmental simulations in autonomous driving. Code will be available at our project homepage.

[CV-40] EndoOmni: Zero-Shot Cross-Dataset Depth Estimation in Endoscopy by Robust Self-Learning from Noisy Labels

链接: https://arxiv.org/abs/2409.05442
作者: Qingyao Tian,Zhen Chen,Huai Liao,Xinyan Huang,Lujie Li,Sebastien Ourselin,Hongbin Liu
关键词-EN: Single-image depth estimation, Single-image depth, augmented reality, depth estimation, depth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Single-image depth estimation is essential for endoscopy tasks such as localization, reconstruction, and augmented reality. Most existing methods in surgical scenes focus on in-domain depth estimation, limiting their real-world applicability. This constraint stems from the scarcity and inferior labeling quality of medical data for training. In this work, we present EndoOmni, the first foundation model for zero-shot cross-domain depth estimation for endoscopy. To harness the potential of diverse training data, we refine the advanced self-learning paradigm that employs a teacher model to generate pseudo-labels, guiding a student model trained on large-scale labeled and unlabeled data. To address training disturbance caused by inherent noise in depth labels, we propose a robust training framework that leverages both depth labels and estimated confidence from the teacher model to jointly guide the student model training. Moreover, we propose a weighted scale-and-shift invariant loss to adaptively adjust learning weights based on label confidence, thus imposing a learning bias towards cleaner label pixels while reducing the influence of highly noisy pixels. Experiments on zero-shot relative depth estimation show that our EndoOmni improves over state-of-the-art methods in medical imaging by 41% and over existing foundation models by 25% in terms of absolute relative error on specific datasets. Furthermore, our model provides strong initialization for fine-tuning to metric depth estimation, maintaining superior performance in both in-domain and out-of-domain scenarios. The source code will be publicly available.
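A scale-and-shift invariant loss of the kind mentioned above first solves for the scale and shift that best align the prediction to the target, then measures error on the aligned prediction; confidence weighting then down-weights noisy pixels. A hedged NumPy sketch (the exact weighting scheme and robust norm in the paper may differ):

```python
import numpy as np

def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s*pred + t best matches
    target -- the alignment underlying scale-and-shift invariant losses."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target, rcond=None)
    return s, t

def weighted_ssi_loss(pred, target, weights):
    """Confidence-weighted mean absolute error after scale/shift alignment."""
    s, t = align_scale_shift(pred, target)
    return float(np.average(np.abs(s * pred + t - target), weights=weights))

target = np.array([1.0, 2.0, 3.0, 4.0])
pred = 0.5 * target + 2.0         # same depth structure up to scale and shift
w = np.ones_like(target)          # per-pixel pseudo-label confidences
print(weighted_ssi_loss(pred, target, w))  # ~0.0: alignment removes scale/shift
```

Because the alignment absorbs any global scale and shift, a prediction that is correct only up to those two factors incurs zero loss, which is exactly what relative depth estimation requires.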

[CV-41] TextToucher: Fine-Grained Text-to-Touch Generation

链接: https://arxiv.org/abs/2409.05427
作者: Jiahang Tu,Hao Fu,Fengyu Yang,Hanbin Zhao,Chao Zhang,Hui Qian
关键词-EN: Tactile, Tactile sensation plays, embodied intelligence, plays a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Tactile sensation plays a crucial role in the development of multi-modal large models and embodied intelligence. To collect tactile data at as little cost as possible, a series of studies have attempted to generate tactile images by vision-to-touch image translation. However, compared to the text modality, visual modality-driven tactile generation cannot accurately depict human tactile sensation. In this work, we analyze the characteristics of tactile images in detail from two granularities: object-level (tactile texture, tactile shape), and sensor-level (gel status). We model these granularities of information through text descriptions and propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples. Specifically, we introduce a multimodal large language model to build the text sentences about object-level tactile information and employ a set of learnable text prompts to represent the sensor-level tactile information. To better guide the tactile generation process with the built text information, we fuse the dual grains of text information and explore various dual-grain text conditioning methods within the diffusion transformer architecture. Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to precisely evaluate the quality of text-driven generated tactile data. Extensive experiments demonstrate the superiority of our TextToucher method. The source codes will be available at this https URL.

[CV-42] Distribution Discrepancy and Feature Heterogeneity for Active 3D Object Detection

链接: https://arxiv.org/abs/2409.05425
作者: Huang-Yu Chen,Jia-Fong Yeh,Jia-Wei Liao,Pin-Hsuan Peng,Winston H. Hsu
关键词-EN: object detection, driving and robotics, critical technology, development of autonomous, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to CoRL 2024

点击查看摘要

Abstract:LiDAR-based 3D object detection is a critical technology for the development of autonomous driving and robotics. However, the high cost of data annotation limits its advancement. We propose a novel and effective active learning (AL) method called Distribution Discrepancy and Feature Heterogeneity (DDFH), which simultaneously considers geometric features and model embeddings, assessing information from both the instance-level and frame-level perspectives. Distribution Discrepancy evaluates the difference and novelty of instances within the unlabeled and labeled distributions, enabling the model to learn efficiently with limited data. Feature Heterogeneity ensures the heterogeneity of intra-frame instance features, maintaining feature diversity while avoiding redundant or similar instances, thus minimizing annotation costs. Finally, multiple indicators are efficiently aggregated using Quantile Transform, providing a unified measure of informativeness. Extensive experiments demonstrate that DDFH outperforms the current state-of-the-art (SOTA) methods on the KITTI and Waymo datasets, effectively reducing the bounding box annotation cost by 56.3% and showing robustness when working with both one-stage and two-stage models.
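The "Quantile Transform" aggregation step mentioned above maps each indicator onto its empirical quantiles so that indicators on incompatible scales become directly comparable before averaging. A minimal sketch (indicator names and values are illustrative, not the paper's code):

```python
import numpy as np

def quantile_transform(x):
    """Map raw scores to their empirical quantiles in [0, 1], making
    heterogeneous indicators directly comparable."""
    ranks = np.argsort(np.argsort(x))  # 0..n-1 rank of each sample
    return ranks / (len(x) - 1)

def aggregate(indicators):
    """indicators: dict name -> per-sample raw scores.
    Returns a unified informativeness score per sample as the mean of
    the quantile-transformed indicators."""
    q = [quantile_transform(np.asarray(v, dtype=float)) for v in indicators.values()]
    return np.mean(q, axis=0)

scores = aggregate({
    "distribution_discrepancy": [0.2, 5.0, 1.3],    # arbitrary scale
    "feature_heterogeneity":    [10.0, 80.0, 40.0], # different scale entirely
})
print(scores)  # comparable despite mismatched raw scales
```

The active-learning loop would then label the frames with the highest aggregated score, regardless of which raw scale each indicator used.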

[CV-43] AD-Net: Attention-based dilated convolutional residual network with guided decoder for robust skin lesion segmentation

链接: https://arxiv.org/abs/2409.05420
作者: Asim Naveed,Syed S. Naqvi,Tariq M. Khan,Shahzaib Iqbal,M. Yaqoob Wani,Haroon Ahmed Khan
关键词-EN: computer-aided diagnosis tools, skin cancer treatment, skin lesion segmentation, diagnosis tools employed, computer-aided diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In computer-aided diagnosis tools employed for skin cancer treatment and early diagnosis, skin lesion segmentation is important. However, achieving precise segmentation is challenging due to inherent variations in appearance, contrast, texture, and blurry lesion boundaries. This research presents a robust approach utilizing a dilated convolutional residual network, which incorporates an attention-based spatial feature enhancement block (ASFEB) and employs a guided decoder strategy. In each dilated convolutional residual block, dilated convolution is employed to broaden the receptive field with varying dilation rates. To improve the spatial feature information of the encoder, we employed an attention-based spatial feature enhancement block in the skip connections. The ASFEB in our proposed method combines feature maps obtained from average and maximum-pooling operations. These combined features are then weighted using the active outcome of global average pooling and convolution operations. Additionally, we have incorporated a guided decoder strategy, where each decoder block is optimized using an individual loss function to enhance the feature learning process in the proposed AD-Net. The proposed AD-Net presents a significant benefit by necessitating fewer model parameters compared to its peer methods. This reduction in parameters directly impacts the number of labeled data required for training, facilitating faster convergence during the training process. The effectiveness of the proposed AD-Net was evaluated using four public benchmark datasets. We conducted a Wilcoxon signed-rank test to verify the efficiency of the AD-Net. The outcomes suggest that our method surpasses other cutting-edge methods in performance, even without the implementation of data augmentation strategies.
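The ASFEB description — combining average- and max-pooled feature maps, then weighting with the outcome of global average pooling and convolution — can be sketched in NumPy. `asfeb_sketch` is a hypothetical simplification in which a sigmoid gate stands in for the learned convolutions:

```python
import numpy as np

def asfeb_sketch(feat):
    """Hypothetical ASFEB-like block: fuse channel-wise average- and
    max-pooled maps, and reweight channels with a gate derived from
    global average pooling (a sigmoid stands in for the learned
    convolutions of the real block).
    feat: (C, H, W) encoder feature map."""
    avg_map = feat.mean(axis=0, keepdims=True)   # (1, H, W)
    max_map = feat.max(axis=0, keepdims=True)    # (1, H, W)
    spatial = avg_map + max_map                  # combined spatial descriptor
    gate = 1.0 / (1.0 + np.exp(-feat.mean(axis=(1, 2))))  # (C,) channel gate
    return feat * gate[:, None, None] + spatial  # enhanced skip feature

feat = np.random.default_rng(0).normal(size=(4, 8, 8))
out = asfeb_sketch(feat)  # same shape as the input feature map
```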

[CV-44] CipherDM: Secure Three-Party Inference for Diffusion Model Sampling

链接: https://arxiv.org/abs/2409.05414
作者: Xin Zhao,Xiaojun Chen,Xudong Chen,He Li,Tingyu Fan,Zhendong Zhao
关键词-EN: Diffusion Models, synthesis results, results in image, image generation, Diffusion
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Models (DMs) achieve state-of-the-art synthesis results in image generation and have been applied to various fields. However, DMs sometimes seriously violate user privacy during usage, making the protection of privacy an urgent issue. Using traditional privacy computing schemes like Secure Multi-Party Computation (MPC) directly in DMs faces significant computation and communication challenges. To address these issues, we propose CipherDM, the first novel, versatile and universal framework applying MPC technology to DMs for secure sampling, which can be widely implemented on multiple DM based tasks. We thoroughly analyze sampling latency breakdown, find time-consuming parts and design corresponding secure MPC protocols for computing nonlinear activations including SoftMax, SiLU and Mish. CipherDM is evaluated on popular architectures (DDPM, DDIM) using the MNIST dataset and on SD deployed by diffusers. Compared to direct implementation on SPU, our approach improves running time by approximately 1.084×~2.328×, and reduces communication costs by approximately 1.212×~1.791×.
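CipherDM builds on three-party MPC. The underlying primitive, additive secret sharing, is easy to sketch (a generic toy of the sharing scheme only; CipherDM's actual protocols for SoftMax, SiLU, and Mish are far more involved):

```python
import random

MOD = 2**32  # toy ring for additive sharing

def share3(x):
    """Split integer x into three additive shares mod MOD; any two
    shares alone reveal nothing about x (a generic toy of the sharing
    behind three-party MPC, not CipherDM's protocols)."""
    s1, s2 = random.randrange(MOD), random.randrange(MOD)
    return s1, s2, (x - s1 - s2) % MOD

def reconstruct(shares):
    return sum(shares) % MOD

# Secure addition: each party adds its shares locally, no value leaks.
a_sh, b_sh = share3(41), share3(1)
sum_sh = tuple((a + b) % MOD for a, b in zip(a_sh, b_sh))
```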

[CV-45] From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

链接: https://arxiv.org/abs/2409.05413
作者: Tessa Pulli,Stefan Thalhammer,Simon Schwaiger,Markus Vincze
关键词-EN: Robots are increasingly, increasingly envisioned, envisioned to interact, continuously adapt, Robots
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object based on the relevancy map of a language-embedded NeRF reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF’s suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities on an instance- and category-level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.

[CV-46] KRONC: Keypoint-based Robust Camera Optimization for 3D Car Reconstruction ECCV

链接: https://arxiv.org/abs/2409.05407
作者: Davide Di Nucci,Alessandro Simoni,Matteo Tomei,Luca Ciuffreda,Roberto Vezzani,Rita Cucchiara
关键词-EN: widely discussed topic, gained additional attention, NeRF-based approaches, set of images, widely discussed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCVW

点击查看摘要

Abstract:The three-dimensional representation of objects or scenes starting from a set of images has been a widely discussed topic for years and has gained additional attention after the diffusion of NeRF-based approaches. However, an underestimated prerequisite is the knowledge of camera poses or, more specifically, the estimation of the extrinsic calibration parameters. Although excellent general-purpose Structure-from-Motion methods are available as a pre-processing step, their computational load is high and they require a lot of frames to guarantee sufficient overlapping among the views. This paper introduces KRONC, a novel approach aimed at inferring view poses by leveraging prior knowledge about the object to reconstruct and its representation through semantic keypoints. With a focus on vehicle scenes, KRONC is able to estimate the position of the views as a solution to a light optimization problem targeting the convergence of keypoints’ back-projections to a singular point. To validate the method, a specific dataset of real-world car scenes has been collected. Experiments confirm KRONC’s ability to generate excellent estimates of camera poses starting from very coarse initialization. Results are comparable with Structure-from-Motion methods with huge savings in computation. Code and data will be made publicly available.
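KRONC optimizes view poses so that keypoint back-projections converge to a single point. The geometric core of that objective — the least-squares intersection of back-projected rays — can be sketched as follows (a toy of the objective only, not KRONC's full pose optimization):

```python
import numpy as np

def intersect_rays(origins, dirs):
    """Least-squares 3D point closest to a bundle of rays
    origin + t * dir: the geometric core of driving keypoint
    back-projections to converge to a single point (a toy of the
    objective, not KRONC's full pose optimization)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, dirs):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

# Two back-projected rays that truly meet at (1, 1, 0).
origins = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
dirs = np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0]])
point = intersect_rays(origins, dirs)
```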

[CV-47] A Survey of Multimodal Composite Editing and Retrieval

链接: https://arxiv.org/abs/2409.05405
作者: Suyan Li,Fuxiang Huang,Lei Zhang
关键词-EN: improve retrieval systems, Multimodal composite retrieval, composite retrieval, real world, focus of research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: 22 pages, 3 figures, and 11 tables

点击查看摘要

Abstract:In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, image, and audio to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. In this survey, we systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in the large-model era, and the field has also seen surveys on multimodal learning and on vision-language models with Transformers published in the PAMI journal. To the best of our knowledge, this survey is the first comprehensive review of the literature on multimodal composite retrieval, a timely complement on multimodal fusion to the existing reviews. To help readers quickly track this field, we build a project page for this survey, which can be found at this https URL.

[CV-48] Sequential Posterior Sampling with Diffusion Models

链接: https://arxiv.org/abs/2409.05399
作者: Tristan S.W. Stevens,Oisín Nolan,Jean-Luc Robert,Ruud J.G. van Sloun
关键词-EN: perform effective posterior, model complex distributions, effective posterior sampling, quickly risen, risen in popularity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures, preprint

点击查看摘要

Abstract:Diffusion models have quickly risen in popularity for their ability to model complex distributions and perform effective posterior sampling. Unfortunately, the iterative nature of these generative models makes them computationally expensive and unsuitable for real-time sequential inverse problems such as ultrasound imaging. Considering the strong temporal structure across sequences of frames, we propose a novel approach that models the transition dynamics to improve the efficiency of sequential diffusion posterior sampling in conditional image synthesis. Through modeling sequence data using a video vision transformer (ViViT) transition model based on previous diffusion outputs, we can initialize the reverse diffusion trajectory at a lower noise scale, greatly reducing the number of iterations required for convergence. We demonstrate the effectiveness of our approach on a real-world dataset of high frame rate cardiac ultrasound images and show that it achieves the same performance as a full diffusion trajectory while accelerating inference 25×, enabling real-time posterior sampling. Furthermore, we show that the addition of a transition model improves the PSNR up to 8% in cases with severe motion. Our method opens up new possibilities for real-time applications of diffusion models in imaging and other domains requiring real-time inference.
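The warm-start idea above — initializing the reverse trajectory at a lower noise scale from a transition-model prediction — amounts to re-noising the prediction to an intermediate timestep. A toy sketch under a standard DDPM-style noise schedule (an assumption; the paper's actual schedule and ViViT transition model are not reproduced):

```python
import numpy as np

def warm_start(prediction, t, alpha_bar, rng):
    """Re-noise a transition-model prediction to timestep t so the
    reverse trajectory can start there instead of at pure noise
    (DDPM-style forward noising; the ViViT transition model itself is
    assumed, not implemented)."""
    eps = rng.normal(size=prediction.shape)
    return np.sqrt(alpha_bar[t]) * prediction + np.sqrt(1.0 - alpha_bar[t]) * eps

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # toy noise schedule
pred = np.zeros(16)                                       # stand-in prediction
x_t = warm_start(pred, t=40, alpha_bar=alpha_bar, rng=np.random.default_rng(0))
steps_saved = T - 40  # reverse diffusion now needs 40 steps, not 1000
```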

[CV-49] FacialFlowNet: Advancing Facial Optical Flow Estimation with a Diverse Dataset and a Decomposed Model

链接: https://arxiv.org/abs/2409.05396
作者: Jianzhi Lu,Ruian He,Shili Zhou,Weimin Tan,Bo Yan
关键词-EN: facial optical flow, optical flow, Facial movements play, Facial, flow
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ACMMM2024

点击查看摘要

Abstract:Facial movements play a crucial role in conveying attitude and intentions, and facial optical flow provides a dynamic and detailed representation of it. However, the scarcity of datasets and a modern baseline hinders the progress in facial optical flow research. This paper proposes FacialFlowNet (FFN), a novel large-scale facial optical flow dataset, and the Decomposed Facial Flow Model (DecFlow), the first method capable of decomposing facial flow. FFN comprises 9,635 identities and 105,970 image pairs, offering unprecedented diversity for detailed facial and head motion analysis. DecFlow features a facial semantic-aware encoder and a decomposed flow decoder, excelling in accurately estimating and decomposing facial flow into head and expression components. Comprehensive experiments demonstrate that FFN significantly enhances the accuracy of facial flow estimation across various optical flow methods, achieving up to an 11% reduction in Endpoint Error (EPE) (from 3.91 to 3.48). Moreover, DecFlow, when coupled with FFN, outperforms existing methods in both synthetic and real-world scenarios, enhancing facial expression analysis. The decomposed expression flow achieves a substantial accuracy improvement of 18% (from 69.1% to 82.1%) in micro-expressions recognition. These contributions represent a significant advancement in facial motion analysis and optical flow estimation. Codes and datasets can be found.
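DecFlow's central decomposition — facial flow as the sum of a rigid head component and a non-rigid expression component — can be illustrated with a toy flow field (the learned decoder is replaced here by a known head flow, so the expression part falls out by subtraction):

```python
import numpy as np

# Toy flow field: facial flow = rigid head flow + non-rigid expression
# flow. DecFlow learns this decomposition; here the head flow is known,
# so the expression component can be recovered by subtraction.
h, w = 4, 4
head_flow = np.tile(np.array([1.0, 0.5]), (h, w, 1))  # global rigid motion
expr_flow = np.zeros((h, w, 2))
expr_flow[1:3, 1:3] = [0.2, -0.1]                     # local mouth-region motion
facial_flow = head_flow + expr_flow

recovered_expr = facial_flow - head_flow              # expression component
```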

[CV-50] Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision Language Modeling

链接: https://arxiv.org/abs/2409.05395
作者: Georgios Pantazopoulos,Malvina Nikandrou,Alessandro Suglia,Oliver Lemon,Arash Eshghi
关键词-EN: Visual Language Models, Language Models, study explores replacing, recent structured state, structured state space
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve greater performance in visual grounding and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that a task-aware encoding yields minimal performance gains on grounding; however, Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba shows promising performance on tasks where the correct output relies on a summary of the image but struggles when retrieval of explicit information from the context is required.

[CV-51] TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation

链接: https://arxiv.org/abs/2409.05393
作者: Jiaqi Yang,Ye Huang,Xiangjian He,Linlin Shen,Guoping Qiu
关键词-EN: large visual models, demonstrated significant potential, large-scale pre-training, large visual, backdrop of large-scale
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Under the backdrop of large-scale pre-training, large visual models (LVM) have demonstrated significant potential in image understanding. The recent emergence of the Segment Anything Model (SAM) has brought a qualitative shift in the field of image segmentation, supporting flexible interactive cues and strong learning capabilities. However, its performance often falls short in cross-domain and few-shot applications. Transferring prior knowledge from foundation models to new applications while preserving learning capabilities is worth exploring. This work proposes a task-adaptive prompt framework based on SAM, a new paradigm for cross-domain few-shot segmentation (CD-FSS). First, a Multi-level Feature Fusion (MFF) was used for integrated feature extraction. Besides, an additional Class Domain Task-Adaptive Auto-Prompt (CDTAP) module was combined with the segmentation branch for class-domain agnostic feature extraction and high-quality learnable prompt production. This significant advancement uses a unique generative approach to prompts alongside a comprehensive model structure and specialized prototype computation. While ensuring that the prior knowledge of SAM is not discarded, the new branch disentangles category and domain information through prototypes, guiding it in adapting the CD-FSS. We have achieved the best results on three benchmarks compared to the recent state-of-the-art (SOTA) methods. Comprehensive experiments showed that after task-specific and weighted guidance, the abundant feature information of SAM can be better learned for CD-FSS.

[CV-52] A Novel Representation of Periodic Pattern and Its Application to Untrained Anomaly Detection

链接: https://arxiv.org/abs/2409.05389
作者: Peng Ye,Chengyu Tao,Juan Du
关键词-EN: carbon fiber textiles, possess periodic textures, textures or surfaces, display panels, periodic pattern
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There are a variety of industrial products that possess periodic textures or surfaces, such as carbon fiber textiles and display panels. Traditional image-based quality inspection methods for these products require identifying the periodic patterns from normal images (without anomaly and noise) and subsequently detecting anomaly pixels with inconsistent appearances. However, it remains challenging to accurately extract the periodic pattern from a single image in the presence of unknown anomalies and measurement noise. To deal with this challenge, this paper proposes a novel self-representation of the periodic image defined on a set of continuous parameters. In this way, periodic pattern learning can be embedded into a joint optimization framework, which is named periodic-sparse decomposition, with simultaneously modeling the sparse anomalies and Gaussian noise. Finally, for the real-world industrial images that may not strictly satisfy the periodic assumption, we propose a novel pixel-level anomaly scoring strategy to enhance the performance of anomaly detection. Both simulated and real-world case studies demonstrate the effectiveness of the proposed methodology for periodic pattern learning and anomaly detection.
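A one-dimensional toy analogue of periodic-sparse decomposition — using a robust per-phase median in place of the paper's joint optimization (an assumed simplification) — shows how the periodic pattern and the sparse anomalies separate:

```python
import numpy as np

def periodic_sparse_decompose(signal, period):
    """Toy 1-D analogue of periodic-sparse decomposition: fold the
    signal at `period` and take the per-phase median (robust to sparse
    anomalies) as the periodic pattern; large residuals are anomaly
    scores. The paper solves a richer joint optimization with
    continuous parameters and Gaussian noise modeling."""
    n = len(signal) // period * period
    folded = signal[:n].reshape(-1, period)
    pattern = np.median(folded, axis=0)
    residual = signal[:n] - np.tile(pattern, n // period)
    return pattern, np.abs(residual)

t = np.arange(60)
signal = np.sin(2 * np.pi * t / 12)  # period-12 texture
signal[25] += 3.0                    # one sparse anomaly pixel
pattern, score = periodic_sparse_decompose(signal, period=12)
```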

[CV-53] Decoupling Contact for Fine-Grained Motion Style Transfer

链接: https://arxiv.org/abs/2409.05387
作者: Xiangjun Tang,Linjun Wu,He Wang,Yiqian Wu,Bo Hu,Songnan Li,Xu Gong,Yuchen Liao,Qilong Kou,Xiaogang Jin
关键词-EN: Motion style transfer, style transfer, style, Motion, animations and games
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Motion style transfer changes the style of a motion while retaining its content and is useful in computer animations and games. Contact is an essential component of motion style transfer that should be controlled explicitly in order to express the style vividly while enhancing motion naturalness and quality. However, it is unknown how to decouple and control contact to achieve fine-grained control in motion style transfer. In this paper, we present a novel style transfer method for fine-grained control over contacts while achieving both motion naturalness and spatial-temporal variations of style. Based on our empirical evidence, we propose controlling contact indirectly through the hip velocity, which can be further decomposed into the trajectory and contact timing, respectively. To this end, we propose a new model that explicitly models the correlations between motions and trajectory/contact timing/style, allowing us to decouple and control each separately. Our approach is built around a motion manifold, where hip controls can be easily integrated into a Transformer-based decoder. It is versatile in that it can generate motions directly as well as be used as post-processing for existing methods to improve quality and contact controllability. In addition, we propose a new metric that measures a correlation pattern of motions based on our empirical evidence, aligning well with human perception in terms of motion naturalness. Based on extensive evaluation, our method outperforms existing methods in terms of style expressivity and motion quality.

[CV-54] Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition AAAI2020

链接: https://arxiv.org/abs/2409.05384
作者: Shiming Ge,Kangkai Zhang,Haolin Liu,Yingying Hua,Shengwei Zhao,Xin Jin,Hao Wen
关键词-EN: recent deep models, low accuracy due, directly applying, resolution degradation, low-resolution images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Accepted by AAAI 2020

点击查看摘要

Abstract:In spite of great success in many image recognition tasks achieved by recent deep models, directly applying them to recognize low-resolution images may suffer from low accuracy due to the loss of informative details during resolution degradation. However, these images are still recognizable for subjects who are familiar with the corresponding high-resolution ones. Inspired by that, we propose a teacher-student learning approach to facilitate low-resolution image recognition via hybrid order relational knowledge distillation. The approach refers to three streams: the teacher stream is pretrained to recognize high-resolution images in high accuracy, the student stream is learned to identify low-resolution images by mimicking the teacher’s behaviors, and the extra assistant stream is introduced as a bridge to help knowledge transfer across the teacher to the student. To extract sufficient knowledge for reducing the loss in accuracy, the learning of the student is supervised with multiple losses, which preserves the similarities in various order relational structures. In this way, the capability of recovering missing details of familiar low-resolution images can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on metric learning, low-resolution image classification and low-resolution face recognition tasks show the effectiveness of our approach, while using reduced models.
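The teacher-assistant-student chain can be sketched with standard soft-label distillation losses — plain KL divergence at a temperature. This is generic KD; the paper's hybrid order relational losses add structure on top of this primitive:

```python
import numpy as np

def softmax(z, T):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, T=4.0):
    """Soft-label distillation loss: KL(teacher || student) over
    temperature-softened distributions (plain KD; the paper's hybrid
    order relational losses are richer than this)."""
    p, q = softmax(teacher_logits, T), softmax(student_logits, T)
    return float(np.sum(p * np.log(p / q)))

# The assistant stream bridges the HR teacher and the LR student.
teacher, assistant, student = [8.0, 1.0, 0.5], [6.0, 1.5, 1.0], [4.0, 2.0, 1.5]
bridge_loss = kd_loss(teacher, assistant) + kd_loss(assistant, student)
```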

[CV-55] Deep Learning for Video Anomaly Detection: A Review

链接: https://arxiv.org/abs/2409.05383
作者: Peng Wu,Chengyu Pan,Yuting Yan,Guansong Pang,Peng Wang,Yanning Zhang
关键词-EN: Video anomaly detection, Video anomaly, VAD, aims to discover, discover behaviors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Video anomaly detection (VAD) aims to discover behaviors or events deviating from the normality in videos. As a long-standing task in the field of computer vision, VAD has witnessed much good progress. In the era of deep learning, with the explosion of architectures of continuously growing capability and capacity, a great variety of deep learning based methods are constantly emerging for the VAD task, greatly improving the generalization ability of detection algorithms and broadening the application scenarios. Therefore, such a multitude of methods and a large body of literature make a comprehensive survey a pressing necessity. In this paper, we present an extensive and comprehensive research review, covering the spectrum of five different categories, namely, semi-supervised, weakly supervised, fully supervised, unsupervised and open-set supervised VAD, and we also delve into the latest VAD works based on pre-trained large models, remedying the limitations of past reviews in terms of only focusing on semi-supervised VAD and small model based methods. For the VAD task with different levels of supervision, we construct a well-organized taxonomy, profoundly discuss the characteristics of different types of methods, and show their performance comparisons. In addition, this review involves the public datasets, open-source codes, and evaluation metrics covering all the aforementioned VAD tasks. Finally, we provide several important research directions for the VAD community.

[CV-56] Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization

链接: https://arxiv.org/abs/2409.05381
作者: Xudong Li,Zihao Huang,Runze Hu,Yan Zhang,Liujuan Cao,Rongrong Ji
关键词-EN: Image Quality Assessment, diverse image content, Quality Assessment, Image Quality, diverse image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image Quality Assessment (IQA) remains an unresolved challenge in the field of computer vision, due to complex distortion conditions, diverse image content, and limited data availability. The existing Blind IQA (BIQA) methods heavily rely on extensive human annotations to train models, which is both labor-intensive and costly due to the demanding nature of creating IQA datasets. To mitigate the dependence on labeled samples, this paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA). This framework aims to fast adapt the powerful visual-language pre-trained model, CLIP, to downstream IQA tasks, significantly improving accuracy in scenarios with limited data. Specifically, the GRMP-IQA comprises two key modules: Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization. The Meta Prompt Pre-training Module leverages a meta-learning paradigm to pre-train soft prompts with shared meta-knowledge across different distortions, enabling rapid adaptation to various IQA tasks. On the other hand, the Quality-Aware Gradient Regularization is designed to adjust the update gradients during fine-tuning, focusing the model’s attention on quality-relevant features and preventing overfitting to semantic information. Extensive experiments on five standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods under the limited-data setting, i.e., achieving SRCC values of 0.836 (vs. 0.760 on LIVEC) and 0.853 (vs. 0.812 on KonIQ). Notably, utilizing just 20% of the training data, our GRMP-IQA outperforms most existing fully supervised BIQA methods.
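Quality-Aware Gradient Regularization adjusts update gradients to damp directions that overfit semantics. A hypothetical one-line version of that idea — shrinking the gradient component along an assumed "semantic" direction; the paper's criterion for identifying that direction is learned, not fixed:

```python
import numpy as np

def regulate_gradient(grad, semantic_dir, keep=0.1):
    """Hypothetical gradient regularization: shrink the component of
    the fine-tuning gradient along an assumed 'semantic' direction so
    updates focus on quality-relevant features (the paper's criterion
    for that direction is learned, not fixed)."""
    d = semantic_dir / np.linalg.norm(semantic_dir)
    along = np.dot(grad, d) * d          # semantic component
    return grad - (1.0 - keep) * along   # keep only 10% of it

grad = np.array([3.0, 4.0])              # raw update gradient
semantic = np.array([1.0, 0.0])          # assumed semantic axis
reg = regulate_gradient(grad, semantic)  # quality axis left untouched
```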

[CV-57] Prim2Room: Layout-Controllable Room Mesh Generation from Primitives

链接: https://arxiv.org/abs/2409.05380
作者: Chengzeng Feng,Jiacheng Wei,Cheng Chen,Yang Li,Pan Ji,Fayao Liu,Hongdong Li,Guosheng Lin
关键词-EN: mesh generation leveraging, layout specification, controllable room mesh, room mesh generation, layout conditions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose Prim2Room, a novel framework for controllable room mesh generation leveraging 2D layout conditions and 3D primitive retrieval to facilitate precise 3D layout specification. Diverging from existing methods that lack control and precision, our approach allows for detailed customization of room-scale environments. To overcome the limitations of previous methods, we introduce an adaptive viewpoint selection algorithm that allows the system to generate the furniture texture and geometry from more favorable views than predefined camera trajectories. Additionally, we employ non-rigid depth registration to ensure alignment between generated objects and their corresponding primitive while allowing for shape variations to maintain diversity. Our method not only enhances the accuracy and aesthetic appeal of generated 3D scenes but also provides a user-friendly platform for detailed room design.

[CV-58] PersonaTalk: Bring Attention to Your Persona in Visual Dubbing SIGGRAPH

链接: https://arxiv.org/abs/2409.05379
作者: Longhao Zhang,Shuang Liang,Zhipeng Ge,Tianshu Hu
关键词-EN: accurate lip synchronization, synthesizing accurate lip, audio-driven visual dubbing, lip synchronization, remains a considerable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Accepted at SIGGRAPH Asia 2024 (Conference Track)

点击查看摘要

Abstract:For audio-driven visual dubbing, it remains a considerable challenge to uphold and highlight speaker’s persona while synthesizing accurate lip synchronization. Existing methods fall short of capturing speaker’s unique speaking style or preserving facial details. In this paper, we present PersonaTalk, an attention-based two-stage framework, including geometry construction and face rendering, for high-fidelity and personalized visual dubbing. In the first stage, we propose a style-aware audio encoding module that injects speaking style into audio features through a cross-attention layer. The stylized audio features are then used to drive speaker’s template geometry to obtain lip-synced geometries. In the second stage, a dual-attention face renderer is introduced to render textures for the target geometries. It consists of two parallel cross-attention layers, namely Lip-Attention and Face-Attention, which respectively sample textures from different reference frames to render the entire face. With our innovative design, intricate facial details can be well preserved. Comprehensive experiments and user studies demonstrate our advantages over other state-of-the-art methods in terms of visual quality, lip-sync accuracy and persona preservation. Furthermore, as a person-generic framework, PersonaTalk can achieve competitive performance as state-of-the-art person-specific methods. Project Page: this https URL.
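Both Lip-Attention and Face-Attention sample reference textures via cross-attention. The shared primitive, single-head scaled dot-product cross-attention, can be sketched as follows (the generic mechanism only; PersonaTalk's renderer stacks two such layers over different reference frames):

```python
import numpy as np

def cross_attention(query, keys, values):
    """Single-head scaled dot-product cross-attention: queries from
    the target geometry attend over reference-frame keys to sample
    texture values (the generic primitive, not the paper's exact
    renderer)."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # rows are attention distributions
    return w @ values, w

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8))   # lip-synced geometry queries
k = rng.normal(size=(5, 8))   # reference-frame keys
v = rng.normal(size=(5, 8))   # reference textures
out, w = cross_attention(q, k, v)
```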

[CV-59] Memoryless Multimodal Anomaly Detection via Student-Teacher Network and Signed Distance Learning

链接: https://arxiv.org/abs/2409.05378
作者: Zhongbin Sun,Xiaolong Li,Yiran Li,Yue Ma
关键词-EN: Unsupervised anomaly detection, computer vision task, challenging computer vision, Unsupervised anomaly, anomaly detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 4 figures, 2 tables, to be published in PRCV-2024

点击查看摘要

Abstract:Unsupervised anomaly detection is a challenging computer vision task, in which 2D-based anomaly detection methods have been extensively studied. However, multimodal anomaly detection based on RGB images and 3D point clouds requires further investigation. The existing methods are mainly inspired by memory bank based methods commonly used in 2D-based anomaly detection, which may cost extra memory for storing multimodal features. In the present study, a novel memoryless method, MDSS, is proposed for multimodal anomaly detection, which employs a lightweight student-teacher network and a signed distance function to learn from RGB images and 3D point clouds respectively, and complements the anomaly information from the two modalities. Specifically, a student-teacher network is trained with normal RGB images and masks generated from point clouds by a dynamic loss, and the anomaly score map could be obtained from the discrepancy between the output of student and teacher. Furthermore, the signed distance function learns from normal point clouds to predict the signed distances between points and surface, and the obtained signed distances are used to generate anomaly score map. Subsequently, the anomaly score maps are aligned to generate the final anomaly score map for detection. The experimental results indicate that MDSS is comparable but more stable than the SOTA memory bank based method Shape-guided, and furthermore performs better than other baseline methods.

[CV-60] KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models

链接: https://arxiv.org/abs/2409.05370
作者: Yingshu Li,Zhanyu Wang,Yunyi Liu,Lei Wang,Lingqiao Liu,Luping Zhou
关键词-EN: Large Language Models, Harnessing the robust, Large Language, Language Models, automated radiology report
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Harnessing the robust capabilities of Large Language Models (LLMs) for narrative generation, logical reasoning, and common-sense knowledge integration, this study delves into utilizing LLMs to enhance automated radiology report generation (R2Gen). Despite the wealth of knowledge within LLMs, efficiently triggering relevant knowledge within these large models for specific tasks like R2Gen poses a critical research challenge. This paper presents KARGEN, a Knowledge-enhanced Automated radiology Report GENeration framework based on LLMs. Utilizing a frozen LLM to generate reports, the framework integrates a knowledge graph to unlock chest disease-related knowledge within the LLM to enhance the clinical utility of generated reports. This is achieved by leveraging the knowledge graph to distill disease-related features in a designed way. Since a radiology report encompasses both normal and disease-related findings, the extracted graph-enhanced disease-related features are integrated with regional image features, attending to both aspects. We explore two fusion methods to automatically prioritize and select the most relevant features. The fused features are employed by LLM to generate reports that are more sensitive to diseases and of improved quality. Our approach demonstrates promising results on the MIMIC-CXR and IU-Xray datasets.

[CV-61] FedBrain-Distill: Communication-Efficient Federated Brain Tumor Classification Using Ensemble Knowledge Distillation on Non-IID Data

链接: https://arxiv.org/abs/2409.05359
作者: Rasoul Jafari Gohari,Laya Aliahmadipour,Ezat Valipour
关键词-EN: Magnetic Resonance Imaging, human body, complex organs, making brain tumors, Figshare brain tumor
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The brain is one of the most complex organs in the human body. Due to this complexity, the classification of brain tumors still poses a significant challenge, making brain tumors a particularly serious medical issue. Techniques such as Machine Learning (ML) coupled with Magnetic Resonance Imaging (MRI) have paved the way for doctors and medical institutions to classify different types of tumors. However, these techniques suffer from limitations that violate patients' privacy. Federated Learning (FL) has recently been introduced to solve this issue, but FL itself suffers from limitations such as communication costs and dependencies on model architecture, forcing all models to have identical architectures. In this paper, we propose FedBrain-Distill, an approach that leverages Knowledge Distillation (KD) in an FL setting that maintains the users' privacy and ensures the independence of FL clients in terms of model architecture. FedBrain-Distill uses an ensemble of teachers that distill their knowledge to a simple student model. The evaluation of FedBrain-Distill demonstrated high-accuracy results for both Independent and Identically Distributed (IID) and non-IID data with substantially low communication costs on the real-world Figshare brain tumor dataset. It is worth mentioning that we used the Dirichlet distribution to partition the data into IID and non-IID splits. All the implementation details are accessible through our Github repository.
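摘要提到用 Dirichlet 分布把数据划分为 IID 与 non-IID 客户端,这是联邦学习实验中常见的做法。下面是该划分思路的一个示意(alpha 与标签均为假设值,并非论文的实际配置):

```python
import numpy as np

# Sketch of partitioning sample labels across FL clients with a Dirichlet
# prior; smaller alpha makes each client's class mix more skewed (non-IID).

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Return a list of index lists, one per client; each sample appears once."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # proportions of this class going to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_idx, np.split(idx, cuts)):
            client.extend(part.tolist())
    return client_idx

labels = np.array([0] * 50 + [1] * 50)
parts = dirichlet_partition(labels, n_clients=4, alpha=0.5)
print(sum(len(p) for p in parts))  # 100: every sample assigned exactly once
```

alpha 取大值(如 100)时各客户端的类别比例趋于一致,近似 IID;取小值(如 0.1)则高度倾斜。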

[CV-62] Driving with Prior Maps: Unified Vector Prior Encoding for Autonomous Vehicle Mapping

链接: https://arxiv.org/abs/2409.05352
作者: Shuang Zeng,Xinyuan Chang,Xinran Liu,Zheng Pan,Xing Wei
关键词-EN: upkeep present significant, present significant cost, Maps, prior maps, Standard Definition Maps
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:High-Definition Maps (HD maps) are essential for the precise navigation and decision-making of autonomous vehicles, yet their creation and upkeep present significant cost and timeliness challenges. The online construction of HD maps using on-board sensors has emerged as a promising solution; however, these methods can be impeded by incomplete data due to occlusions and inclement weather. This paper proposes the PriorDrive framework to address these limitations by harnessing the power of prior maps, significantly enhancing the robustness and accuracy of online HD map construction. Our approach integrates a variety of prior maps, such as OpenStreetMap’s Standard Definition Maps (SD maps), outdated HD maps from vendors, and locally constructed maps from historical vehicle data. To effectively encode this prior information into online mapping models, we introduce a Hybrid Prior Representation (HPQuery) that standardizes the representation of diverse map elements. At the core of PriorDrive is the Unified Vector Encoder (UVE), which employs a dual encoding mechanism to process vector data. The intra-vector encoder captures fine-grained local features, while the inter-vector encoder integrates global context. Furthermore, we propose a segment-level and point-level pre-training strategy that enables the UVE to learn the prior distribution of vector data, thereby improving the encoder’s generalizability and performance. Through extensive testing on the nuScenes dataset, we demonstrate that PriorDrive is highly compatible with various online mapping models and substantially improves map prediction capabilities. The integration of prior maps through the PriorDrive framework offers a robust solution to the challenges of single-perception data, paving the way for more reliable autonomous vehicle navigation.

[CV-63] Early-exit Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.05336
作者: Edanur Demir,Emre Akbas
关键词-EN: convolutional neural networks, computational cost, aimed at developing, developing a method, method that reduces
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper is aimed at developing a method that reduces the computational cost of convolutional neural networks (CNN) during inference. Conventionally, the input data pass through a fixed neural network architecture. However, easy examples can be classified at early stages of processing and conventional networks do not take this into account. In this paper, we introduce ‘Early-exit CNNs’, EENets for short, which adapt their computational cost based on the input by stopping the inference process at certain exit locations. In EENets, there are a number of exit blocks each of which consists of a confidence branch and a softmax branch. The confidence branch computes the confidence score of exiting (i.e. stopping the inference process) at that location; while the softmax branch outputs a classification probability vector. Both branches are learnable and their parameters are separate. During training of EENets, in addition to the classical classification loss, the computational cost of inference is taken into account as well. As a result, the network adapts its many confidence branches to the inputs so that less computation is spent for easy examples. Inference works as in conventional feed-forward networks, however, when the output of a confidence branch is larger than a certain threshold, the inference stops for that specific example. The idea of EENets is applicable to available CNN architectures such as ResNets. Through comprehensive experiments on MNIST, SVHN, CIFAR10 and Tiny-ImageNet datasets, we show that early-exit (EE) ResNets achieve similar accuracy with their non-EE versions while reducing the computational cost to 20% of the original. Code is available at this https URL
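摘要描述的提前退出机制可以概括为:在若干退出点计算置信度,一旦超过阈值即停止推理。下面是该控制流程的一个极简示意(置信度数值为假设,并非 EENets 的真实实现):

```python
# Minimal sketch of the early-exit inference described above (hypothetical
# confidence values; not the paper's actual EENet implementation).

def early_exit_inference(stage_outputs, threshold=0.5):
    """stage_outputs: list of (confidence, class_probs) per exit block.
    Returns (class_probs, exit_index): stop at the first exit whose
    confidence exceeds the threshold; otherwise use the final output."""
    for i, (conf, probs) in enumerate(stage_outputs):
        if conf > threshold:
            return probs, i          # easy example: stop early, save compute
    return stage_outputs[-1][1], len(stage_outputs) - 1

# An "easy" input is confidently classified at the first exit block:
easy = [(0.9, [0.95, 0.05]), (0.99, [0.97, 0.03])]
# A "hard" input passes through all exit blocks:
hard = [(0.2, [0.6, 0.4]), (0.3, [0.55, 0.45])]

print(early_exit_inference(easy)[1])   # 0
print(early_exit_inference(hard)[1])   # 1
```

训练时每个退出点的置信度分支与 softmax 分支都是可学习的,并把推理代价计入损失;上面只示意了推理阶段的阈值判断。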

[CV-64] A Multi-Modal Deep Learning Based Approach for House Price Prediction

链接: https://arxiv.org/abs/2409.05335
作者: Md Hasebul Hasan,Md Abid Jahan,Mohammed Eunus Ali,Yuan-Fang Li,Timos Sellis
关键词-EN: house, house price, house price prediction, real estate sector, residential real estate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Accurate prediction of house price, a vital aspect of the residential real estate sector, is of substantial interest for a wide range of stakeholders. However, predicting house prices is a complex task due to the significant variability influenced by factors such as house features, location, neighborhood, and many others. Despite numerous attempts utilizing a wide array of algorithms, including recent deep learning techniques, to predict house prices accurately, existing approaches have fallen short of considering a wide range of factors such as textual and visual features. This paper addresses this gap by comprehensively incorporating attributes, such as features, textual descriptions, geo-spatial neighborhood, and house images, typically showcased in real estate listings in a house price prediction system. Specifically, we propose a multi-modal deep learning approach that leverages different types of data to learn more accurate representation of the house. In particular, we learn a joint embedding of raw house attributes, geo-spatial neighborhood, and most importantly from textual description and images representing the house; and finally use a downstream regression model to predict the house price from this jointly learned embedding vector. Our experimental results with a real-world dataset show that the text embedding of the house advertisement description and image embedding of the house pictures in addition to raw attributes and geo-spatial embedding, can significantly improve the house price prediction accuracy. The relevant source code and dataset are publicly accessible at the following URL: this https URL

[CV-65] Lagrangian Hashing for Compressed Neural Field Representations

链接: https://arxiv.org/abs/2409.05334
作者: Shrisudhan Govindarajan,Zeno Sambugaro,Akhmedkhan (Ahan) Shabanov,Towaki Takikawa,Daniel Rebain,Weiwei Sun,Nicola Conci,Kwang Moo Yi,Andrea Tagliasacchi
关键词-EN: present Lagrangian Hashing, Lagrangian Hashing, fast training NeRF, training NeRF methods, Eulerian grids
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present Lagrangian Hashing, a representation for neural fields combining the characteristics of fast training NeRF methods that rely on Eulerian grids (i.e.~InstantNGP), with those that employ points equipped with features as a way to represent information (e.g. 3D Gaussian Splatting or PointNeRF). We achieve this by incorporating a point-based representation into the high-resolution layers of the hierarchical hash tables of an InstantNGP representation. As our points are equipped with a field of influence, our representation can be interpreted as a mixture of Gaussians stored within the hash table. We propose a loss that encourages the movement of our Gaussians towards regions that require more representation budget to be sufficiently well represented. Our main finding is that our representation allows the reconstruction of signals using a more compact representation without compromising quality.

[CV-66] KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation

链接: https://arxiv.org/abs/2409.05330
作者: Hoang-Son Vo-Thanh,Quang-Vinh Nguyen,Soo-Hyung Kim
关键词-EN: widely researched topic, researched topic due, Audio-driven talking face, talking face generation, talking face
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Audio-driven talking face generation is a widely researched topic due to its high applicability. Reconstructing a talking face using audio significantly contributes to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality. Early studies often focused solely on changing the mouth movements, which resulted in outcomes with limited practical applications. Recently, researchers have proposed a new approach of constructing the entire face, including face pose, neck, and shoulders. To achieve this, they need to generate through landmarks. However, creating stable landmarks that align well with the audio is a challenge. In this paper, we propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio. We separate the audio into two distinct domains to learn emotional information and facial context, then use a fusion mechanism based on the KAN model. Our model demonstrates high efficiency compared to recent models. This will lay the groundwork for the development of the audio-driven talking face generation problem in the future.

[CV-67] ICPR 2024 Competition on Safe Segmentation of Drive Scenes in Unstructured Traffic and Adverse Weather Conditions ICPR

链接: https://arxiv.org/abs/2409.05327
作者: Furqan Ahmed Shaik,Sandeep Nagar,Aiswarya Maturi,Harshit Kumar Sankhla,Dibyendu Ghosh,Anshuman Majumdar,Srikanth Vidapanakal,Kunal Chaudhary,Sunny Manchanda,Girish Varma
关键词-EN: Adverse Weather Conditions, Weather Conditions served, Adverse Weather, Weather Conditions, Scenes in Unstructured
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures, ICPR Competition Paper

点击查看摘要

Abstract:The ICPR 2024 Competition on Safe Segmentation of Drive Scenes in Unstructured Traffic and Adverse Weather Conditions served as a rigorous platform to evaluate and benchmark state-of-the-art semantic segmentation models under challenging conditions for autonomous driving. Over several months, participants were provided with the IDD-AW dataset, consisting of 5000 high-quality RGB-NIR image pairs, each annotated at the pixel level and captured under adverse weather conditions such as rain, fog, low light, and snow. A key aspect of the competition was the use and improvement of the Safe mean Intersection over Union (Safe mIoU) metric, designed to penalize unsafe incorrect predictions that could be overlooked by traditional mIoU. This innovative metric emphasized the importance of safety in developing autonomous driving systems. The competition showed significant advancements in the field, with participants demonstrating models that excelled in semantic segmentation and prioritized safety and robustness in unstructured and adverse conditions. The results of the competition set new benchmarks in the domain, highlighting the critical role of safety in deploying autonomous vehicles in real-world scenarios. The contributions from this competition are expected to drive further innovation in autonomous driving technology, addressing the critical challenges of operating in diverse and unpredictable environments.
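竞赛指标 Safe mIoU 在标准 mIoU 的基础上对安全相关的错误预测额外加罚。下面给出标准逐类 IoU/mIoU 的参考实现(Safe mIoU 的具体加权方式由竞赛定义,此处不做复现):

```python
import numpy as np

# Plain per-class IoU and mIoU for reference; the competition's Safe mIoU
# additionally penalizes safety-critical confusions on top of this.

def miou(pred, gt, n_classes):
    """Mean IoU over classes present in either pred or gt."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(round(miou(pred, gt, 2), 3))  # class 0: 1/2, class 1: 2/3 -> 0.583
```

直观上,Safe mIoU 的动机是:把行人误判为路面这类错误,其代价远高于普通的边界误差,指标应当反映这一差别。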

[CV-68] FIF-UNet: An Efficient UNet Using Feature Interaction and Fusion for Medical Image Segmentation

链接: https://arxiv.org/abs/2409.05324
作者: Xiaolin Gou,Chuanlin Liao,Jizhe Zhou,Fengshuo Ye,Yi Lin
关键词-EN: medical image segmentation, capture complex feature, medical image, ability to capture, capture complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Nowadays, pre-trained encoders are widely used in medical image segmentation because of their ability to capture complex feature representations. However, the existing models fail to effectively utilize the rich features obtained by the pre-trained encoder, resulting in suboptimal segmentation results. In this work, a novel U-shaped model, called FIF-UNet, is proposed to address the above issue, including three plug-and-play modules. A channel spatial interaction module (CSI) is proposed to obtain informative features by establishing the interaction between encoder stages and corresponding decoder stages. A cascaded conv-SE module (CoSE) is designed to enhance the representation of critical features by adaptively assigning importance weights on different feature channels. A multi-level fusion module (MLF) is proposed to fuse the multi-scale features from the decoder stages, ensuring accurate and robust final segmentation. Comprehensive experiments on the Synapse and ACDC datasets demonstrate that the proposed FIF-UNet outperforms existing state-of-the-art methods, achieving the highest average DICE scores of 86.05% and 92.58%, respectively.

[CV-69] Open-World Dynamic Prompt and Continual Visual Representation Learning ECCV2024

链接: https://arxiv.org/abs/2409.05312
作者: Youngeun Kim,Jun Fang,Qin Zhang,Zhaowei Cai,Yantao Shen,Rahul Duggal,Dripta S. Raychaudhuri,Zhuowen Tu,Yifan Xing,Onkar Dabeer
关键词-EN: characterized by ever-evolving, concepts and distributions, open world, world is inherently, ever-evolving concepts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:The open world is inherently dynamic, characterized by ever-evolving concepts and distributions. Continual learning (CL) in this dynamic open-world environment presents a significant challenge in effectively generalizing to unseen test-time classes. To address this challenge, we introduce a new practical CL setting tailored for open-world visual representation learning. In this setting, subsequent data streams systematically introduce novel classes that are disjoint from those seen in previous training phases, while also remaining distinct from the unseen test classes. In response, we present Dynamic Prompt and Representation Learner (DPaRL), a simple yet effective Prompt-based CL (PCL) method. Our DPaRL learns to generate dynamic prompts for inference, as opposed to relying on a static prompt pool in previous PCL methods. In addition, DPaRL jointly learns dynamic prompt generation and discriminative representation at each training stage whereas prior PCL methods only refine the prompt learning throughout the process. Our experimental results demonstrate the superiority of our approach, surpassing state-of-the-art methods on well-established open-world image retrieval benchmarks by an average of 4.7% improvement in Recall@1 performance.

[CV-70] Fitting Skeletal Models via Graph-based Learning

链接: https://arxiv.org/abs/2409.05311
作者: Nicolás Gaggion,Enzo Ferrante,Beatriz Paniagua,Jared Vicory
关键词-EN: popular shape analysis, shape analysis technique, popular shape, shape analysis, analysis technique
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was presented at the 2024 IEEE International Symposium on Biomedical Imaging (ISBI)

点击查看摘要

Abstract:Skeletonization is a popular shape analysis technique that models an object’s interior as opposed to just its boundary. Fitting template-based skeletal models is a time-consuming process requiring much manual parameter tuning. Recently, machine learning-based methods have shown promise for generating s-reps from object boundaries. In this work, we propose a new skeletonization method which leverages graph convolutional networks to produce skeletal representations (s-reps) from dense segmentation masks. The method is evaluated on both synthetic data and real hippocampus segmentations, achieving promising results and fast inference.

[CV-71] Neural Surface Reconstruction and Rendering for LiDAR-Visual Systems

链接: https://arxiv.org/abs/2409.05310
作者: Jianheng Liu,Chunran Zheng,Yunfei Wan,Bowen Wang,Yixi Cai,Fu Zhang
关键词-EN: Neural Radiance Fields, Neural Distance Fields, integrating Neural Radiance, Radiance Fields, Distance Fields
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a unified surface reconstruction and rendering framework for LiDAR-visual systems, integrating Neural Radiance Fields (NeRF) and Neural Distance Fields (NDF) to recover both appearance and structural information from posed images and point clouds. We address the structural visible gap between NeRF and NDF by utilizing a visible-aware occupancy map to classify space into the free, occupied, visible unknown, and background regions. This classification facilitates the recovery of a complete appearance and structure of the scene. We unify the training of the NDF and NeRF using a spatial-varying scale SDF-to-density transformation for levels of detail for both structure and appearance. The proposed method leverages the learned NDF for structure-aware NeRF training by an adaptive sphere tracing sampling strategy for accurate structure rendering. In return, NeRF further refines the NDF by recovering missing or fuzzy structures. Extensive experiments demonstrate the superior quality and versatility of the proposed method across various scenarios. To benefit the community, the codes will be released at this https URL.

[CV-72] RAL: Redundancy-Aware Lipreading Model Based on Differential Learning with Symmetric Views

链接: https://arxiv.org/abs/2409.05307
作者: Zejun Gu,Junxia Jiang
关键词-EN: reading involves interpreting, Lip reading involves, reading involves, involves interpreting, interpreting a speaker
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Lip reading involves interpreting a speaker’s speech by analyzing sequences of lip movements. Currently, most models regard the left and right halves of the lips as a symmetrical whole, lacking a thorough investigation of their differences. However, the left and right halves of the lips are not always symmetrical, and the subtle differences between them contain rich semantic information. In this paper, we propose a differential learning strategy with symmetric views (DLSV) to address this issue. Additionally, input images often contain a lot of redundant information unrelated to recognition results, which can degrade the model’s performance. We present a redundancy-aware operation (RAO) to reduce it. Finally, to leverage the relational information between symmetric views and within each view, we further design an adaptive cross-view interaction module (ACVI). Experiments on LRW and LRW-1000 datasets fully demonstrate the effectiveness of our approach.

[CV-73] RotCAtt-TransUNet++: Novel Deep Neural Network for Sophisticated Cardiac Segmentation

链接: https://arxiv.org/abs/2409.05280
作者: Quoc-Bao Nguyen-Le,Tuan-Hy Le,Anh-Triet Do,Quoc-Huy Trinh
关键词-EN: Cardiovascular disease, global health concern, major global health, health concern, Cardiovascular
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:Cardiovascular disease is a major global health concern, contributing significantly to global mortality. Accurately segmenting cardiac medical imaging data is crucial for reducing fatality rates associated with these conditions. However, current state-of-the-art (SOTA) neural networks, including CNN-based and Transformer-based approaches, face challenges in capturing both inter-slice connections and intra-slice details, especially in datasets featuring intricate, long-range details along the z-axis like coronary arteries. Existing methods also struggle with differentiating non-cardiac components from the myocardium, resulting in segmentation inaccuracies and the “spraying” phenomenon. To address these issues, we introduce RotCAtt-TransUNet++, a novel architecture designed for robust segmentation of intricate cardiac structures. Our approach enhances global context modeling through multiscale feature aggregation and nested skip connections in the encoder. Transformer layers facilitate capturing intra-slice interactions, while a rotatory attention mechanism handles inter-slice connectivity. A channel-wise cross-attention gate integrates multiscale information and decoder features, effectively bridging semantic gaps. Experimental results across multiple datasets demonstrate superior performance over current methods, achieving near-perfect annotation of coronary arteries and myocardium. Ablation studies confirm that our rotatory attention mechanism significantly improves segmentation accuracy by transforming embedded vectorized patches in semantic dimensional space.

[CV-74] BrainDecoder: Style-Based Visual Decoding of EEG Signals

链接: https://arxiv.org/abs/2409.05279
作者: Minsuk Choi,Hiroshi Ishikawa
关键词-EN: offers valuable insights, visual stimuli, Decoding neural representations, visual decoding, offers valuable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Decoding neural representations of visual stimuli from electroencephalography (EEG) offers valuable insights into brain activity and cognition. Recent advancements in deep learning have significantly enhanced the field of visual decoding of EEG, primarily focusing on reconstructing the semantic content of visual stimuli. In this paper, we present a novel visual decoding pipeline that, in addition to recovering the content, emphasizes the reconstruction of the style, such as color and texture, of images viewed by the subject. Unlike previous methods, this "style-based" approach learns in the CLIP spaces of image and text separately, facilitating a more nuanced extraction of information from EEG signals. We also use simpler captions for text alignment than previously employed, which we find work better. Both quantitative and qualitative evaluations show that our method better preserves the style of visual stimuli and extracts more fine-grained semantic information from neural signals. Notably, it achieves significant improvements in quantitative results and sets a new state-of-the-art on the popular Brain2Image dataset.

[CV-75] Disentangled Representations for Short-Term and Long-Term Person Re-Identification

链接: https://arxiv.org/abs/2409.05277
作者: Chanho Eom,Wonkyung Lee,Geon Lee,Bumsub Ham
关键词-EN: person, retrieving person images, features, unrelated features, person images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:1910.12003

点击查看摘要

Abstract:We address the problem of person re-identification (reID), that is, retrieving person images from a large dataset, given a query image of the person of interest. A key challenge is to learn person representations robust to intra-class variations, as different persons could have the same attribute, and persons’ appearances look different, e.g., with viewpoint changes. Recent reID methods focus on learning person features discriminative only for a particular factor of variations (e.g., human pose), which also requires corresponding supervisory signals (e.g., pose annotations). To tackle this problem, we propose to factorize person images into identity-related and unrelated features. Identity-related features contain information useful for specifying a particular person (e.g., clothing), while identity-unrelated ones hold other factors (e.g., human pose). To this end, we propose a new generative adversarial network, dubbed identity shuffle GAN (IS-GAN). It disentangles identity-related and unrelated features from person images through an identity-shuffling technique that exploits identification labels alone without any auxiliary supervisory signals. We restrict the distribution of identity-unrelated features or encourage the identity-related and unrelated features to be uncorrelated, facilitating the disentanglement process. Experimental results validate the effectiveness of IS-GAN, showing state-of-the-art performance on standard reID benchmarks, including Market-1501, CUHK03, and DukeMTMC-reID. We further demonstrate the advantages of disentangling person representations on a long-term reID task, setting a new state of the art on a Celeb-reID dataset.

[CV-76] Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space

链接: https://arxiv.org/abs/2409.05260
作者: Junho Lee,Jeongwoo Shin,Seung Woo Ko,Seongsu Ha,Joonseok Lee
关键词-EN: fixed video classifier, video classifier, fixed video, semi-optimal policy, search space
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Given a video with $T$ frames, frame sampling is the task of selecting $N \ll T$ frames so as to maximize the performance of a fixed video classifier. Not just brute-force search, but most existing methods suffer from the vast search space of $\binom{T}{N}$, especially when $N$ gets large. To address this challenge, we introduce a novel perspective of reducing the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
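半最优策略的核心是:不搜索全部 $\binom{T}{N}$ 个帧子集,而是对每帧独立打分后取前 $N$ 帧。下面是该选择步骤的一个示意(逐帧置信度为假设数值,论文中由模型估计):

```python
# Hedged sketch of the semi-optimal policy: score each frame independently
# and keep the top N, an O(T log T) approximation of the O(T^N) search.
# The per-frame confidences below are made-up stand-ins.

def semi_optimal_sample(frame_scores, n):
    """Select indices of the top-n frames by per-frame score,
    returned in temporal order."""
    top = sorted(range(len(frame_scores)),
                 key=lambda i: frame_scores[i], reverse=True)[:n]
    return sorted(top)

scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]   # T = 6 hypothetical confidences
print(semi_optimal_sample(scores, 3))      # [1, 3, 5]
```

这种独立打分忽略了帧间的组合效应,摘要的主要论证正是该近似在实际设定下足够接近最优策略。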

[CV-77] Towards Automated Machine Learning Research

链接: https://arxiv.org/abs/2409.05258
作者: Shervin Ardeshir
关键词-EN: Large Language Models, Large Language, automating incremental advances, machine learning research, facilitated by Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper explores a top-down approach to automating incremental advances in machine learning research through component-level innovation, facilitated by Large Language Models (LLMs). Our framework systematically generates novel components, validates their feasibility, and evaluates their performance against existing baselines. A key distinction of this approach lies in how these novel components are generated. Unlike traditional AutoML and NAS methods, which often rely on a bottom-up combinatorial search over predefined, hardcoded base components, our method leverages the cross-domain knowledge embedded in LLMs to propose new components that may not be confined to any hard-coded predefined set. By incorporating a reward model to prioritize promising hypotheses, we aim to improve the efficiency of the hypothesis generation and evaluation process. We hope this approach offers a new avenue for exploration and contributes to the ongoing dialogue in the field.

[CV-78] MRStyle: A Unified Framework for Color Style Transfer with Multi-Modality Reference

链接: https://arxiv.org/abs/2409.05250
作者: Jiancheng Huang,Yu Gao,Zequn Jie,Yujie Zhong,Xintong Han,Lin Ma
关键词-EN: introduce MRStyle, enables color style, comprehensive framework, framework that enables, multi-modality reference
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce MRStyle, a comprehensive framework that enables color style transfer using multi-modality reference, including image and text. To achieve a unified style feature space for both modalities, we first develop a neural network called IRStyle, which generates stylized 3D lookup tables for image reference. This is accomplished by integrating an interaction dual-mapping network with a combined supervised learning pipeline, resulting in three key benefits: elimination of visual artifacts, efficient handling of high-resolution images with low memory usage, and maintenance of style consistency even in situations with significant color style variations. For text reference, we align the text feature of stable diffusion priors with the style feature of our IRStyle to perform text-guided color style transfer (TRStyle). Our TRStyle method is highly efficient in both training and inference, producing notable open-set text-guided transfer results. Extensive experiments in both image and text settings demonstrate that our proposed method outperforms the state-of-the-art in both qualitative and quantitative evaluations.

[CV-79] Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations

链接: https://arxiv.org/abs/2409.05243
作者: Xinran Li,Xiaomao Fan,Qingyang Wu,Xiaojiang Peng,Ye Li
关键词-EN: Recognition in Conversations, multimodal interaction research, Emotion Recognition, interaction research, dedicated to accurately
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Emotion Recognition in Conversations (ERC) is a vital area within multimodal interaction research, dedicated to accurately identifying and classifying the emotions expressed by speakers throughout a conversation. Traditional ERC approaches predominantly rely on unimodal cues-such as text, audio, or visual data-leading to limitations in their effectiveness. These methods encounter two significant challenges: 1) Consistency in multimodal information. Before integrating various modalities, it is crucial to ensure that the data from different sources is aligned and coherent. 2) Contextual information capture. Successfully fusing multimodal features requires a keen understanding of the evolving emotional tone, especially in lengthy dialogues where emotions may shift and develop over time. To address these limitations, we propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the ERC task. MaTAV has the advantages of aligning unimodal features to ensure consistency across different modalities and handling long input sequences to better capture contextual multimodal information. The extensive experiments on the MELD and IEMOCAP datasets demonstrate that MaTAV significantly outperforms existing state-of-the-art methods on the ERC task by a large margin.

[CV-80] A Low-Computational Video Synopsis Framework with a Standard Dataset

Link: https://arxiv.org/abs/2409.05230
Authors: Ramtin Malekpour(1),M. Mehrdad Morsali(1),Hoda Mohammadzade(1) ((1) Sharif University of Technology, Tehran, Iran)
Keywords-EN: condensing surveillance videos, Video synopsis, Video, video synopsis task, method for condensing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures

Click to view abstract

Abstract:Video synopsis is an efficient method for condensing surveillance videos. This technique begins with the detection and tracking of objects, followed by the creation of object tubes. These tubes consist of sequences, each containing chronologically ordered bounding boxes of a unique object. To generate a condensed video, the first step involves rearranging the object tubes to maximize the number of non-overlapping objects in each frame. Then, these tubes are stitched to a background image extracted from the source video. The lack of a standard dataset for the video synopsis task hinders the comparison of different video synopsis models. This paper addresses this issue by introducing a standard dataset, called SynoClip, designed specifically for the video synopsis task. SynoClip includes all the necessary features needed to evaluate various models directly and effectively. Additionally, this work introduces a video synopsis model, called FGS, with low computational cost. The model includes an empty-frame object detector to identify frames empty of any objects, facilitating efficient utilization of the deep object detector. Moreover, a tube grouping algorithm is proposed to maintain relationships among tubes in the synthesized video. This is followed by a greedy tube rearrangement algorithm, which efficiently determines the start time of each tube. Finally, the proposed model is evaluated using the proposed dataset. The source code, fine-tuned object detection model, and tutorials are available at this https URL.
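
The greedy tube rearrangement described above can be illustrated with a toy sketch (my own simplification for illustration, not the paper's FGS implementation): each tube is a list of per-frame bounding boxes, and each tube is assigned the earliest start time at which it collides with no previously placed tube.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for (x1, y1, x2, y2) boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def greedy_rearrange(tubes):
    """tubes: list of tubes, each a list of per-frame boxes in chronological
    order. Returns the start frame chosen for each tube: the earliest start
    at which the tube overlaps no already-placed tube in any shared frame."""
    placed = []   # (start_frame, boxes) of already-scheduled tubes
    starts = []
    for boxes in tubes:
        s = 0
        while any(
            0 <= s + j - s0 < len(other) and boxes_overlap(boxes[j], other[s + j - s0])
            for s0, other in placed
            for j in range(len(boxes))
        ):
            s += 1  # push the tube later until no collision remains
        placed.append((s, boxes))
        starts.append(s)
    return starts
```

Because each tube is placed at its earliest collision-free start, objects are packed densely into the synopsis timeline while never overlapping in a frame.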

[CV-81] Comparison of Two Augmentation Methods in Improving Detection Accuracy of Hemarthrosis

Link: https://arxiv.org/abs/2409.05225
Authors: Qianyu Fan,Pascal N. Tyrrell
Keywords-EN: rending medical diagnosis, machine learning models, augmentation techniques, augmentation, traditional augmentation techniques
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:With the increase of computing power, machine learning models in medical imaging have been introduced to help in rendering medical diagnosis and inspection for conditions like hemophilia, a rare disorder in which blood cannot clot normally. Often, one of the bottlenecks of detecting hemophilia is the lack of data available to train the algorithm to increase the accuracy. As a possible solution, this research investigated whether introducing augmented data by data synthesis or traditional augmentation techniques can improve model accuracy, helping to diagnose the diseases. To tackle this research, features of ultrasound images were extracted by the pre-trained VGG-16, and similarities were compared by the cosine similarity measure based on extracted features in different distributions among real images, synthetic images, and augmented images (Real vs. Real, Syn vs. Syn, Real vs. Different Batches of Syn, Real vs. Augmentation Techniques). Model testing performance was investigated using EfficientNet-B4 to recognize “blood” images with two augmentation methods. In addition, a gradient-weighted class activation mapping (Grad-CAM) visualization was used to interpret unexpected results like the loss of accuracy. Synthetic and real images do not show high similarity, with a mean similarity score of 0.4737. The synthetic batch 1 dataset and images by horizontal flip are more similar to the original images. Classic augmentation techniques and data synthesis can improve model accuracy, and data from traditional augmentation techniques perform better than synthetic data. In addition, the Grad-CAM heatmap showed that the loss of accuracy is due to a domain shift. Overall, this research found that the two augmentation methods, data synthesis and traditional augmentation techniques, can both improve accuracy to a certain extent, helping to diagnose rare diseases.
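
The cosine-similarity comparison of extracted features can be sketched as follows (a minimal illustration; the function names are mine, and the actual study compared VGG-16 feature vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (e.g. VGG-16 embeddings)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pairwise_similarity(feats_a, feats_b):
    """Mean cosine similarity over all cross pairs, as one would use to
    compare distributions of real vs. synthetic image features."""
    sims = [cosine_similarity(x, y) for x in feats_a for y in feats_b]
    return sum(sims) / len(sims)
```

A low mean score (such as the 0.4737 reported above) indicates the two feature distributions are not close.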

[CV-82] A Survey on Mixup Augmentations and Beyond

Link: https://arxiv.org/abs/2409.05202
Authors: Xin Jin,Hongyu Zhu,Siyuan Li,Zedong Wang,Zicheng Liu,Chang Yu,Huafeng Qin,Stan Z. Li
Keywords-EN: Deep Neural Networks, Deep Neural, Neural Networks, achieved thrilling breakthroughs, garnered increasing attention
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint V1 with 27 pages main text. Online project at this https URL

Click to view abstract

Abstract:As Deep Neural Networks have achieved thrilling breakthroughs in the past decade, data augmentations have garnered increasing attention as regularization techniques when massive labeled data are unavailable. Among existing augmentations, Mixup and related data-mixing methods that convexly combine selected samples and the corresponding labels are widely adopted because they yield high performance by generating data-dependent virtual data while easily migrating to various domains. This survey presents a comprehensive review of foundational mixup methods and their applications. We first elaborate on the training pipeline with mixup augmentations as a unified framework containing modules. This reformulated framework can contain various mixup methods and gives intuitive operational procedures. Then, we systematically investigate the applications of mixup augmentations on vision downstream tasks, various data modalities, and some analyses and theorems of mixup. Meanwhile, we conclude the current status and limitations of mixup research and point out further work for effective and efficient mixup augmentations. This survey can provide researchers with the current state of the art in mixup methods and provide insights and guidance in the mixup arena. An online project accompanying this survey is available at this https URL.
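
The core mixup operation the survey builds on, convexly combining two samples and their labels with a Beta-distributed coefficient, is small enough to sketch (vanilla mixup only; the surveyed variants differ mainly in how and what they mix):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Vanilla mixup: convexly combine two samples and their one-hot labels.
    lam ~ Beta(alpha, alpha), as in the original mixup formulation."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2   # virtual sample
    y = lam * y1 + (1 - lam) * y2   # soft label with the same coefficient
    return x, y, lam
```

Training then minimizes the ordinary loss on these virtual pairs, which acts as a data-dependent regularizer.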

[CV-83] Lung-DETR: Deformable Detection Transformer for Sparse Lung Nodule Anomaly Detection

Link: https://arxiv.org/abs/2409.05200
Authors: Hooman Ramezani,Dionne Aleman,Daniel Létourneau
Keywords-EN: Accurate lung nodule, real-world settings due, Accurate lung, computed tomography, scan imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurate lung nodule detection for computed tomography (CT) scan imagery is challenging in real-world settings due to the sparse occurrence of nodules and similarity to other anatomical structures. In a typical positive case, nodules may appear in as few as 3% of CT slices, complicating detection. To address this, we reframe the problem as an anomaly detection task, targeting rare nodule occurrences in a predominantly normal dataset. We introduce a novel solution leveraging custom data preprocessing and Deformable Detection Transformer (Deformable-DETR). A 7.5mm Maximum Intensity Projection (MIP) is utilized to combine adjacent lung slices into single images, reducing the slice count and decreasing nodule sparsity. This enhances spatial context, allowing for better differentiation between nodules and other structures such as complex vascular structures and bronchioles. Deformable-DETR is employed to detect nodules, with a custom focal loss function to better handle the imbalanced dataset. Our model achieves state-of-the-art performance on the LUNA16 dataset with an F1 score of 94.2% (95.2% recall, 93.3% precision) on a dataset sparsely populated with lung nodules that is reflective of real-world clinical data.
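
The Maximum Intensity Projection step can be sketched as a per-pixel maximum over groups of adjacent slices (an illustrative simplification; the slab size in slices depends on the scan's slice spacing, and only the 7.5 mm figure is from the paper):

```python
import numpy as np

def max_intensity_projection(volume, slab_slices):
    """Collapse groups of `slab_slices` adjacent axial slices into single MIP
    images by taking the per-pixel maximum (e.g. a 7.5 mm slab at a 2.5 mm
    slice spacing would use slab_slices = 3). volume: (num_slices, H, W).
    Trailing slices that do not fill a whole slab are dropped here."""
    n = (volume.shape[0] // slab_slices) * slab_slices
    slabs = volume[:n].reshape(-1, slab_slices, *volume.shape[1:])
    return slabs.max(axis=1)
```

Collapsing slices this way raises the fraction of images that contain a nodule, which is what reduces the sparsity the detector must cope with.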

[CV-84] Advanced Machine Learning Framework for Efficient Plant Disease Prediction

Link: https://arxiv.org/abs/2409.05174
Authors: Aswath Muthuselvam,S. Sowdeshwar,M. Saravanan,Satheesh K. Perepu
Keywords-EN: smart agriculture platforms, Machine Learning, smart agriculture, important component, agriculture platforms
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recently, Machine Learning (ML) methods have been built in as an important component in many smart agriculture platforms. In this paper, we explore a new combination of advanced ML methods for creating a smart agriculture platform where farmers could reach out for assistance from the public, or a closed circle of experts. Specifically, we focus on an easy way to assist the farmers in understanding plant diseases, where the farmers can get help to solve the issues from the members of the community. The proposed system utilizes deep learning techniques for identifying the disease of the plant from the affected image, which acts as an initial identifier. Further, Natural Language Processing techniques are employed for ranking the solutions posted by the user community. In this paper, a message channel is built on top of Twitter, a popular social media platform, to establish proper communication among farmers. Since the effect of the solutions can differ based on various other parameters, we extend the use of the concept drift approach to select a good solution and propose it to the farmer. We tested the proposed framework on a benchmark dataset, and it produces accurate and reliable results.

[CV-85] Exploring Fungal Morphology Simulation and Dynamic Light Containment from a Graphics Generation Perspective SIGGRAPH

Link: https://arxiv.org/abs/2409.05171
Authors: Kexin Wang,Ivy He,Jinke Li,Ali Asadipour,Yitong Sun
Keywords-EN: considered crucial techniques, Bio-Art creation, control are considered, considered crucial, crucial techniques
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: SIGGRAPH Asia 2024 Art Paper

Click to view abstract

Abstract:Fungal simulation and control are considered crucial techniques in Bio-Art creation. However, coding algorithms for reliable fungal simulations have posed significant challenges for artists. This study equates fungal morphology simulation to a two-dimensional graphic time-series generation problem. We propose a zero-coding, neural network-driven cellular automaton. Fungal spread patterns are learned through an image segmentation model and a time-series prediction model, which then supervise the training of neural network cells, enabling them to replicate real-world spreading behaviors. We further implemented dynamic containment of fungal boundaries with lasers. Synchronized with the automaton, the fungus successfully spreads into pre-designed complex shapes in reality.

[CV-86] CD-NGP: A Fast Scalable Continual Representation for Dynamic Scenes

Link: https://arxiv.org/abs/2409.05166
Authors: Zhenhuan Liu,Shuai Liu,Zhiwei Ning,Jie Yang,Wei Liu
Keywords-EN: present CD-NGP, dynamic scenes, fast and scalable, view synthesis, synthesis in dynamic
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, full version
 
Click to view abstract

Abstract:We present CD-NGP, a fast and scalable representation for 3D reconstruction and novel view synthesis in dynamic scenes. Inspired by continual learning, our method first segments input videos into multiple chunks, followed by training the model chunk by chunk, and finally fuses features of the first branch and subsequent branches. Experiments on the prevailing DyNeRF dataset demonstrate that our proposed novel representation reaches a great balance between memory consumption, model size, training speed, and rendering quality. Specifically, our method consumes 85% less training memory (<14 GB) than offline methods and requires significantly lower streaming bandwidth (<0.4 MB/frame) than other online alternatives.

[CV-87] Can OOD Object Detectors Learn from Foundation Models?

Link: https://arxiv.org/abs/2409.05162
Authors: Jiahui Liu,Xin Wen,Shizhen Zhao,Yingxian Chen,Xiaojuan Qi
Keywords-EN: challenging task due, OOD object detection, OOD, open-set OOD data, enhancing OOD object
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures

Click to view abstract

Abstract:Out-of-distribution (OOD) object detection is a challenging task due to the absence of open-set OOD data. Inspired by recent advancements in text-to-image generative models, such as Stable Diffusion, we study the potential of generative models trained on large-scale open-set data to synthesize OOD samples, thereby enhancing OOD object detection. We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundation models to automatically extract meaningful OOD data from text-to-image generative models. This offers the model access to open-world knowledge encapsulated within off-the-shelf foundation models. The synthetic OOD samples are then employed to augment the training of a lightweight, plug-and-play OOD detector, thus effectively optimizing the in-distribution (ID)/OOD decision boundaries. Extensive experiments across multiple benchmarks demonstrate that SyncOOD significantly outperforms existing methods, establishing new state-of-the-art performance with minimal synthetic data usage.

[CV-88] Image color consistency in datasets: the Smooth-TPS3D method

Link: https://arxiv.org/abs/2409.05159
Authors: Ismael Benito-Altamirano,David Martínez-Carpena,Hanna Lizarzaburu-Aguilar,Carles Ventura,Cristian Fàbrega,Joan Daniel Prades
Keywords-EN: digital imaging consistency, achieve image consistency, Image color consistency, key problem, problem in digital
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Instrumentation and Detectors (physics.ins-det); Optics (physics.optics)
Comments:

Click to view abstract

Abstract:Image color consistency is the key problem in digital imaging consistency when creating datasets. Here, we propose an improved 3D Thin-Plate Splines (TPS3D) color correction method to be used, in conjunction with color charts (i.e. Macbeth ColorChecker) or other machine-readable patterns, to achieve image consistency by post-processing. Also, we benchmark our method against its former implementation and the alternative methods reported to date with an augmented dataset based on the Gehler’s ColorChecker dataset. The benchmark covers how closely corrected images resemble the ground-truth images and how fast these implementations are. Results demonstrate that TPS3D is the best candidate to achieve image consistency. Furthermore, our Smooth-TPS3D method shows equivalent results compared to the original method and reduces the share of ill-conditioned scenarios, in which the previous method failed, from 11-15% to less than 1%. Moreover, we demonstrate that the Smooth-TPS method is 20% faster than the original method. Finally, we discuss how different methods offer different compromises between quality, correction accuracy and computational load.
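
A rough stand-in for the TPS3D idea, fitting a 3D thin-plate-spline map from measured chart colors to reference chart colors, can be written with SciPy's RBF interpolator (this is not the paper's implementation, which among other things adds smoothing to handle ill-conditioned cases):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def fit_color_correction(src_rgb, ref_rgb):
    """Fit a 3D thin-plate-spline map from measured chart colors (src_rgb)
    to reference chart colors (ref_rgb), both arrays of shape (N, 3)."""
    return RBFInterpolator(src_rgb, ref_rgb, kernel="thin_plate_spline")

def correct_image(model, image):
    """Apply the fitted color map to every pixel of an (H, W, 3) image."""
    h, w, _ = image.shape
    return model(image.reshape(-1, 3)).reshape(h, w, 3)
```

Because the fit interpolates the control points exactly, every chart patch maps onto its reference color, and the spline extends the correction smoothly to all other colors.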

[CV-89] Ultron: Enabling Temporal Geometry Compression of 3D Mesh Sequences using Temporal Correspondence and Mesh Deformation

Link: https://arxiv.org/abs/2409.05151
Authors: Haichao Zhu
Keywords-EN: computer vision, advancement of computer, significant progress, progress and found, reconstruction techniques
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:With the advancement of computer vision, dynamic 3D reconstruction techniques have seen significant progress and found applications in various fields. However, these techniques generate large amounts of 3D data sequences, necessitating efficient storage and transmission methods. Existing 3D model compression methods primarily focus on static models and do not consider inter-frame information, limiting their ability to reduce data size. Temporal mesh compression, which has received less attention, often requires all input meshes to have the same topology, a condition rarely met in real-world applications. This research proposes a method to compress mesh sequences with arbitrary topology using temporal correspondence and mesh deformation. The method establishes temporal correspondence between consecutive frames, applies a deformation model to transform the mesh from one frame to subsequent frames, and replaces the original meshes with deformed ones if the quality meets a tolerance threshold. Extensive experiments demonstrate that this method can achieve state-of-the-art performance in terms of compression performance. The contributions of this paper include a geometry and motion-based model for establishing temporal correspondence between meshes, a mesh quality assessment for temporal mesh sequences, an entropy-based encoding and corner table-based method for compressing mesh sequences, and extensive experiments showing the effectiveness of the proposed method. All the code will be open-sourced at this https URL.

[CV-90] Better Spanish Emotion Recognition In-the-wild: Bringing Attention to Deep Spectrum Voice Analysis

Link: https://arxiv.org/abs/2409.05148
Authors: Elena Ortega-Beltrán,Josep Cabacas-Maso,Ismael Benito-Altamirano,Carles Ventura
Keywords-EN: Socially Assistive Robots, Socially Assistive, Assistive Robots, key development factor, user emotional state
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Within the context of creating new Socially Assistive Robots, emotion recognition has become a key development factor, as it allows the robot to adapt to the user’s emotional state in the wild. In this work, we focused on the analysis of two Spanish voice recording datasets: ELRA-S0329 and EmoMatchSpanishDB. Specifically, we centered our work on the paralanguage, e.g., the vocal characteristics that go along with the message and clarify its meaning. We proposed the use of the DeepSpectrum method, which consists of extracting a visual representation of the audio tracks and feeding them to a pretrained CNN model. For the classification task, DeepSpectrum is often paired with a Support Vector Classifier (DS-SVC) or a Fully-Connected deep-learning classifier (DS-FC). We compared the results of the DS-SVC and DS-FC architectures with the state-of-the-art (SOTA) for ELRA-S0329 and EmoMatchSpanishDB. Moreover, we proposed our own classifier based upon Attention Mechanisms, namely DS-AM. We trained all models on both datasets, and we found that our DS-AM model outperforms the SOTA models for the datasets and the SOTA DeepSpectrum architectures. Finally, we trained our DS-AM model on one dataset and tested it on the other, to simulate real-world conditions and assess how biased the model is to the dataset it was trained on.
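
A toy attention-pooling head over DeepSpectrum-style frame features illustrates the kind of mechanism DS-AM adds on top of the CNN features (my own minimal stand-in, not the paper's architecture):

```python
import numpy as np

def attention_pool(frames, w):
    """Minimal attention pooling over time frames. frames: (T, D) features
    extracted from spectrogram windows; w: (D,) learned scoring vector.
    Each frame gets a softmax weight, and the output is the weighted sum."""
    scores = frames @ w                      # (T,) relevance score per frame
    z = scores - scores.max()                # numerically stable softmax
    alpha = np.exp(z) / np.exp(z).sum()      # attention weights, sum to 1
    return alpha @ frames                    # (D,) weighted utterance summary
```

With `w = 0` the weights are uniform and this reduces to average pooling; a trained `w` instead lets emotionally salient frames dominate the utterance-level representation.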

[CV-91] TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs

Link: https://arxiv.org/abs/2409.05142
Authors: Horatiu Florea,Sergiu Nedevschi
Keywords-EN: understanding systems face, systems face stringent, face stringent payload, stringent payload restrictions, inherently ill-posed problem
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Aerial scene understanding systems face stringent payload restrictions and must often rely on monocular depth estimation for modelling scene geometry, which is an inherently ill-posed problem. Moreover, obtaining accurate ground truth data required by learning-based methods raises significant additional challenges in the aerial domain. Self-supervised approaches can bypass this problem, at the cost of providing only up-to-scale results. Similarly, recent supervised solutions which make good progress towards zero-shot generalization also provide only relative depth values. This work presents TanDepth, a practical, online scale recovery method for obtaining metric depth results from relative estimations at inference-time, irrespective of the type of model generating them. Tailored for Unmanned Aerial Vehicle (UAV) applications, our method leverages sparse measurements from Global Digital Elevation Models (GDEM) by projecting them to the camera view using extrinsic and intrinsic information. An adaptation to the Cloth Simulation Filter is presented, which allows selecting ground points from the estimated depth map to then correlate with the projected reference points. We evaluate and compare our method against alternate scaling methods adapted for UAVs, on a variety of real-world scenes. Considering the limited availability of data for this domain, we construct and release a comprehensive, depth-focused extension to the popular UAVid dataset to further research.
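
The projection of GDEM points into the camera view, and a simple global-scale recovery from the resulting sparse metric depths, can be sketched as follows (a minimal pinhole-camera illustration; TanDepth's actual correlation step, which selects ground points with a Cloth Simulation Filter adaptation, is more involved):

```python
import numpy as np

def project_points(K, R, t, world_pts):
    """Pinhole projection of DEM points into the camera view.
    K: (3, 3) intrinsics; R: (3, 3) and t: (3,) world-to-camera extrinsics;
    world_pts: (N, 3). Returns (N, 2) pixel coordinates and (N,) depths."""
    cam = world_pts @ R.T + t           # world frame -> camera frame
    uvw = cam @ K.T                     # camera frame -> homogeneous pixels
    uv = uvw[:, :2] / uvw[:, 2:3]       # perspective divide
    return uv, cam[:, 2]

def recover_scale(rel_depth, metric_depth):
    """Global scale as the median ratio between metric reference depths and
    relative estimates at corresponding pixels (a common simple strategy;
    shown here only as an illustration of scale recovery)."""
    return float(np.median(np.asarray(metric_depth) / np.asarray(rel_depth)))
```

Multiplying the relative depth map by the recovered scale yields metric depth wherever the relative estimate is consistent.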

[CV-92] READoc: A Unified Benchmark for Realistic Document Structured Extraction

Link: https://arxiv.org/abs/2409.05137
Authors: Zichao Li,Aizier Abulaiti,Yaojie Lu,Xuanang Chen,Jia Zheng,Hongyu Lin,Xianpei Han,Le Sun
Keywords-EN: Document Structured Extraction, extract structured content, Structured Extraction, extract structured, structured content
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 2,233 diverse and real-world documents from arXiv and GitHub. In addition, we develop a DSE Evaluation S³uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general VLMs, we identify the gap between current work and the unified, realistic DSE objective for the first time. We aspire that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.

[CV-93] PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction

Link: https://arxiv.org/abs/2409.05125
Authors: Lei Sheng,Shuai-Shuai Xu
Keywords-EN: Portable Document Format, encompassing Portable Document, document data exists, encompassing Portable, Portable Document
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 4 figures

Click to view abstract

Abstract:Currently, a substantial volume of document data exists in an unstructured format, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, Plumb a PDF (pdfplumber), and Paddle Paddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can solely extract tables from digital PDFs and cannot handle image-based PDFs and pictures. On the other hand, PP-StructureV2 can comprehensively extract image-based PDFs and tables from pictures. Nevertheless, it lacks the ability to differentiate between diverse application scenarios, such as wired tables and wireless tables, digital PDFs, and image-based PDFs. To address these issues, we have introduced the PDF table extraction (PdfTable) toolkit. This toolkit integrates numerous open-source models, including seven table recognition models, four Optical Character Recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable achieves adaptability across various application scenarios. We substantiate the efficacy of the PdfTable toolkit through verification on a self-labeled wired table dataset and the open-source wireless Public Table Recognition Dataset (PubTabNet). The PdfTable code will be available on GitHub: this https URL.

[CV-94] PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation ECCV2024

Link: https://arxiv.org/abs/2409.05122
Authors: Ning Gao,Sanping Zhou,Le Wang,Nanning Zheng
Keywords-EN: widely adopted technique, medical image segmentation, Semi-supervised learning, medical image, image segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2024

Click to view abstract

Abstract:Semi-supervised learning has emerged as a widely adopted technique in the field of medical image segmentation. Existing works focus either on the construction of consistency constraints or on the generation of pseudo labels to provide high-quality supervisory signals, where the main challenge comes from how to keep continuously improving model capabilities. In this paper, we propose a simple yet effective semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation, whose goal is to generate high-fidelity pseudo labels by learning robust and diverse features in the training process. Specifically, our PMT employs a standard mean teacher to penalize the consistency of the current state and utilizes two sets of MT architectures for co-training. The two sets of MT architectures are individually updated for prolonged periods to maintain stable model diversity established through performance gaps generated by iteration differences. Additionally, a difference-driven alignment regularizer is employed to expedite the alignment of lagging models with the representation capabilities of leading models. Furthermore, a simple yet effective pseudo-label filtering algorithm is employed for facile evaluation of models and selection of the high-fidelity pseudo labels output when models are operating at high performance for co-training purposes. Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches across various dimensions. The code is available at this https URL.
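
The standard mean-teacher weight update underlying PMT can be sketched in a few lines (the generic EMA rule only; PMT's progressive co-training of two MT pairs and its delayed updates are layered on top):

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.99):
    """Mean-teacher update: teacher weights are an exponential moving
    average of student weights, so the teacher changes slowly and provides
    a stable target for consistency training."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

The student is trained by gradient descent, while the teacher is never trained directly; it only tracks the student via this update, which is what makes its pseudo labels temporally smoothed.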

[CV-95] DreamMapping: High-Fidelity Text-to-3D Generation via Variational Distribution Mapping

Link: https://arxiv.org/abs/2409.05099
Authors: Zeyu Cai,Duotun Wang,Yixun Liang,Zhijing Shao,Ying-Cong Chen,Xiaohang Zhan,Zeyu Wang
Keywords-EN: Score Distillation Sampling, Score Distillation, Distillation Sampling, distilling view-dependent information, content creation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 15 pages, 14 figures

Click to view abstract

Abstract:Score Distillation Sampling (SDS) has emerged as a prevalent technique for text-to-3D generation, enabling 3D content creation by distilling view-dependent information from text-to-2D guidance. However, they frequently exhibit shortcomings such as over-saturated color and excess smoothness. In this paper, we conduct a thorough analysis of SDS and refine its formulation, finding that the core design is to model the distribution of rendered images. Following this insight, we introduce a novel strategy called Variational Distribution Mapping (VDM), which expedites the distribution modeling process by regarding the rendered images as instances of degradation from diffusion-based generation. This special design enables the efficient training of variational distribution by skipping the calculations of the Jacobians in the diffusion U-Net. We also introduce timestep-dependent Distribution Coefficient Annealing (DCA) to further improve distilling precision. Leveraging VDM and DCA, we use Gaussian Splatting as the 3D representation and build a text-to-3D generation framework. Extensive experiments and evaluations demonstrate the capability of VDM and DCA to generate high-fidelity and realistic assets with optimization efficiency.
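
For reference, the standard SDS gradient that the paper analyzes and refines is usually written as follows (DreamFusion's formulation, with x = g(θ) the rendered image; this is background, not the paper's new VDM objective):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
    \big(\hat{\epsilon}_\phi(x_t;\, y, t) - \epsilon\big)\,
    \frac{\partial x}{\partial \theta} \right],
  \qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon
```

Here y is the text prompt, ε̂_φ is the diffusion model's noise prediction, and w(t) a timestep weighting; notably, the gradient skips the diffusion U-Net Jacobian, the same term whose calculation VDM avoids when training the variational distribution.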

[CV-96] Leveraging WaveNet for Dynamic Listening Head Modeling from Speech

Link: https://arxiv.org/abs/2409.05089
Authors: Minh-Duc Nguyen,Hyung-Jeong Yang,Seung-Won Kim,Ji-Eun Shin,Soo-Hyung Kim
Keywords-EN: facial responses aims, simulate interactive communication, interactive communication feedback, listener facial responses, facial responses
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The creation of listener facial responses aims to simulate interactive communication feedback from a listener during a face-to-face conversation. Our goal is to generate believable videos of listeners’ heads that respond authentically to a single speaker, using a sequence-to-sequence model that combines WaveNet with a Long Short-Term Memory network. Our approach focuses on capturing the subtle nuances of listener feedback, ensuring the preservation of individual listener identity while expressing appropriate attitudes and viewpoints. Experiment results show that our method surpasses the baseline models on the ViCo benchmark dataset.

[CV-97] Transformer with Leveraged Masked Autoencoder for video-based Pain Assessment

Link: https://arxiv.org/abs/2409.05088
Authors: Minh-Duc Nguyen,Hyung-Jeong Yang,Soo-Hyung Kim,Ji-Eun Shin,Seung-Won Kim
Keywords-EN: traditional methods relying, Accurate pain assessment, diagnosis and treatment, traditional methods, Accurate pain
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate pain assessment is crucial in healthcare for effective diagnosis and treatment; however, traditional methods relying on self-reporting are inadequate for populations unable to communicate their pain. Cutting-edge AI is promising for supporting clinicians in pain recognition using facial video data. In this paper, we enhance pain recognition by employing facial video analysis within a Transformer-based deep learning model. By combining a powerful Masked Autoencoder with a Transformer-based classifier, our model effectively captures pain level indicators through both expressions and micro-expressions. We conducted our experiment on the AI4Pain dataset, which produced promising results that pave the way for innovative healthcare solutions that are both comprehensive and objective.

[CV-98] PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions

Link: https://arxiv.org/abs/2409.05076
Authors: Yudong Zhang,Ruobing Xie,Jiansheng Chen,Xingwu Sun,Yu Wang
Keywords-EN: Large Vision-Language Models, powerful multimodal capabilities, Large Vision-Language, Vision-Language Models, multimodal capabilities
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM Multimedia 2024 BNI track (Oral)

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) have demonstrated their powerful multimodal capabilities. However, they also face serious safety problems, as adversaries can induce robustness issues in LVLMs through the use of well-designed adversarial examples. Therefore, LVLMs are in urgent need of detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns for clean images when presented with probe questions. We propose an unconventional method named PIP, which utilizes the attention patterns of one randomly selected irrelevant probe question (e.g., “Is there a clock?”) to distinguish adversarial examples from clean examples. Regardless of the image to be tested and its corresponding question, PIP only needs to perform one additional inference of the image to be tested and the probe question, and then achieves successful detection of adversarial examples. Even under black-box attacks and open dataset scenarios, our PIP, coupled with a simple SVM, still achieves more than 98% recall and a precision of over 90%. Our PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding and introspection within LVLMs. The code is available at this https URL.

[CV-99] Sight View Constraint for Robust Point Cloud Registration

Link: https://arxiv.org/abs/2409.05065
Authors: Yaojie Zhang,Weijun Wang,Tianlun Huang,Zhiyong Wang,Wei Feng
Keywords-EN: Partial Point Cloud, Point Cloud Registration, Point Cloud, low overlap rate, partial PCR
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Click to view abstract

Abstract:Partial to Partial Point Cloud Registration (partial PCR) remains a challenging task, particularly when dealing with a low overlap rate. In comparison to the full-to-full registration task, we find that the objective of partial PCR is still not well-defined, indicating no metric can reliably identify the true transformation. We identify this as the most fundamental challenge in partial PCR tasks. In this paper, instead of directly seeking the optimal transformation, we propose a novel and general Sight View Constraint (SVC) to conclusively identify incorrect transformations, thereby enhancing the robustness of existing PCR methods. Extensive experiments validate the effectiveness of SVC on both indoor and outdoor scenes. On the challenging 3DLoMatch dataset, our approach increases the registration recall from 78% to 82%, achieving the state-of-the-art result. This research also highlights the significance of the decision version problem of partial PCR, which has the potential to provide novel insights into the partial PCR problem.

[CV-100] Unsupervised Multimodal 3D Medical Image Registration with Multilevel Correlation Balanced Optimization MICCAI

链接: https://arxiv.org/abs/2409.05040
作者: Jiazheng Wang,Xiang Chen,Yuxi Zhang,Min Liu,Yaonan Wang,Hang Zhang
关键词-EN: Surgical navigation based, multimodal image registration, critical anatomical structures, providing intraoperative guidance, Surgical navigation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Method description for MICCAI Learn2Reg 2024 challenge

点击查看摘要

Abstract:Surgical navigation based on multimodal image registration has played a significant role in providing intraoperative guidance to surgeons by showing the relative position of the target area to critical anatomical structures during surgery. However, due to the differences between multimodal images and intraoperative image deformation caused by tissue displacement and removal during the surgery, effective registration of preoperative and intraoperative multimodal images faces significant challenges. To address the multimodal image registration challenges in Learn2Reg 2024, an unsupervised multimodal medical image registration method based on multilevel correlation balanced optimization (MCBO) is designed to solve these problems. First, the features of each modality are extracted based on the modality independent neighborhood descriptor, and the multimodal images are mapped to the feature space. Second, a multilevel pyramidal fusion optimization mechanism is designed to achieve global optimization and local detail complementation of the deformation field through dense correlation analysis and weight-balanced coupled convex optimization for input features at different scales. For preoperative medical images in different modalities, the alignment and stacking of valid information between different modalities is achieved by the maximum fusion between deformation fields. Our method focuses on the ReMIND2Reg task in Learn2Reg 2024, and to verify the generality of the method, we also tested it on the COMULIS3DCLEM task. Based on the results, our method achieved second place in the validation of both tasks.

[CV-101] Deep Self-cleansing for Medical Image Segmentation with Noisy Labels

链接: https://arxiv.org/abs/2409.05024
作者: Jiahua Dong,Yue Zhang,Qiuli Wang,Ruofeng Tong,Shihong Ying,Shaolin Gong,Xuanpu Zhang,Lanfen Lin,Yen-Wei Chen
关键词-EN: Medical image segmentation, medical imaging, Medical image, aiding in disease, surgical planning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:Medical image segmentation is crucial in the field of medical imaging, aiding in disease diagnosis and surgical planning. Most established segmentation methods rely on supervised deep learning, in which clean and precise labels are essential for supervision and significantly impact the performance of models. However, manually delineated labels often contain noise, such as missing labels and inaccurate boundary delineation, which can hinder networks from correctly modeling target characteristics. In this paper, we propose a deep self-cleansing segmentation framework that can preserve clean labels while cleansing noisy ones in the training phase. To achieve this, we devise a gaussian mixture model-based label filtering module that distinguishes noisy labels from clean labels. Additionally, we develop a label cleansing module to generate pseudo low-noise labels for identified noisy samples. The preserved clean labels and pseudo-labels are then used jointly to supervise the network. Validated on a clinical liver tumor dataset and a public cardiac diagnosis dataset, our method can effectively suppress the interference from noisy labels and achieve prominent segmentation performance.
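
The Gaussian-mixture label filtering step follows a well-known recipe: fit a two-component mixture to per-sample training losses and treat the low-loss mode as clean. A minimal sketch on synthetic losses (the loss values and mixture proportions are illustrative, not from the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical per-sample losses: samples with clean labels tend to have low
# loss, noisy labels high loss (the memorization effect exploited here).
clean_losses = rng.normal(0.2, 0.05, size=300)
noisy_losses = rng.normal(1.5, 0.2, size=100)
losses = np.concatenate([clean_losses, noisy_losses]).reshape(-1, 1)

# A two-component GMM separates the loss distribution into two modes;
# the component with the lower mean is taken to be the clean one.
gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
clean_component = int(np.argmin(gmm.means_))
is_clean = gmm.predict(losses) == clean_component

print(f"flagged clean: {is_clean.sum()} / {len(losses)}")
```

In the full framework, samples flagged noisy would then be routed to the label cleansing module for pseudo-label generation rather than discarded.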

[CV-102] Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector ICASSP2025

链接: https://arxiv.org/abs/2409.05005
作者: Hongbo Wang,Junyu Lu,Yan Han,Liang Yang,Hongfei Lin
关键词-EN: Patronizing and Condescending, Condescending Language, targeting vulnerable groups, toxic speech targeting, threatening both online
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Under review in ICASSP 2025

点击查看摘要

Abstract:Patronizing and Condescending Language (PCL) is a form of discriminatory toxic speech targeting vulnerable groups, threatening both online and offline safety. While toxic speech research has mainly focused on overt toxicity, such as hate speech, microaggressions in the form of PCL remain underexplored. Additionally, dominant groups’ discriminatory facial expressions and attitudes toward vulnerable communities can be more impactful than verbal cues, yet these frame features are often overlooked. In this paper, we introduce the PCLMM dataset, the first Chinese multimodal dataset for PCL, consisting of 715 annotated videos from Bilibili, with high-quality PCL facial frame spans. We also propose the MultiPCL detector, featuring a facial expression detection module for PCL recognition, demonstrating the effectiveness of modality complementarity in this challenging task. Our work makes an important contribution to advancing microaggression detection within the domain of toxic speech.

[CV-103] Visual Grounding with Multi-modal Conditional Adaptation ACM-MM2024

链接: https://arxiv.org/abs/2409.04999
作者: Ruilin Yao,Shengwu Xiong,Yichen Zhao,Yi Rong
关键词-EN: natural language expressions, Visual, language expressions, natural language, Visual grounding
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Accepted by ACM MM 2024 [Oral]

点击查看摘要

Abstract:Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we utilize a set of weighting coefficients, which are generated from the multimodal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate the lightweight and efficiency of our method. Our source code is available at: this https URL.
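
A rough sketch of the conditional-adaptation idea: coefficients derived from a fused multimodal embedding mix a bank of weight-update bases, which are added to a frozen visual-encoder weight. All shapes, the basis bank, and the coefficient generator `proj` are hypothetical stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_bases, d_emb = 16, 8, 4, 32

bases = rng.normal(size=(n_bases, d_out, d_in))  # weight-update bases
W_base = rng.normal(size=(d_out, d_in))          # frozen encoder weight

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adapted_weight(multimodal_embedding, proj):
    # Weighting coefficients generated from the fused embedding
    # (`proj` stands in for the paper's coefficient generator).
    coeffs = softmax(proj @ multimodal_embedding)   # (n_bases,)
    delta = np.tensordot(coeffs, bases, axes=1)     # mix the bases
    return W_base + delta                           # text-conditioned weight

proj = rng.normal(size=(n_bases, d_emb))
emb = rng.normal(size=d_emb)   # fused image+text embedding
W = adapted_weight(emb, proj)
print(W.shape)
```

The key property is that the same frozen image now yields different effective encoder weights for different text queries, which is what the abstract argues a text-independent encoder cannot do.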

[CV-104] 2DSig-Detect: a semi-supervised framework for anomaly detection on image data using 2D-signatures

链接: https://arxiv.org/abs/2409.04982
作者: Xinheng Xie,Kureha Yamaguchi,Margaux Leblanc,Simon Malzard,Varun Chhabra,Victoria Nockles,Yue Wu
关键词-EN: machine learning technologies, machine learning models, learning technologies raises, technologies raises questions, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The rapid advancement of machine learning technologies raises questions about the security of machine learning models, with respect to both training-time (poisoning) and test-time (evasion, impersonation, and inversion) attacks. Models performing image-related tasks, e.g. detection, and classification, are vulnerable to adversarial attacks that can degrade their performance and produce undesirable outcomes. This paper introduces a novel technique for anomaly detection in images called 2DSig-Detect, which uses a 2D-signature-embedded semi-supervised framework rooted in rough path theory. We demonstrate our method in adversarial settings for training-time and test-time attacks, and benchmark our framework against other state of the art methods. Using 2DSig-Detect for anomaly detection, we show both superior performance and a reduction in the computation time to detect the presence of adversarial perturbations in images.

[CV-105] Multi-V2X: A Large Scale Multi-modal Multi-penetration-rate Dataset for Cooperative Perception

链接: https://arxiv.org/abs/2409.04980
作者: Rongsong Li,Xin Pei
关键词-EN: garnered significant attention, recent years due, enhance long-distance perception, garnered significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Cooperative perception through vehicle-to-everything (V2X) has garnered significant attention in recent years due to its potential to overcome occlusions and enhance long-distance perception. Great achievements have been made in both datasets and algorithms. However, existing real-world datasets are limited by the presence of few communicable agents, while synthetic datasets typically cover only vehicles. More importantly, the penetration rate of connected and autonomous vehicles (CAVs), a critical factor for the deployment of cooperative perception technologies, has not been adequately addressed. To tackle these issues, we introduce Multi-V2X, a large-scale, multi-modal, multi-penetration-rate dataset for V2X perception. By co-simulating SUMO and CARLA, we equip a substantial number of cars and roadside units (RSUs) in simulated towns with sensor suites, and collect comprehensive sensing data. Datasets with specified CAV penetration rates can be obtained by masking some equipped cars as normal vehicles. In total, our Multi-V2X dataset comprises 549k RGB frames, 146k LiDAR frames, and 4,219k annotated 3D bounding boxes across six categories. The highest possible CAV penetration rate reaches 86.21%, with up to 31 agents in communication range, posing new challenges in selecting agents to collaborate with. We provide comprehensive benchmarks for cooperative 3D object detection tasks. Our data and code are available at this https URL .
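
The penetration-rate masking described above can be sketched as simple subsampling of the equipped agents; the agent count and target rate below are illustrative, not the dataset's actual numbers:

```python
import random

def mask_to_penetration_rate(agent_ids, rate, seed=0):
    """Keep a `rate` fraction of equipped vehicles as CAVs; mask the rest
    as ordinary (non-communicating) vehicles."""
    rng = random.Random(seed)
    n_cavs = round(len(agent_ids) * rate)
    cavs = set(rng.sample(agent_ids, n_cavs))
    return [(agent, agent in cavs) for agent in agent_ids]

# Hypothetical scene with 29 equipped cars; derive a ~40% penetration split.
agents = list(range(29))
assignment = mask_to_penetration_rate(agents, 0.4)
n_connected = sum(is_cav for _, is_cav in assignment)
print(f"{n_connected}/{len(agents)} agents remain CAVs")
```

Because every car is sensor-equipped in simulation, any penetration rate up to the stated 86.21% maximum can be derived from the same raw recordings this way.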

[CV-106] RCBEVDet: Toward High-accuracy Radar-Camera Fusion 3D Perception Network CVPR2024

链接: https://arxiv.org/abs/2409.04979
作者: Zhiwei Lin,Zhe Liu,Yongtao Wang,Le Zhang,Ce Zhu
关键词-EN: Perceiving the surrounding, autonomous driving, autonomous driving systems, modern autonomous driving, surrounding environment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The extended work of RCBEVDet (CVPR2024)

点击查看摘要

Abstract:Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. Firstly, RadarBEVNet encodes sparse radar points into a dense bird’s-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section aware BEV encoder. Secondly, the CAMF module utilizes a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet’s capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.

[CV-107] Time-independent Spiking Neuron via Membrane Potential Estimation for Efficient Spiking Neural Networks

链接: https://arxiv.org/abs/2409.04978
作者: Hanqi Chen,Lixing Yu,Shaojie Zhan,Penghui Yao,Jiankun Shao
关键词-EN: artificial neural networks, spiking neural networks, neural networks, extended encoding periods, encoding periods compared
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The computational inefficiency of spiking neural networks (SNNs) is primarily due to the sequential updates of membrane potential, which becomes more pronounced during extended encoding periods compared to artificial neural networks (ANNs). This highlights the need to parallelize SNN computations effectively to leverage available hardware parallelism. To address this, we propose Membrane Potential Estimation Parallel Spiking Neurons (MPE-PSN), a parallel computation method for spiking neurons that enhances computational efficiency by enabling parallel processing while preserving the intrinsic dynamic characteristics of SNNs. Our approach exhibits promise for enhancing computational efficiency, particularly under conditions of elevated neuron density. Empirical experiments demonstrate that our method achieves state-of-the-art (SOTA) accuracy and efficiency on neuromorphic datasets without requiring additional training parameters. Codes are available at this https URL.
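
To see why membrane-potential updates can be parallelized at all, consider leaky integration without the reset mechanism (the reset is the part MPE-PSN actually has to approximate; this sketch only shows the parallelizable core of the recurrence):

```python
import numpy as np

T, decay = 8, 0.5
rng = np.random.default_rng(0)
currents = rng.normal(size=T)   # input current per timestep

# Sequential update (no reset, for illustration): V[t] = decay*V[t-1] + I[t]
v, seq = 0.0, []
for i in currents:
    v = decay * v + i
    seq.append(v)
seq = np.array(seq)

# The same recurrence unrolled in closed form computes all timesteps at once:
# V[t] = sum_{s<=t} decay^(t-s) * I[s]  — one matrix product, no time loop.
t = np.arange(T)
K = np.where(t[:, None] >= t[None, :],
             decay ** (t[:, None] - t[None, :]), 0.0)
par = K @ currents

print(np.allclose(seq, par))  # both formulations agree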

[CV-108] Enhancing Convolutional Neural Networks with Higher-Order Numerical Difference Methods

链接: https://arxiv.org/abs/2409.04977
作者: Qi Wang,Zijun Gao,Mingxiu Sui,Taiyuan Mei,Xiaohan Cheng,Iris Li
关键词-EN: Convolutional Neural Networks, Convolutional Neural, deep learning technology, practical applications, real-world problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the rise of deep learning technology in practical applications, Convolutional Neural Networks (CNNs) have been able to assist humans in solving many real-world problems. To enhance the performance of CNNs, numerous network architectures have been explored. Some of these architectures are designed based on the accumulated experience of researchers over time, while others are designed through neural architecture search methods. The improvements made to CNNs by the aforementioned methods are quite significant, but most of the improvement methods are limited in reality by model size and environmental constraints, making it difficult to fully realize the improved performance. In recent years, research has found that many CNN structures can be explained by the discretization of ordinary differential equations. This implies that we can design theoretically supported deep network structures using higher-order numerical difference methods. It should be noted that most of the previous CNN model structures are based on low-order numerical methods. Therefore, considering that the accuracy of linear multi-step numerical difference methods is higher than that of the forward Euler method, this paper proposes a stacking scheme based on the linear multi-step method. This scheme enhances the performance of ResNet without increasing the model size and compares it with the Runge-Kutta scheme. The experimental results show that the performance of the stacking scheme proposed in this paper is superior to existing stacking schemes (ResNet and HO-ResNet), and it has the capability to be extended to other types of neural networks.
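
The claim that linear multi-step schemes are more accurate than forward Euler can be checked on a toy ODE. The sketch below compares forward Euler with a two-step Adams-Bashforth rule, one standard linear multi-step method, chosen here for illustration rather than taken from the paper:

```python
import math

# Toy ODE dy/dt = -y, y(0) = 1; exact solution y(1) = e^{-1}.
f = lambda y: -y
h, steps = 0.1, 10
exact = math.exp(-1)

# Forward Euler (first order) — the scheme ResNet-like stacks mirror.
y = 1.0
for _ in range(steps):
    y += h * f(y)
err_euler = abs(y - exact)

# Two-step Adams-Bashforth (second order), bootstrapped with one Euler step —
# the kind of linear multi-step rule a higher-order stacking scheme follows.
y_prev, y = 1.0, 1.0 + h * f(1.0)
for _ in range(steps - 1):
    y_prev, y = y, y + h * (1.5 * f(y) - 0.5 * f(y_prev))
err_ab2 = abs(y - exact)

print(f"Euler error: {err_euler:.5f}, AB2 error: {err_ab2:.5f}")
```

With the same number of function evaluations per step, the multi-step rule cuts the error by more than an order of magnitude, which is the analogy the paper draws for stacking residual blocks.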

[CV-109] PatchAlign:Fair and Accurate Skin Disease Image Classification by Alignment with Clinical Labels MICCAI2024

链接: https://arxiv.org/abs/2409.04975
作者: Aayushman,Hemanth Gaddey,Vidhi Mittal,Manisha Chawla,Gagan Raj Gupta
关键词-EN: achieved great success, Graph Optimal Transport, Deep learning models, Deep learning, Masked Graph Optimal
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024. Early Accept Paper (amongst the top 11% of 2869 papers submitted)

点击查看摘要

Abstract:Deep learning models have achieved great success in automating skin lesion diagnosis. However, the ethnic disparity in these models’ predictions needs to be addressed before deploying them. We introduce a novel approach, PatchAlign, to enhance skin condition image classification accuracy and fairness by aligning with clinical text representations of skin conditions. PatchAlign uses Graph Optimal Transport (GOT) Loss as a regularizer to perform cross-domain alignment. The representations obtained are robust and generalize well across skin tones, even with limited training samples. To reduce the effect of noise and artifacts in clinical dermatology images, we propose a learnable Masked Graph Optimal Transport for cross-domain alignment that further improves fairness metrics. We compare our model to the state-of-the-art FairDisCo on two skin lesion datasets with different skin types: Fitzpatrick17k and Diverse Dermatology Images (DDI). PatchAlign enhances the accuracy of skin condition image classification by 2.8% (in-domain) and 6.2% (out-domain) on Fitzpatrick17k, and 4.2% (in-domain) on DDI compared to FairDisCo. Additionally, it consistently improves the fairness of true positive rates across skin tones. The source code for the implementation is available at the following GitHub repository: this https URL, enabling easy reproduction and further experimentation.
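
Graph Optimal Transport losses build on entropy-regularized optimal transport. Below is a minimal Sinkhorn sketch aligning hypothetical image-patch features with clinical-text token features; the feature sizes and regularization value are assumptions, and the full GOT loss adds a graph (Gromov-Wasserstein) term not shown here:

```python
import numpy as np

def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropy-regularized OT between uniform marginals — the transport-plan
    computation underlying GOT-style alignment losses."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(6, 8))   # hypothetical image-patch features
text_feats = rng.normal(size=(4, 8))    # hypothetical clinical-label tokens
cost = 1.0 - (patch_feats @ text_feats.T) / (
    np.linalg.norm(patch_feats, axis=1, keepdims=True)
    * np.linalg.norm(text_feats, axis=1)[None, :]
)                                       # cosine distance

plan = sinkhorn(cost)
alignment_loss = float((plan * cost).sum())  # transport cost as the loss term
print(f"alignment loss: {alignment_loss:.3f}")
```

Used as a regularizer, minimizing this transport cost pulls patch features toward the clinical text representation regardless of skin tone, which is where the fairness gains come from.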

[CV-110] Natias: Neuron Attribution based Transferable Image Adversarial Steganography

链接: https://arxiv.org/abs/2409.04968
作者: Zexin Fan,Kejiang Chen,Kai Zeng,Jiansong Zhang,Weiming Zhang,Nenghai Yu
关键词-EN: conceal secret messages, secret messages, digital images, messages within digital, adversarial steganography
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: Accepted by IEEE TIFS

点击查看摘要

Abstract:Image steganography is a technique to conceal secret messages within digital images. Steganalysis, on the contrary, aims to detect the presence of secret messages within images. Recently, deep-learning-based steganalysis methods have achieved excellent detection performance. As a countermeasure, adversarial steganography has garnered considerable attention due to its ability to effectively deceive deep-learning-based steganalysis. However, steganalysts often employ unknown steganalytic models for detection. Therefore, the ability of adversarial steganography to deceive non-target steganalytic models, known as transferability, becomes especially important. Nevertheless, existing adversarial steganographic methods do not consider how to enhance transferability. To address this issue, we propose a novel adversarial steganographic scheme named Natias. Specifically, we first attribute the output of a steganalytic model to each neuron in the target middle layer to identify critical features. Next, we corrupt these critical features that may be adopted by diverse steganalytic models. Consequently, it can promote the transferability of adversarial steganography. Our proposed method can be seamlessly integrated with existing adversarial steganography frameworks. Thorough experimental analyses affirm that our proposed technique possesses improved transferability when contrasted with former approaches, and it attains heightened security in retraining scenarios.

[CV-111] GS-PT: Exploiting 3D Gaussian Splatting for Comprehensive Point Cloud Understanding via Self-supervised Learning

链接: https://arxiv.org/abs/2409.04963
作者: Keyi Liu,Yeqi Luo,Weidong Yang,Jingyi Xu,Zhijun Li,Wen-Ming Chen,Ben Fei
关键词-EN: learn meaningful representations, manual annotations, Self-supervised learning, learn meaningful, meaningful representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-supervised learning of point cloud aims to leverage unlabeled 3D data to learn meaningful representations without reliance on manual annotations. However, current approaches face challenges such as limited data diversity and inadequate augmentation for effective feature learning. To address these challenges, we propose GS-PT, which integrates 3D Gaussian Splatting (3DGS) into point cloud self-supervised learning for the first time. Our pipeline utilizes transformers as the backbone for self-supervised pre-training and introduces novel contrastive learning tasks through 3DGS. Specifically, the transformers aim to reconstruct the masked point cloud. 3DGS utilizes multi-view rendered images as input to generate enhanced point cloud distributions and novel view images, facilitating data augmentation and cross-modal contrastive learning. Additionally, we incorporate features from depth maps. By optimizing these tasks collectively, our method enriches the tri-modal self-supervised learning process, enabling the model to leverage the correlation across 3D point clouds and 2D images from various modalities. We freeze the encoder after pre-training and test the model’s performance on multiple downstream tasks. Experimental results indicate that GS-PT outperforms the off-the-shelf self-supervised learning methods on various downstream tasks including 3D object classification, real-world classifications, and few-shot learning and segmentation.

[CV-112] DDNet: Deformable Convolution and Dense FPN for Surface Defect Detection in Recycled Books

链接: https://arxiv.org/abs/2409.04958
作者: Jun Yu,WenJian Wang
关键词-EN: second-hand goods market, worth largely dependent, reused textbooks, hold significant, goods market
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recycled and recirculated books, such as ancient texts and reused textbooks, hold significant value in the second-hand goods market, with their worth largely dependent on surface preservation. However, accurately assessing surface defects is challenging due to the wide variations in shape, size, and the often imprecise detection of defects. To address these issues, we propose DDNet, an innovative detection model designed to enhance defect localization and classification. DDNet introduces a surface defect feature extraction module based on a deformable convolution operator (DC) and a densely connected FPN module (DFPN). The DC module dynamically adjusts the convolution grid to better align with object contours, capturing subtle shape variations and improving boundary delineation and prediction accuracy. Meanwhile, DFPN leverages dense skip connections to enhance feature fusion, constructing a hierarchical structure that generates multi-resolution, high-fidelity feature maps, thus effectively detecting defects of various sizes. In addition to the model, we present a comprehensive dataset specifically curated for surface defect detection in recycled and recirculated books. This dataset encompasses a diverse range of defect types, shapes, and sizes, making it ideal for evaluating the robustness and effectiveness of defect detection models. Through extensive evaluations, DDNet achieves precise localization and classification of surface defects, recording a mAP value of 46.7% on our proprietary dataset - an improvement of 14.2% over the baseline model - demonstrating its superior detection capabilities.

[CV-113] Deep Bayesian Active Learning-to-Rank with Relative Annotation for Estimation of Ulcerative Colitis Severity

链接: https://arxiv.org/abs/2409.04952
作者: Takeaki Kadota,Hideaki Hayashi,Ryoma Bise,Kiyohito Tanaka,Seiichi Uchida
关键词-EN: Automatic image-based severity, Automatic image-based, image-based severity estimation, computer-aided diagnosis, important task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 8 figures, accepted in Medical Image Analysis 2024

点击查看摘要

Abstract:Automatic image-based severity estimation is an important task in computer-aided diagnosis. Severity estimation by deep learning requires a large amount of training data to achieve a high performance. In general, severity estimation uses training data annotated with discrete (i.e., quantized) severity labels. Annotating discrete labels is often difficult in images with ambiguous severity, and the annotation cost is high. In contrast, relative annotation, in which the severity between a pair of images is compared, can avoid quantizing severity and thus makes it easier. We can estimate relative disease severity using a learning-to-rank framework with relative annotations, but relative annotation has the problem of the enormous number of pairs that can be annotated. Therefore, the selection of appropriate pairs is essential for relative annotation. In this paper, we propose a deep Bayesian active learning-to-rank that automatically selects appropriate pairs for relative annotation. Our method preferentially annotates unlabeled pairs with high learning efficiency from the model uncertainty of the samples. We prove the theoretical basis for adapting Bayesian neural networks to pairwise learning-to-rank and demonstrate the efficiency of our method through experiments on endoscopic images of ulcerative colitis on both private and public datasets. We also show that our method achieves a high performance under conditions of significant class imbalance because it automatically selects samples from the minority classes.
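
A minimal sketch of uncertainty-driven pair selection: score each image with several stochastic (MC-dropout-style) forward passes, convert score differences into pairwise "ranked above" probabilities, and annotate the pair whose probability varies most across passes. All numbers are synthetic and this max-variance rule is a simplification of the paper's Bayesian criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_mc = 6, 50

# Hypothetical MC-dropout severity scores: n_mc stochastic passes per image.
means = rng.uniform(0, 3, n_samples)      # per-image mean severity
stds = rng.uniform(0.05, 0.6, n_samples)  # per-image model uncertainty
scores = rng.normal(loc=means, scale=stds, size=(n_mc, n_samples))

# For each candidate pair, the Bradley-Terry "i ranked above j" probability
# per pass; its variance across passes is the uncertainty for that pair.
best_pair, best_var = None, -1.0
for i in range(n_samples):
    for j in range(i + 1, n_samples):
        p = 1.0 / (1.0 + np.exp(-(scores[:, i] - scores[:, j])))
        if p.var() > best_var:
            best_pair, best_var = (i, j), p.var()

print(f"most informative pair to annotate: {best_pair}")
```

This is the mechanism that lets relative annotation scale: instead of labeling all O(n^2) pairs, only the pairs the model is least sure about are sent to the annotator.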

[CV-114] Attention-Based Efficient Breath Sound Removal in Studio Audio Recordings

链接: https://arxiv.org/abs/2409.04949
作者: Nidula Elgiriyewithana,N. D.Kodikara
关键词-EN: attention U-Net architecture, non-speech vocal sounds, specifically breath sounds, vocal recordings, non-speech vocal
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In this research, we present an innovative, parameter-efficient model that utilizes the attention U-Net architecture for the automatic detection and eradication of non-speech vocal sounds, specifically breath sounds, in vocal recordings. This task is of paramount importance in the field of sound engineering, despite being relatively under-explored. The conventional manual process for detecting and eliminating these sounds requires significant expertise and is extremely time-intensive. Existing automated detection and removal methods often fall short in terms of efficiency and precision. Our proposed model addresses these limitations by offering a streamlined process and superior accuracy, achieved through the application of advanced deep learning techniques. A unique dataset, derived from Device and Produced Speech (DAPS), was employed for this purpose. The training phase of the model emphasizes a log spectrogram and integrates an early stopping mechanism to prevent overfitting. Our model not only conserves precious time for sound engineers but also enhances the quality and consistency of audio production. This constitutes a significant breakthrough, as evidenced by its comparative efficiency, necessitating only 1.9M parameters and a training duration of 3.2 hours - markedly less than the top-performing models in this domain. The model is capable of generating identical outputs as previous models with drastically improved precision, making it an optimal choice.
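
The log-spectrogram input mentioned in the training setup can be sketched with a plain STFT; the FFT size and hop length below are assumptions, not the paper's configuration:

```python
import numpy as np

def log_spectrogram(signal, n_fft=256, hop=128, eps=1e-8):
    """Log-magnitude STFT of the kind used as model input (a sketch;
    the paper's exact STFT parameters are not specified here)."""
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.log(spec + eps)  # log compresses the large dynamic range

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)   # 1 s of a 440 Hz tone as dummy input
S = log_spectrogram(audio)
print(S.shape)                        # (time frames, frequency bins)
```

A time-frequency image like this is what the attention U-Net consumes and masks, treating breath removal as an image-to-image task.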

[CV-115] Fast Deep Predictive Coding Networks for Videos Feature Extraction without Labels

链接: https://arxiv.org/abs/2409.04945
作者: Wenqian Xue,Chi Ding,Jose Principe
关键词-EN: Brain-inspired deep predictive, predictive coding networks, bi-directional information flow, deep predictive coding, Brain-inspired deep
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Brain-inspired deep predictive coding networks (DPCNs) effectively model and capture video features through a bi-directional information flow, even without labels. They are based on an overcomplete description of video scenes, and one of the bottlenecks has been the lack of effective sparsification techniques to find discriminative and robust dictionaries. FISTA has been the best alternative. This paper proposes a DPCN with a fast inference of internal model variables (states and causes) that achieves high sparsity and accuracy of feature clustering. The proposed unsupervised learning procedure, inspired by adaptive dynamic programming with a majorization-minimization framework, and its convergence are rigorously analyzed. Experiments in the data sets CIFAR-10, Super Mario Bros video game, and Coil-100 validate the approach, which outperforms previous versions of DPCNs on learning rate, sparsity ratio, and feature clustering accuracy. Because of DPCN’s solid foundation and explainability, this advance opens the door for general applications in object recognition in video without labels.
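
Since the paper positions its fast inference against FISTA-based sparse coding, a compact FISTA sketch helps fix ideas: recover sparse causes from a random dictionary. Dictionary sizes, sparsity level, and the regularization weight are illustrative, not taken from the paper:

```python
import numpy as np

def fista(D, x, lam=0.01, n_iters=300):
    """FISTA for the lasso  min_s 0.5||x - D s||^2 + lam ||s||_1 — the
    sparse-inference baseline that DPCN variants are compared against."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    s = z = np.zeros(D.shape[1])
    t = 1.0
    for _ in range(n_iters):
        g = z - (D.T @ (D @ z - x)) / L  # gradient step on the smooth part
        s_next = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrink
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = s_next + ((t - 1) / t_next) * (s_next - s)  # momentum step
        s, t = s_next, t_next
    return s

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 64))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
s_true = np.zeros(64)
s_true[[3, 17, 42]] = [1.0, -2.0, 1.5]   # sparse ground-truth causes
x = D @ s_true

s_hat = fista(D, x)
print(f"nonzeros recovered: {np.count_nonzero(np.abs(s_hat) > 0.5)}")
```

The cost of running iterations like these at every frame is exactly the inference bottleneck the proposed procedure is designed to avoid.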

[CV-116] MoistNet: Machine Vision-based Deep Learning Models for Wood Chip Moisture Content Measurement

链接: https://arxiv.org/abs/2409.04920
作者: Abdur Rahman,Jason Street,James Wooten,Mohammad Marufuzzaman,Veera G. Gude,Randy Buchanan,Haifeng Wang
关键词-EN: numerous forest-reliant industries, Quick and reliable, moisture content, chip moisture content, wood chip moisture
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quick and reliable measurement of wood chip moisture content is an everlasting problem for numerous forest-reliant industries such as biofuel, pulp and paper, and bio-refineries. Moisture content is a critical attribute of wood chips due to its direct relationship with the final product quality. Conventional techniques for determining moisture content, such as oven-drying, possess some drawbacks in terms of their time-consuming nature, potential sample damage, and lack of real-time feasibility. Furthermore, alternative techniques, including NIR spectroscopy, electrical capacitance, X-rays, and microwaves, have demonstrated potential; nevertheless, they are still constrained by issues related to portability, precision, and the expense of the required equipment. Hence, there is a need for a moisture content determination method that is instant, portable, non-destructive, inexpensive, and precise. This study explores the use of deep learning and machine vision to predict moisture content classes from RGB images of wood chips. A large-scale image dataset comprising 1,600 RGB images of wood chips has been collected and annotated with ground truth labels, utilizing the results of the oven-drying technique. Two high-performing neural networks, MoistNetLite and MoistNetMax, have been developed leveraging Neural Architecture Search (NAS) and hyperparameter optimization. The developed models are evaluated and compared with state-of-the-art deep learning models. Results demonstrate that MoistNetLite achieves 87% accuracy with minimal computational overhead, while MoistNetMax exhibits exceptional precision with a 91% accuracy in wood chip moisture content class prediction. With improved accuracy and faster prediction speed, our proposed MoistNet models hold great promise for the wood chip processing industry.

[CV-117] Training-free ZS-CIR via Weighted Modality Fusion and Similarity

链接: https://arxiv.org/abs/2409.04918
作者: Ren-Di Wu,Yu-Yen Lin,Huei-Fang Yang
关键词-EN: capture users’ intentions, Composed image retrieval, image search due, Composed image, users’ intentions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:Composed image retrieval (CIR), which formulates the query as a combination of a reference image and modified text, has emerged as a new form of image search due to its enhanced ability to capture users’ intentions. However, training a CIR model in a supervised manner typically requires labor-intensive collection of (reference image, text modifier, target image) triplets. While existing zero-shot CIR (ZS-CIR) methods eliminate the need for training on specific downstream datasets, they still require additional pretraining with large-scale image-text pairs. In this paper, we introduce a training-free approach for ZS-CIR. Our approach, Weighted Modality fusion and similarity for CIR (WeiMoCIR), operates under the assumption that image and text modalities can be effectively combined using a simple weighted average. This allows the query representation to be constructed directly from the reference image and text modifier. To further enhance retrieval performance, we employ multimodal large language models (MLLMs) to generate image captions for the database images and incorporate these textual captions into the similarity computation by combining them with image information using a weighted average. Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
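The weighted-average fusion described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed names (`weimocir_query`, `weimocir_score`, the `alpha`/`beta` weights, and random stand-in embeddings are all hypothetical), not the authors' implementation:

```python
import numpy as np

def l2_normalize(v):
    # Normalize embeddings so dot products become cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def weimocir_query(img_emb, txt_emb, alpha=0.5):
    # Training-free query construction: a simple weighted average of the
    # reference-image embedding and the text-modifier embedding.
    return l2_normalize(alpha * img_emb + (1 - alpha) * txt_emb)

def weimocir_score(query, db_img_emb, db_cap_emb, beta=0.5):
    # Similarity also mixes database image features with features of
    # MLLM-generated captions, again via a weighted average.
    sim_img = db_img_emb @ query
    sim_cap = db_cap_emb @ query
    return beta * sim_img + (1 - beta) * sim_cap

# Toy example with random unit embeddings standing in for a VLM encoder.
rng = np.random.default_rng(0)
q = weimocir_query(l2_normalize(rng.normal(size=8)),
                   l2_normalize(rng.normal(size=8)))
scores = weimocir_score(q,
                        l2_normalize(rng.normal(size=(5, 8))),
                        l2_normalize(rng.normal(size=(5, 8))))
print(scores.shape)  # (5,)
```

Retrieval would then simply rank database items by `scores`; the appeal of the method is that no component above requires any training.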

[CV-118] Activation Function Optimization Scheme for Image Classification

链接: https://arxiv.org/abs/2409.04915
作者: Abdur Rahman,Lu He,Haifeng Wang
关键词-EN: activation functions, Activation, functions, significant impact, Error Linear Unit
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Activation function has a significant impact on the dynamics, convergence, and performance of deep neural networks. The search for a consistent and high-performing activation function has always been a pursuit during deep learning model development. Existing state-of-the-art activation functions are manually designed with human expertise except for Swish. Swish was developed using a reinforcement learning-based search strategy. In this study, we propose an evolutionary approach for optimizing activation functions specifically for image classification tasks, aiming to discover functions that outperform current state-of-the-art options. Through this optimization framework, we obtain a series of high-performing activation functions denoted as Exponential Error Linear Unit (EELU). The developed activation functions are evaluated for image classification tasks from two perspectives: (1) five state-of-the-art neural network architectures, such as ResNet50, AlexNet, VGG16, MobileNet, and Compact Convolutional Transformer which cover computationally heavy to light neural networks, and (2) eight standard datasets, including CIFAR10, Imagenette, MNIST, Fashion MNIST, Beans, Colorectal Histology, CottonWeedID15, and TinyImageNet which cover from typical machine vision benchmark, agricultural image applications to medical image applications. Finally, we statistically investigate the generalization of the resultant activation functions developed through the optimization scheme. With a Friedman test, we conclude that the optimization scheme is able to generate activation functions that outperform the existing standard ones in 92.8% cases among 28 different cases studied, and -x·erf(e^{-x}) is found to be the best activation function for image classification generated by the optimization scheme.
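The best function reported in the abstract, -x·erf(e^{-x}), is straightforward to evaluate directly (a minimal sketch using the standard library; the function name `eelu` is an assumption for illustration):

```python
import math

def eelu(x):
    # The activation reported by the evolutionary search: -x * erf(exp(-x)).
    return -x * math.erf(math.exp(-x))

# Behavior at the tails: for large positive x, exp(-x) -> 0 so erf(...) -> 0
# and the output decays toward 0; for large negative x, erf(exp(-x)) -> 1
# and the function behaves like -x, growing linearly.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"eelu({x:+.1f}) = {eelu(x):+.4f}")
```

In a deep learning framework the same expression would be applied elementwise to tensors; only the scalar form is shown here.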

[CV-119] A Quantitative Approach for Evaluating Disease Focus and Interpretability of Deep Learning Models for Alzheimer's Disease Classification

链接: https://arxiv.org/abs/2409.04888
作者: Thomas Yu Chow Tam,Litian Liang,Ke Chen,Haohan Wang,Wei Wu
关键词-EN: Alzheimer Disease, shown significant potential, Deep learning, potential in Alzheimer, models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning (DL) models have shown significant potential in Alzheimer’s Disease (AD) classification. However, understanding and interpreting these models remains challenging, which hinders the adoption of these models in clinical practice. Techniques such as saliency maps have been proven effective in providing visual and empirical clues about how these models work, but there still remains a gap in understanding which specific brain regions DL models focus on and whether these brain regions are pathologically associated with AD. To bridge this gap, in this study, we developed a quantitative disease-focusing strategy to first enhance the interpretability of DL models using saliency maps and brain segmentations; then we propose a disease-focus (DF) score that quantifies how much a DL model focuses on brain areas relevant to AD pathology based on clinically known MRI-based pathological regions of AD. Using this strategy, we compared several state-of-the-art DL models, including a baseline 3D ResNet model, a pretrained MedicalNet model, and a MedicalNet with data augmentation to classify patients with AD vs. cognitively normal patients using MRI data; then we evaluated these models in terms of their abilities to focus on disease-relevant regions. Our results show interesting disease-focusing patterns with different models, particularly characteristic patterns with the pretrained models and data augmentation, and also provide insight into their classification performance. These results suggest that the approach we developed for quantitatively assessing the abilities of DL models to focus on disease-relevant regions may help improve interpretability of these models for AD classification and facilitate their adoption for AD diagnosis in clinical practice. The code is publicly available at this https URL.

[CV-120] Contrastive Disentangling: Fine-grained representation learning through multi-level contrastive learning without class priors

链接: https://arxiv.org/abs/2409.04867
作者: Houwang Jiang,Zhuxian Liu,Guodong Liu,Xiaolong Liu,Shihua Zhan
关键词-EN: Recent advancements, leverage class information, enhance feature extraction, clustering performance, class priors
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in unsupervised representation learning often leverage class information to enhance feature extraction and clustering performance. However, this reliance on class priors limits the applicability of such methods in real-world scenarios where class information is unavailable or ambiguous. In this paper, we propose Contrastive Disentangling (CD), a simple and effective framework that learns representations without any reliance on class priors. Our framework employs a multi-level contrastive learning strategy that combines instance-level and feature-level losses with a normalized entropy loss to learn semantically rich and fine-grained representations. Specifically, (1) the instance-level contrastive loss encourages the separation of feature representations for different samples, (2) the feature-level contrastive loss promotes independence among the feature head predictions, and (3) the normalized entropy loss encourages the feature heads to capture meaningful and prevalent attributes from the data. These components work together to enable CD to significantly outperform existing methods, as demonstrated by extensive experiments on benchmark datasets including CIFAR-10, CIFAR-100, STL-10, and ImageNet-10, particularly in scenarios where class priors are absent. The code is available at this https URL.
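The normalized entropy term in component (3) can be sketched as follows. This is a plausible interpretation for illustration only (the function name, the batch-mean formulation, and the toy inputs are assumptions; the paper's exact loss may differ):

```python
import numpy as np

def normalized_entropy_loss(head_probs, eps=1e-8):
    # head_probs: (batch, num_heads) activations of the feature heads.
    # Average head usage over the batch, then penalize low entropy of that
    # usage distribution; dividing by log(num_heads) normalizes the entropy
    # to [0, 1] so the term is comparable across head counts.
    mean_usage = head_probs.mean(axis=0)
    mean_usage = mean_usage / mean_usage.sum()
    entropy = -np.sum(mean_usage * np.log(mean_usage + eps))
    return 1.0 - entropy / np.log(head_probs.shape[1])

uniform = np.full((4, 8), 0.5)                       # all heads equally active
collapsed = np.zeros((4, 8)); collapsed[:, 0] = 1.0  # one head dominates
print(normalized_entropy_loss(uniform) < normalized_entropy_loss(collapsed))  # True
```

Minimizing such a term pushes the heads toward capturing prevalent attributes rather than letting a single head dominate, which is the stated goal of component (3).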

[CV-121] AdaptiveFusion: Adaptive Multi-Modal Multi-View Fusion for 3D Human Body Reconstruction

链接: https://arxiv.org/abs/2409.04851
作者: Anjun Chen,Xiangyu Wang,Zhi Xu,Kun Shi,Yan Qin,Yuchi Huo,Jiming Chen,Qi Ye
关键词-EN: Recent advancements, human body reconstruction, technology and deep, deep learning, learning have led
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in sensor technology and deep learning have led to significant progress in 3D human body reconstruction. However, most existing approaches rely on data from a specific sensor, which can be unreliable due to the inherent limitations of individual sensing modalities. On the other hand, existing multi-modal fusion methods generally require customized designs based on the specific sensor combinations or setups, which limits the flexibility and generality of these methods. Furthermore, conventional point-image projection-based and Transformer-based fusion networks are susceptible to the influence of noisy modalities and sensor poses. To address these limitations and achieve robust 3D human body reconstruction in various conditions, we propose AdaptiveFusion, a generic adaptive multi-modal multi-view fusion framework that can effectively incorporate arbitrary combinations of uncalibrated sensor inputs. By treating different modalities from various viewpoints as equal tokens, and with a handcrafted modality sampling module that leverages the inherent flexibility of Transformer models, AdaptiveFusion is able to cope with arbitrary numbers of inputs and accommodate noisy modalities with only a single training network. Extensive experiments on large-scale human datasets demonstrate the effectiveness of AdaptiveFusion in achieving high-quality 3D human body reconstruction in various environments. In addition, our method achieves superior accuracy compared to state-of-the-art fusion methods.

[CV-122] Deep Computer Vision for Solar Physics Big Data: Opportunities and Challenges

链接: https://arxiv.org/abs/2409.04850
作者: Bo Shen,Marco Marena,Chenyang Li,Qin Li,Haodi Jiang,Mengnan Du,Jiajun Xu,Haimin Wang
关键词-EN: Solar Dynamics Observatory, Parker Solar Probe, Inouye Solar Telescope, made solar physics, solar physics enter
类目: Computer Vision and Pattern Recognition (cs.CV); Solar and Stellar Astrophysics (astro-ph.SR)
*备注:

点击查看摘要

Abstract:With recent missions such as advanced space-based observatories like the Solar Dynamics Observatory (SDO) and Parker Solar Probe, and ground-based telescopes like the Daniel K. Inouye Solar Telescope (DKIST), the volume, velocity, and variety of data have made solar physics enter a transformative era as solar physics big data (SPBD). With the recent advancement of deep computer vision, there are new opportunities in SPBD for tackling problems that were previously unsolvable. However, there are new challenges arising due to the inherent characteristics of SPBD and deep computer vision models. This vision paper presents an overview of the different types of SPBD, explores new opportunities in applying deep computer vision to SPBD, highlights the unique challenges, and outlines several potential future research directions.

[CV-123] Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

链接: https://arxiv.org/abs/2409.04847
作者: Jiaxin Cheng,Zixu Zhao,Tong He,Tianjun Xiao,Yicong Zhou,Zheng Zhang
关键词-EN: Recent advancements, image editing, video editing, enabling a wide, completion and video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in generative models have significantly enhanced their capacity for image generation, enabling a wide range of applications such as image editing, completion and video editing. A specialized area within generative modeling is layout-to-image (L2I) generation, where predefined layouts of objects guide the generative process. In this study, we introduce a novel regional cross-attention module tailored to enrich layout-to-image generation. This module notably improves the representation of layout regions, particularly in scenarios where existing methods struggle with highly complex and detailed textual descriptions. Moreover, while current open-vocabulary L2I methods are trained in an open-set setting, their evaluations often occur in closed-set environments. To bridge this gap, we propose two metrics to assess L2I performance in open-vocabulary scenarios. Additionally, we conduct a comprehensive user study to validate the consistency of these metrics with human preferences.

[CV-124] POINTS: Improving Your Vision-language Model with Affordable Strategies

链接: https://arxiv.org/abs/2409.04828
作者: Yuan Liu,Zhongyin Zhao,Ziyuan Zhuang,Le Tian,Xiao Zhou,Jie Zhou
关键词-EN: made significant strides, optical character recognition, significant strides, excelling in tasks, geometric problem-solving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
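Contribution (2), perplexity-based filtering of pre-training data, can be sketched as a simple ranking step. Everything below is a hypothetical stand-in (the function name, the fake loss table, and the keep fraction are assumptions; in practice perplexity would come from a reference language model):

```python
import math

def filter_by_perplexity(samples, ppl_fn, keep_fraction=0.5):
    # Rank pretraining samples by perplexity under a reference model and
    # keep the lowest-perplexity fraction, as a rough proxy for cleaner,
    # more learnable data.
    scored = sorted(samples, key=ppl_fn)
    return scored[: max(1, int(len(scored) * keep_fraction))]

# Toy stand-in: perplexity derived from a fake per-sample log-loss table.
fake_loss = {"clean caption": 1.2, "noisy %$# text": 4.5, "ok-ish caption": 2.0}
ppl = lambda s: math.exp(fake_loss[s])

kept = filter_by_perplexity(list(fake_loss), ppl, keep_fraction=2 / 3)
print(kept)  # ['clean caption', 'ok-ish caption']
```

The abstract's curated 1M-sample set corresponds to applying this kind of selection at scale, keeping only the lowest-perplexity portion of the candidate pool.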

[CV-125] Metadata augmented deep neural networks for wild animal classification

链接: https://arxiv.org/abs/2409.04825
作者: Aslak Tøn,Ammar Ahmed,Ali Shariq Imran,Mohib Ullah,R. Muhammad Atif Azad
关键词-EN: Camera trap imagery, Camera trap, contemporary wildlife surveillance, enabling researchers, trap imagery
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Camera trap imagery has become an invaluable asset in contemporary wildlife surveillance, enabling researchers to observe and investigate the behaviors of wild animals. While existing methods rely solely on image data for classification, this may not suffice in cases of suboptimal animal angles, lighting, or image quality. This study introduces a novel approach that enhances wild animal classification by combining specific metadata (temperature, location, time, etc) with image data. Using a dataset focused on the Norwegian climate, our models show an accuracy increase from 98.4% to 98.9% compared to existing methods. Notably, our approach also achieves high accuracy with metadata-only classification, highlighting its potential to reduce reliance on image quality. This work paves the way for integrated systems that advance wildlife classification technology.
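The metadata-plus-image idea reduces to encoding the metadata sensibly and fusing it with image features. A minimal sketch, with assumed encodings (cyclic hour/month features, a temperature scale, and concatenation fusion are illustrative choices, not necessarily the paper's):

```python
import numpy as np

def encode_metadata(temp_c, hour, month):
    # Hypothetical encoding: scale temperature, and wrap cyclic fields
    # (hour of day, month) onto the unit circle so 23:00 and 00:00 end
    # up close together rather than maximally distant.
    return np.array([
        temp_c / 40.0,
        np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24),
        np.sin(2 * np.pi * month / 12), np.cos(2 * np.pi * month / 12),
    ])

def fuse(image_features, metadata):
    # Late fusion by concatenation; a classifier head would consume this.
    return np.concatenate([image_features, metadata])

img_feat = np.zeros(128)             # stand-in for CNN image features
meta = encode_metadata(-5.0, 23, 1)  # a cold January night
fused = fuse(img_feat, meta)
print(fused.shape)  # (133,)
```

The abstract's metadata-only result corresponds to feeding just `meta` to a classifier, which is why quality-degraded images hurt this setup less.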

[CV-126] FreeAugment: Data Augmentation Search Across All Degrees of Freedom ECCV2024

链接: https://arxiv.org/abs/2409.04820
作者: Tom Bekor,Niv Nayman,Lihi Zelnik-Manor
关键词-EN: automatic data augmentation, data augmentation search, Data augmentation, neural networks, integral part
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Data augmentation has become an integral part of deep learning, as it is known to improve the generalization capabilities of neural networks. Since the most effective set of image transformations differs between tasks and domains, automatic data augmentation search aims to alleviate the extreme burden of manually finding the optimal image transformations. However, current methods are not able to jointly optimize all degrees of freedom: (1) the number of transformations to be applied, their (2) types, (3) order, and (4) magnitudes. Many existing methods risk picking the same transformation more than once, limit the search to two transformations only, or search for the number of transformations exhaustively or iteratively in a myopic manner. Our approach, FreeAugment, is the first to achieve global optimization of all four degrees of freedom simultaneously, using a fully differentiable method. It efficiently learns the number of transformations and a probability distribution over their permutations, inherently refraining from redundant repetition while sampling. Our experiments demonstrate that this joint learning of all degrees of freedom significantly improves performance, achieving state-of-the-art results on various natural image benchmarks and beyond across other domains. Project page at this https URL

[CV-127] Top-GAP: Integrating Size Priors in CNNs for more Interpretability, Robustness and Bias Mitigation ECCV2024

链接: https://arxiv.org/abs/2409.04819
作者: Lars Nieradzik,Henrike Stephani,Janis Keuper
关键词-EN: convolutional neural networks, paper introduces Top-GAP, Effective Receptive Field, paper introduces, regularization technique
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: eXCV Workshop at ECCV 2024

点击查看摘要

Abstract:This paper introduces Top-GAP, a novel regularization technique that enhances the explainability and robustness of convolutional neural networks. By constraining the spatial size of the learned feature representation, our method forces the network to focus on the most salient image regions, effectively reducing background influence. Using adversarial attacks and the Effective Receptive Field, we show that Top-GAP directs more attention towards object pixels rather than the background. This leads to enhanced interpretability and robustness. We achieve over 50% robust accuracy on CIFAR-10 under PGD attacks (ε = 8/255, 20 iterations) while maintaining the original clean accuracy. Furthermore, we see increases of up to 5% accuracy against distribution shifts. Our approach also yields more precise object localization, as evidenced by up to 25% improvement in Intersection over Union (IOU) compared to methods like GradCAM and Recipro-CAM.
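One simple way to constrain the spatial support of a pooled representation is top-k pooling instead of full global average pooling. The sketch below illustrates that general idea only; whether Top-GAP is implemented this way is not stated in the abstract, and the function name and toy feature map are assumptions:

```python
import numpy as np

def top_gap(feature_map, k):
    # Average only the k highest activations per channel, instead of
    # pooling over every spatial location. Background locations with low
    # activations then contribute nothing to the pooled feature.
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)
    topk = np.sort(flat, axis=1)[:, -k:]
    return topk.mean(axis=1)

fmap = np.zeros((2, 4, 4))
fmap[0, 0, 0] = 8.0  # one salient location in channel 0
pooled = top_gap(fmap, k=2)
print(pooled)  # [4. 0.]
```

With plain GAP the salient activation in channel 0 would be diluted over all 16 locations (8/16 = 0.5); restricting pooling to the top k locations keeps it prominent, which matches the paper's stated goal of reducing background influence.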

[CV-128] SSFam: Scribble Supervised Salient Object Detection Family

链接: https://arxiv.org/abs/2409.04817
作者: Zhengyi Liu,Sheng Deng,Xinrui Wang,Linbo Wang,Xianyong Fang,Bin Tang
关键词-EN: salient object detection, supervised salient object, sparse scribble labels, object detection, constructs segmentation ability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by TMM 2024

点击查看摘要

Abstract:Scribble supervised salient object detection (SSSOD) constructs segmentation ability of attractive objects from surroundings under the supervision of sparse scribble labels. For better segmentation, depth and thermal infrared modalities serve as supplements to RGB images in complex scenes. Existing methods specifically design various feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image inputs respectively, leading to a flood of similar models. As the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and prompt interactive capability, we propose an SSSOD family based on SAM, named SSFam, for combination inputs with different modalities. Firstly, different modal-aware modulators are designed to attain modal-specific knowledge which cooperates with modal-agnostic information extracted from the frozen SAM encoder for a better feature ensemble. Secondly, a siamese decoder is tailored to bridge the gap between training with scribble prompts and testing with no prompt for stronger decoding ability. Our model demonstrates remarkable performance across combinations of different modalities, sets a new state of the art among scribble supervised methods, and comes close to fully supervised ones. this https URL

[CV-129] Power Line Aerial Image Restoration under Adverse Weather: Datasets and Baselines

链接: https://arxiv.org/abs/2409.04812
作者: Sai Yang,Bin Hu,Bojun Zhou,Fan Liu,Xiaoxin Wu,Xinsong Zhang,Juping Gu,Jun Zhou
关键词-EN: Line Autonomous Inspection, Power Line Aerial, Line Aerial Image, Power Line Autonomous, Autonomous Inspection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Power Line Autonomous Inspection (PLAI) plays a crucial role in the construction of smart grids due to its great advantages of low cost, high efficiency, and safe operation. PLAI is completed by accurately detecting the electrical components and defects in the aerial images captured by Unmanned Aerial Vehicles (UAVs). However, the visible quality of aerial images is inevitably degraded by adverse weather like haze, rain, or snow, which are found to drastically decrease the detection accuracy in our research. To circumvent this problem, we propose a new task of Power Line Aerial Image Restoration under Adverse Weather (PLAIR-AW), which aims to recover clean and high-quality images from degraded images with bad weather thus improving detection performance for PLAI. In this context, we are the first to release numerous corresponding datasets, namely, HazeCPLID, HazeTTPLA, HazeInsPLAD for power line aerial image dehazing, RainCPLID, RainTTPLA, RainInsPLAD for power line aerial image deraining, SnowCPLID, SnowInsPLAD for power line aerial image desnowing, which are synthesized upon the public power line aerial image datasets of CPLID, TTPLA, InsPLAD following the mathematical models. Meanwhile, we select numerous state-of-the-art methods from image restoration community as the baseline methods for PLAIR-AW. At last, we conduct large-scale empirical experiments to evaluate the performance of baseline methods on the proposed datasets. The proposed datasets and trained models are available at this https URL.

[CV-130] SpotActor: Training-Free Layout-Controlled Consistent Image Generation

链接: https://arxiv.org/abs/2409.04801
作者: Jiahao Wang,Caixia Yan,Weizhan Zhang,Haonan Lin,Mengmeng Wang,Guang Dai,Tieliang Gong,Hao Sun,Jingdong Wang
关键词-EN: diffusion models significantly, models significantly enhance, high-fidelity image generation, diffusion models, models significantly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. For these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and thus propose a training-free pipeline, SpotActor, which features a layout-conditioned backward update stage and a consistent forward sampling stage. In the backward stage, we innovate a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. While in the forward stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate the performance, we present ActorBench, a specified benchmark with hundreds of reasonable prompt-box pairs stemming from object detection datasets. Comprehensive experiments are conducted to demonstrate the effectiveness of our method. The results prove that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity and background diversity.

[CV-131] Enhancing Outlier Knowledge for Few-Shot Out-of-Distribution Detection with Extensible Local Prompts

链接: https://arxiv.org/abs/2409.04796
作者: Fanhu Zeng,Zhen Cheng,Fei Zhu,Xu-Yao Zhang
关键词-EN: OOD detection, aiming to distinguish, practical scenarios, enhancing OOD detection, gained prominence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection, aiming to distinguish outliers from known categories, has gained prominence in practical scenarios. Recently, the advent of vision-language models (VLM) has heightened interest in enhancing OOD detection for VLM through few-shot tuning. However, existing methods mainly focus on optimizing global prompts, ignoring refined utilization of local information with regard to outliers. Motivated by this, we freeze global prompts and introduce a novel coarse-to-fine tuning paradigm to emphasize regional enhancement with local prompts. Our method comprises two integral components: global prompt guided negative augmentation and local prompt enhanced regional regularization. The former utilizes frozen, coarse global prompts as guiding cues to incorporate negative augmentation, thereby leveraging local outlier knowledge. The latter employs trainable local prompts and a regional regularization to capture local information effectively, aiding in outlier identification. We also propose regional-related metric to empower the enrichment of OOD detection. Moreover, since our approach explores enhancing local prompts only, it can be seamlessly integrated with trained global prompts during inference to boost the performance. Comprehensive experiments demonstrate the effectiveness and potential of our method. Notably, our method reduces average FPR95 by 5.17% against state-of-the-art method in 4-shot tuning on challenging ImageNet-1k dataset, even outperforming 16-shot results of previous methods.

[CV-132] Medical Image Segmentation via Single-Source Domain Generalization with Random Amplitude Spectrum Synthesis

链接: https://arxiv.org/abs/2409.04768
作者: Qiang Qiao,Wenyu Wang,Meixia Qu,Kun Su,Bin Jiang,Qiang Guo
关键词-EN: Amplitude Spectrum Synthesis, Random Amplitude Spectrum, clinical datasets, shifts in clinical, Amplitude Spectrum
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 4 figures, Medical Image Computing and Computer Assisted Intervention 2024

点击查看摘要

Abstract:The field of medical image segmentation is challenged by domain generalization (DG) due to domain shifts in clinical datasets. The DG challenge is exacerbated by the scarcity of medical data and privacy concerns. Traditional single-source domain generalization (SSDG) methods primarily rely on stacking data augmentation techniques to minimize domain discrepancies. In this paper, we propose Random Amplitude Spectrum Synthesis (RASS) as a training augmentation for medical images. RASS enhances model generalization by simulating distribution changes from a frequency perspective. This strategy introduces variability by applying amplitude-dependent perturbations to ensure broad coverage of potential domain variations. Furthermore, we propose random mask shuffle and reconstruction components, which can enhance the ability of the backbone to process structural information and increase resilience to intra- and cross-domain changes. The proposed Random Amplitude Spectrum Synthesis for Single-Source Domain Generalization (RAS⁴DG) is validated on 3D fetal brain images and 2D fundus photography, and achieves an improved DG segmentation performance compared to other SSDG models.
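The amplitude-perturbation idea can be sketched in the frequency domain with an FFT: keep the phase spectrum (which carries structure) and randomly rescale the amplitude spectrum. This is a minimal illustration under assumed names and a simple uniform perturbation; the paper's actual scheme and its mask-shuffle/reconstruction components are not reproduced here:

```python
import numpy as np

def random_amplitude_spectrum_synthesis(img, max_perturb=0.3, rng=None):
    # Frequency-space augmentation sketch: preserve phase, randomly
    # rescale amplitude to simulate appearance shifts across domains.
    rng = rng or np.random.default_rng()
    spec = np.fft.fft2(img)
    amplitude, phase = np.abs(spec), np.angle(spec)
    scale = 1.0 + rng.uniform(-max_perturb, max_perturb, size=amplitude.shape)
    perturbed = amplitude * scale * np.exp(1j * phase)
    return np.real(np.fft.ifft2(perturbed))

rng = np.random.default_rng(0)
img = rng.random((64, 64))  # stand-in for a grayscale medical image slice
aug = random_amplitude_spectrum_synthesis(img, rng=rng)
print(aug.shape)  # (64, 64)
```

Because anatomical structure lives mostly in the phase spectrum, perturbing only the amplitude changes appearance statistics while leaving the segmentation targets largely intact, which is what makes this usable as a training augmentation.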

[CV-133] Cross-Dataset Gaze Estimation by Evidential Inter-intra Fusion ACM-MM2024

链接: https://arxiv.org/abs/2409.04766
作者: Shijing Wang,Yaping Huang,Jun Xie,YiTian,Feng Chen,Zhepeng Wang
关键词-EN: environments remains challenging, Achieving accurate, diverse environments remains, reliable gaze predictions, remains challenging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This paper was previously submitted to ACM MM 2024

点击查看摘要

Abstract:Achieving accurate and reliable gaze predictions in complex and diverse environments remains challenging. Fortunately, it is straightforward to access diverse gaze datasets in real-world applications. We discover that training these datasets jointly can significantly improve the generalization of gaze estimation, which is overlooked in previous works. However, due to the inherent distribution shift across different datasets, simply mixing multiple datasets decreases the performance in the original domain despite gaining better generalization abilities. To address the problem of “cross-dataset gaze estimation”, we propose a novel Evidential Inter-intra Fusion (EIF) framework for training a cross-dataset model that performs well across all source and unseen domains. Specifically, we build independent single-dataset branches for various datasets, where the data space is partitioned into overlapping subspaces within each dataset for local regression, and further create a cross-dataset branch to integrate the generalizable features from single-dataset branches. Furthermore, evidential regressors based on the Normal and Inverse-Gamma (NIG) distribution are designed to additionally provide uncertainty estimation apart from predicting gaze. Building upon this foundation, our proposed framework achieves both intra-evidential fusion among multiple local regressors within each dataset and inter-evidential fusion among multiple branches by a Mixture of Normal Inverse-Gamma (MoNIG) distribution. Experiments demonstrate that our method consistently achieves notable improvements in both source domains and unseen domains.

[CV-134] Training-Free Point Cloud Recognition Based on Geometric and Semantic Information Fusion

链接: https://arxiv.org/abs/2409.04760
作者: Yan Chen,Di Huang,Zhichao Liao,Xi Cheng,Xinghui Li,Lone Zeng
关键词-EN: increasingly popular due, time costs, point cloud recognition, trend of employing, significant reduction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The trend of employing training-free methods for point cloud recognition is becoming increasingly popular due to its significant reduction in computational resources and time costs. However, existing approaches are limited as they typically extract either geometric or semantic features. To address this limitation, we propose a novel method that integrates both geometric and semantic features, thereby enhancing the comprehensive understanding of point clouds. For the geometric branch, we adopt a non-parametric strategy to extract geometric features. In the semantic branch, we leverage a model pre-trained through contrastive learning and aligned with text features to obtain semantic features. Experimental results demonstrate that our method outperforms existing state-of-the-art training-free approaches on several popular benchmark datasets, including ModelNet and ScanObjectNN.

[CV-135] Adaptative Context Normalization: A Boost for Deep Learning in Image Processing

链接: https://arxiv.org/abs/2409.04759
作者: Bilal Faye,Hanane Azzag,Mustapha Lebbah,Djamel Bouchaffra
关键词-EN: Deep Neural network, Neural network learning, Deep Neural, faces major challenges, major challenges related
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2403.16798

点击查看摘要

Abstract:Deep Neural network learning for image processing faces major challenges related to changes in distribution across layers, which disrupt model convergence and performance. Activation normalization methods, such as Batch Normalization (BN), have revolutionized this field, but they rely on the simplified assumption that data distribution can be modelled by a single Gaussian distribution. To overcome these limitations, Mixture Normalization (MN) introduced an approach based on a Gaussian Mixture Model (GMM), assuming multiple components to model the data. However, this method entails substantial computational requirements associated with the use of the Expectation-Maximization algorithm to estimate the parameters of each Gaussian component. To address this issue, we introduce Adaptative Context Normalization (ACN), a novel supervised approach that introduces the concept of "context", which groups together a set of data with similar characteristics. Data belonging to the same context are normalized using the same parameters, enabling local representation based on contexts. For each context, the normalization parameters are learned during the backpropagation phase, just like the model weights. ACN not only ensures speed, convergence, and superior performance compared to BN and MN but also presents a fresh perspective that underscores its particular efficacy in the field of image processing.
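The core idea of ACN, normalizing each sample with statistics of its own "context" rather than of the whole batch, can be sketched in a few lines. This toy version computes per-context mean/std on the fly; the learnable per-context scale and shift of the actual method are omitted, and all names are illustrative.

```python
from collections import defaultdict


def context_normalize(values, contexts, eps=1e-5):
    """Standardize each value with the mean/std of its own context group.

    Toy version of the context-normalization idea: samples sharing a
    context id are normalized together. The learned per-context
    scale/shift of the real method is omitted.
    """
    groups = defaultdict(list)
    for v, c in zip(values, contexts):
        groups[c].append(v)
    stats = {}
    for c, vs in groups.items():
        m = sum(vs) / len(vs)
        var = sum((v - m) ** 2 for v in vs) / len(vs)
        stats[c] = (m, (var + eps) ** 0.5)  # (mean, std) per context
    return [(v - stats[c][0]) / stats[c][1] for v, c in zip(values, contexts)]
```

Two samples from different contexts can thus end up with the same normalized value even when their raw magnitudes differ greatly, which is exactly the local-representation effect the abstract describes.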

[CV-136] SGSeg: Enabling Text-free Inference in Language-guided Segmentation of Chest X-rays via Self-guidance

链接: https://arxiv.org/abs/2409.04758
作者: Shuchang Ye,Mingyuan Meng,Mingjian Li,Dagan Feng,Jinman Kim
关键词-EN: chest X-rays, infected areas, pivotal for facilitating, facilitating the accurate, accurate delineation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This preprint has not undergone peer review or any post-submission improvments or corrections

点击查看摘要

Abstract:Segmentation of infected areas in chest X-rays is pivotal for facilitating the accurate delineation of pulmonary structures and pathological anomalies. Recently, multi-modal language-guided image segmentation methods have emerged as a promising solution for chest X-rays where the clinical text reports, depicting the assessment of the images, are used as guidance. Nevertheless, existing language-guided methods require clinical reports alongside the images, and hence, they are not applicable for use in image segmentation in a decision support context, but rather limited to retrospective image analysis after clinical reporting has been completed. In this study, we propose a self-guided segmentation framework (SGSeg) that leverages language guidance for training (multi-modal) while enabling text-free inference (uni-modal), which is the first that enables text-free inference in language-guided segmentation. We exploit the critical location information of both pulmonary and pathological structures depicted in the text reports and introduce a novel localization-enhanced report generation (LERG) module to generate clinical reports for self-guidance. Our LERG integrates an object detector and a location-based attention aggregator, weakly-supervised by a location-aware pseudo-label extraction module. Extensive experiments on a well-benchmarked QaTa-COV19 dataset demonstrate that our SGSeg achieved superior performance than existing uni-modal segmentation methods and closely matched the state-of-the-art performance of multi-modal language-guided segmentation methods.

[CV-137] Fisheye-GS: Lightweight and Extensible Gaussian Splatting Module for Fisheye Cameras

链接: https://arxiv.org/abs/2409.04751
作者: Zimu Liao,Siyan Chen,Rong Fu,Yi Wang,Zhongling Su,Hao Luo,Linning Xu,Bo Dai,Hengjie Li,Zhilin Pei,Xingcheng Zhang
关键词-EN: Gaussian Splatting, garnered attention, high fidelity, fidelity and real-time, Recently
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has garnered attention for its high fidelity and real-time rendering. However, adapting 3DGS to different camera models, particularly fisheye lenses, poses challenges due to the unique 3D to 2D projection calculation. Additionally, there are inefficiencies in the tile-based splatting, especially for the extreme curvature and wide field of view of fisheye lenses, which are crucial for its broader real-life applications. To tackle these challenges, we introduce Fisheye-GS. This innovative method recalculates the projection transformation and its gradients for fisheye cameras. Our approach can be seamlessly integrated as a module into other efficient 3D rendering methods, emphasizing its extensibility, lightweight nature, and modular design. Since we only modified the projection component, it can also be easily adapted for use with different camera models. Compared to methods that train after undistortion, our approach demonstrates a clear improvement in visual quality.

[CV-138] Training-Free Style Consistent Image Synthesis with Condition and Mask Guidance in E-Commerce

链接: https://arxiv.org/abs/2409.04750
作者: Guandong Li
关键词-EN: achieved excellent results, diffusion models, common task, largely based, based on diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generating style-consistent images is a common task in the e-commerce field, and current methods are largely based on diffusion models, which have achieved excellent results. This paper introduces the concept of the QKV (query/key/value) level, referring to modifications in the attention maps (self-attention and cross-attention) when integrating UNet with image conditions. Without disrupting the product’s main composition in e-commerce images, we aim to use a training-free method guided by pre-set conditions. This involves using shared KV to enhance similarity in cross-attention and generating mask guidance from the attention map to cleverly direct the generation of style-consistent images. Our method has shown promising results in practical applications.
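The "shared KV" trick amounts to letting the queries of the image being generated attend over the keys/values computed from a style reference, so appearance statistics flow across images. A minimal single-query sketch follows (plain Python, illustrative names; not the paper's UNet integration):

```python
import math


def attention(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def shared_kv_attention(q_generated, k_reference, v_reference):
    """Shared-KV sketch: queries from the image being generated attend
    over keys/values of a style reference, pulling its appearance in.
    Names are illustrative, not the paper's API."""
    return attention(q_generated, k_reference, v_reference)
```

In the actual method the reference K/V would replace (or be concatenated with) the generated image's own K/V inside the UNet's attention layers; this sketch only shows the information flow.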

[CV-139] Explicit Mutual Information Maximization for Self-Supervised Learning

链接: https://arxiv.org/abs/2409.04747
作者: Lele Chang,Peilin Liu,Qinghai Guo,Fei Wen
关键词-EN: self-supervised learning, extensively studied, SSL, Recently, MIM
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, self-supervised learning (SSL) has been extensively studied. Theoretically, mutual information maximization (MIM) is an optimal criterion for SSL, with a strong theoretical foundation in information theory. However, it is difficult to directly apply MIM in SSL since the data distribution is not analytically available in applications. In practice, many existing methods can be viewed as approximate implementations of the MIM criterion. This work shows that, based on the invariance property of MI, explicit MI maximization can be applied to SSL under a generic distribution assumption, i.e., a relaxed condition of the data distribution. We further illustrate this by analyzing the generalized Gaussian distribution. Based on this result, we derive a loss function based on the MIM criterion using only second-order statistics. We implement the new loss for SSL and demonstrate its effectiveness via extensive experiments.
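A loss "using only second-order statistics" can be illustrated with a covariance-based objective that drives per-dimension variances to one and cross-covariances to zero. This is a generic sketch in the spirit of such criteria, not the paper's derived loss:

```python
def second_order_ssl_loss(z):
    """Covariance-based objective: push per-dimension variance to one and
    off-diagonal covariance to zero. A generic second-order-statistics
    sketch, not the paper's exact derivation.

    z: list of embedding rows (n samples x d dimensions).
    """
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    # biased (1/n) sample covariance matrix
    cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in z) / n
            for j in range(d)] for i in range(d)]
    var_term = sum((cov[i][i] - 1.0) ** 2 for i in range(d))
    dec_term = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return var_term + dec_term
```

The loss vanishes exactly when the embedding dimensions are standardized and decorrelated, which is the kind of collapse-preventing constraint second-order SSL objectives impose.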

[CV-140] Enhancing Image Authenticity Detection: Swin Transformers and Color Frame Analysis for CGI vs. Real Images

链接: https://arxiv.org/abs/2409.04742
作者: Preeti Mehta,Aman Sagar,Suchi Kumari
关键词-EN: authentic images captured, Swin Transformers, making them increasingly, digital cameras, rapid advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 5 figures, 3 tables

点击查看摘要

Abstract:The rapid advancements in computer graphics have greatly enhanced the quality of computer-generated images (CGI), making them increasingly indistinguishable from authentic images captured by digital cameras (ADI). This indistinguishability poses significant challenges, especially in an era of widespread misinformation and digitally fabricated content. This research proposes a novel approach to classify CGI and ADI using Swin Transformers and preprocessing techniques involving RGB and CbCrY color frame analysis. By harnessing the capabilities of Swin Transformers, our method forgoes handcrafted features, instead relying on raw pixel data for model training. This approach achieves state-of-the-art accuracy while offering substantial improvements in processing speed and robustness against joint image manipulations such as noise addition, blurring, and JPEG compression. Our findings highlight the potential of Swin Transformers combined with advanced color frame analysis for effective and efficient image authenticity detection.
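The CbCrY preprocessing boils down to the standard RGB-to-YCbCr color transform. A per-pixel sketch using the JPEG/BT.601 full-range coefficients follows (an assumption on my part; the paper may use a different variant of the transform):

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr using the JPEG/BT.601
    full-range coefficients (assumed variant of the CbCrY preprocessing)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b          # luma
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red-difference chroma
    return y, cb, cr
```

Separating luma from chroma in this way exposes color statistics that differ between rendered and camera-captured images, which is presumably why the pipeline analyzes both color frames.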

[CV-141] Swin Transformer for Robust Differentiation of Real and Synthetic Images: Intra- and Inter-Dataset Analysis

链接: https://arxiv.org/abs/2409.04734
作者: Preetu Mehta,Aman Sagar,Suchi Kumari
关键词-EN: RGB color space, Swin Transformer-based model, RGB color, distinguishing computer-generated imagery, Swin Transformer-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 4 figures, 3 tables

点击查看摘要

Abstract:**Purpose** This study aims to address the growing challenge of distinguishing computer-generated imagery (CGI) from authentic digital images in the RGB color space. Given the limitations of existing classification methods in handling the complexity and variability of CGI, this research proposes a Swin Transformer-based model for accurate differentiation between natural and synthetic images. **Methods** The proposed model leverages the Swin Transformer’s hierarchical architecture to capture local and global features crucial for distinguishing CGI from natural images. The model’s performance was evaluated through intra-dataset and inter-dataset testing across three distinct datasets: CiFAKE, JSSSTU, and Columbia. The datasets were tested individually (D1, D2, D3) and in combination (D1+D2+D3) to assess the model’s robustness and domain generalization capabilities. **Results** The Swin Transformer-based model demonstrated high accuracy, consistently achieving a range of 97-99% across all datasets and testing scenarios. These results confirm the model’s effectiveness in detecting CGI, showcasing its robustness and reliability in both intra-dataset and inter-dataset evaluations. **Conclusion** The findings of this study highlight the Swin Transformer model’s potential as an advanced tool for digital image forensics, particularly in distinguishing CGI from natural images. The model’s strong performance across multiple datasets indicates its capability for domain generalization, making it a valuable asset in scenarios requiring precise and reliable image classification.

[CV-142] VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery

链接: https://arxiv.org/abs/2409.04732
作者: Mohammadmahdi Honarmand,Muhammad Abdullah Jamal,Omid Mohareri
关键词-EN: pre-training framework designed, framework designed specifically, laparoscopic surgery, designed specifically, specifically for robotic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce VidLPRO, a novel video-language (VL) pre-training framework designed specifically for robotic and laparoscopic surgery. While existing surgical VL models primarily rely on contrastive learning, we propose a more comprehensive approach to capture the intricate temporal dynamics and align video with language. VidLPRO integrates video-text contrastive learning, video-text matching, and masked language modeling objectives to learn rich VL representations. To support this framework, we present GenSurg+, a carefully curated dataset derived from GenSurgery, comprising 17k surgical video clips paired with captions generated by GPT-4 using transcripts extracted by the Whisper model. This dataset addresses the need for large-scale, high-quality VL data in the surgical domain. Extensive experiments on benchmark datasets, including Cholec80 and AutoLaparo, demonstrate the efficacy of our approach. VidLPRO achieves state-of-the-art performance in zero-shot surgical phase recognition, significantly outperforming existing surgical VL models such as SurgVLP and HecVL. Our model demonstrates improvements of up to 21.5% in accuracy and 15.7% in F1 score, setting a new benchmark in the field. Notably, VidLPRO exhibits robust performance even with single-frame inference, while effectively scaling with increased temporal context. Ablation studies reveal the impact of frame sampling strategies on model performance and computational efficiency. These results underscore VidLPRO’s potential as a foundation model for surgical video understanding.

[CV-143] Cross-Organ Domain Adaptive Neural Network for Pancreatic Endoscopic Ultrasound Image Segmentation

链接: https://arxiv.org/abs/2409.04718
作者: ZhiChao Yan,Hui Xue,Yi Zhu,Bin Xiao,Hao Yuan
关键词-EN: effective diagnosis, pancreatic EUS images, EUS images, universal network, crisp EUS images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate segmentation of lesions in pancreatic endoscopic ultrasound (EUS) images is crucial for effective diagnosis and treatment. However, the collection of enough crisp EUS images for effective diagnosis is arduous. Recently, domain adaptation (DA) has been employed to address these challenges by leveraging related knowledge from other domains. Most DA methods only focus on multi-view representations of the same organ, which makes it still tough to clearly depict the tumor lesion area with limited semantic information. Although transferring homogeneous similarity from different organs could benefit the issue, there is a lack of relevant work due to the enormous domain gap between them. To address these challenges, we propose the Cross-Organ Tumor Segmentation Networks (COTS-Nets), consisting of a universal network and an auxiliary network. The universal network utilizes boundary loss to learn common boundary information of different tumors, enabling accurate delineation of tumors in EUS despite limited and low-quality data. Simultaneously, we incorporate consistency loss in the universal network to align the prediction of pancreatic EUS with tumor boundaries from other organs to mitigate the domain gap. To further reduce the cross-organ domain gap, the auxiliary network integrates multi-scale features from different organs, aiding the universal network in acquiring domain-invariant knowledge. Systematic experiments demonstrate that COTS-Nets significantly improves the accuracy of pancreatic cancer diagnosis. Additionally, we developed the Pancreatic Cancer Endoscopic Ultrasound (PCEUS) dataset, comprising 501 pathologically confirmed pancreatic EUS images, to facilitate model development.

[CV-144] Unleashing the Power of Generic Segmentation Models: A Simple Baseline for Infrared Small Target Detection ACM-MM’24

链接: https://arxiv.org/abs/2409.04714
作者: Mingjin Zhang,Chi Zhang,Qiming Zhang,Yunsong Li,Xinbo Gao,Jing Zhang
关键词-EN: Recent advancements, advancements in deep, deep learning, learning have greatly, greatly advanced
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACM MM’24

点击查看摘要

Abstract:Recent advancements in deep learning have greatly advanced the field of infrared small object detection (IRSTD). Despite their remarkable success, a notable gap persists between these IRSTD methods and generic segmentation approaches in natural image domains. This gap primarily arises from the significant modality differences and the limited availability of infrared data. In this study, we aim to bridge this divergence by investigating the adaptation of generic segmentation models, such as the Segment Anything Model (SAM), to IRSTD tasks. Our investigation reveals that many generic segmentation models can achieve comparable performance to state-of-the-art IRSTD methods. However, their full potential in IRSTD remains untapped. To address this, we propose a simple, lightweight, yet effective baseline model for segmenting small infrared objects. Through appropriate distillation strategies, we empower smaller student models to outperform state-of-the-art methods, even surpassing fine-tuned teacher results. Furthermore, we enhance the model’s performance by introducing a novel query design comprising dense and sparse queries to effectively encode multi-scale features. Through extensive experimentation across four popular IRSTD datasets, our model demonstrates significantly improved performance in both accuracy and throughput compared to existing approaches, surpassing SAM and Semantic-SAM by over 14 IoU on NUDT and 4 IoU on IRSTD1k. The source code and models will be released at this https URL.

[CV-145] Dual-stream Feature Augmentation for Domain Generalization

链接: https://arxiv.org/abs/2409.04699
作者: Shanshan Wang,ALuSi,Xun Yang,Ke Xu,Huibin Tan,Xingyi Zhang
关键词-EN: task aims, aims to learn, learn a robust, features, model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The domain generalization (DG) task aims to learn a robust model from source domains that can handle the out-of-distribution (OOD) issue. In order to improve the generalization ability of the model in unseen domains, increasing the diversity of training samples is an effective solution. However, existing augmentation approaches always have some limitations. On the one hand, the augmentation manner in most DG methods is not enough, as the model may not see perturbed features that approximate the worst case due to the randomness, so the transferability of features cannot be fully explored. On the other hand, the causality in discriminative features is not involved in these methods, which harms the generalization ability of the model due to spurious correlations. To address these issues, we propose a Dual-stream Feature Augmentation (DFA) method by constructing some hard features from two perspectives. Firstly, to improve the transferability, we construct some targeted features with a domain-related augmentation manner. Through the guidance of uncertainty, some hard cross-domain fictitious features are generated to simulate domain shift. Secondly, to take causality into consideration, the spuriously correlated non-causal information is disentangled by an adversarial mask, so that more discriminative features can be extracted from the hard causal-related information. Different from previous fixed synthesizing strategies, the two augmentations are integrated into a unified learnable feature disentanglement model. Based on these hard features, contrastive learning is employed to keep the semantic consistency and improve the robustness of the model. Extensive experiments on several datasets demonstrated that our approach could achieve state-of-the-art performance for domain generalization. Our code is available at: this https URL.

[CV-146] C2F-CHART: A Curriculum Learning Approach to Chart Classification ICPR

链接: https://arxiv.org/abs/2409.04683
作者: Nour Shaheen,Tamer Elsharnouby,Marwan Torki
关键词-EN: visually representing data, scientific research, representing data, visually representing, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted for publication in the proceedings of the 2024 International Conference on Pattern Recognition (ICPR)

点击查看摘要

Abstract:In scientific research, charts are usually the primary method for visually representing data. However, the accessibility of charts remains a significant concern. In an effort to improve chart understanding pipelines, we focus on optimizing the chart classification component. We leverage curriculum learning, which is inspired by the human learning process. In this paper, we introduce a novel training approach for chart classification that utilizes coarse-to-fine curriculum learning. Our approach, which we name C2F-CHART (for coarse-to-fine), exploits inter-class similarities to create learning tasks of varying difficulty levels. We benchmark our method on the ICPR 2022 CHART-Infographics UB UNITEC PMC dataset, outperforming the state-of-the-art results.
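Coarse-to-fine curriculum learning of this kind can be organized as a simple two-stage label schedule: first train on merged groups of similar classes, then on the fine classes themselves. The helper below is an illustrative sketch; the grouping and all names are hypothetical, not the paper's code.

```python
def build_curriculum(similar_groups, classes):
    """Two-stage coarse-to-fine label schedule.

    similar_groups: lists of fine classes merged into one coarse group
    (formed from inter-class similarity in the paper; hypothetical here).
    Returns [stage1, stage2], where stage 1 relabels each sample with its
    coarse group id and stage 2 keeps the fine class.
    """
    coarse_of = {}
    for gid, group in enumerate(similar_groups):
        for cls in group:
            coarse_of[cls] = gid
    stage1 = [("coarse", coarse_of[cls]) for cls in classes]
    stage2 = [("fine", cls) for cls in classes]
    return [stage1, stage2]
```

Training first on the easier coarse task gives the network a head start before it must separate visually similar chart types, which is the curriculum effect the abstract leverages.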

[CV-147] Neural Augmentation Based Panoramic High Dynamic Range Stitching

链接: https://arxiv.org/abs/2409.04679
作者: Chaobing Zheng,Yilun Xu,Weihai Chen,Shiqian Wu,Zhengguo Li
关键词-EN: low dynamic range, high dynamic range, inputting low dynamic, dynamic range, geometrically synchronized LDR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages

点击查看摘要

Abstract:Due to saturated regions of inputting low dynamic range (LDR) images and large intensity changes among the LDR images caused by different exposures, it is challenging to produce an information-enriched panoramic LDR image without visual artifacts for a high dynamic range (HDR) scene through stitching multiple geometrically synchronized LDR images with different exposures and pairwise overlapping fields of views (OFOVs). Fortunately, the stitching of such images is innately a perfect scenario for the fusion of a physics-driven approach and a data-driven approach due to their OFOVs. Based on this new insight, a novel neural augmentation based panoramic HDR stitching algorithm is proposed in this paper. The physics-driven approach is built up using the OFOVs. Different exposed images of each view are initially generated by using the physics-driven approach, are then refined by a data-driven approach, and are finally used to produce panoramic LDR images with different exposures. All the panoramic LDR images with different exposures are combined together via a multi-scale exposure fusion algorithm to produce the final panoramic LDR image. Experimental results demonstrate that the proposed algorithm outperforms existing panoramic stitching algorithms.

[CV-148] Multi-Conditioned Denoising Diffusion Probabilistic Model (mDDPM) for Medical Image Synthesis

链接: https://arxiv.org/abs/2409.04670
作者: Arjun Krishna,Ge Wang,Klaus Mueller
关键词-EN: Medical imaging applications, specialized in terms, terms of human, Medical imaging, Denoising Diffusion Probabilistic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical imaging applications are highly specialized in terms of human anatomy, pathology, and imaging domains. Therefore, annotated training datasets for training deep learning applications in medical imaging not only need to be highly accurate but also diverse and large enough to encompass almost all plausible examples with respect to those specifications. We argue that achieving this goal can be facilitated through a controlled generation framework for synthetic images with annotations, requiring multiple conditional specifications as input to provide control. We employ a Denoising Diffusion Probabilistic Model (DDPM) to train a large-scale generative model in the lung CT domain and expand upon a classifier-free sampling strategy to showcase one such generation framework. We show that our approach can produce annotated lung CT images that can faithfully represent anatomy, convincingly fooling experts into perceiving them as real. Our experiments demonstrate that controlled generative frameworks of this nature can surpass nearly every state-of-the-art image generative model in achieving anatomical consistency in generated medical images when trained on comparable large medical datasets.

[CV-149] Structure-Invariant Range-Visual-Inertial Odometry IROS

链接: https://arxiv.org/abs/2409.04633
作者: Ivan Alberico,Jeff Delaune,Giovanni Cioffi,Davide Scaramuzza
关键词-EN: Mars Science Helicopter, Valles Marineris, Science Helicopter, targeting landing sites, highly irregular terrain
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE/RSJ International Conference on Intelligent Robots (IROS), 2024

点击查看摘要

Abstract:The Mars Science Helicopter (MSH) mission aims to deploy the next generation of unmanned helicopters on Mars, targeting landing sites in highly irregular terrain such as Valles Marineris, the largest canyon system in the Solar System, with elevation variances of up to 8000 meters. Unlike its predecessor, the Mars 2020 mission, which relied on a state estimation system assuming planar terrain, MSH requires a novel approach due to the complex topography of the landing site. This work introduces a novel range-visual-inertial odometry system tailored for the unique challenges of the MSH mission. Our system extends the state-of-the-art xVIO framework by fusing consistent range information with visual and inertial measurements, preventing metric scale drift in the absence of visual-inertial excitation (mono camera and constant velocity descent), and enabling landing on any terrain structure, without requiring any planar terrain assumption. Through extensive testing in image-based simulations using actual terrain structure and textures collected in Mars orbit, we demonstrate that our range-VIO approach estimates terrain-relative velocity meeting the stringent mission requirements, and outperforming existing methods.

[CV-150] Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment BMVC

链接: https://arxiv.org/abs/2409.04607
作者: Keyne Oei,Amr Gomaa,Anna Maria Feit,João Belo
关键词-EN: Robust frame-wise embeddings, Robust frame-wise, perform video analysis, frame-wise embeddings, embeddings are essential
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in 2nd Workshop on Video Understanding and its Applications, held in conjunction with the British Machine Vision Conference (BMVC) 2024

点击查看摘要

Abstract:Robust frame-wise embeddings are essential to perform video analysis and understanding tasks. We present a self-supervised method for representation learning based on aligning temporal video sequences. Our framework uses a transformer-based encoder to extract frame-level features and leverages them to find the optimal alignment path between video sequences. We introduce the novel Local-Alignment Contrastive (LAC) loss, which combines a differentiable local alignment loss to capture local temporal dependencies with a contrastive loss to enhance discriminative learning. Prior works on video alignment have focused on using global temporal ordering across sequence pairs, whereas our loss encourages identifying the best-scoring subsequence alignment. LAC uses the differentiable Smith-Waterman (SW) affine method, which features a flexible parameterization learned through the training phase, enabling the model to adjust the temporal gap penalty length dynamically. Evaluations show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
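The LAC loss smooths the Smith-Waterman recurrence so that it becomes differentiable. The hard-max baseline being smoothed, local alignment over a frame-similarity matrix with a linear gap penalty, looks like this (illustrative sketch; the paper uses an affine-gap, differentiable variant):

```python
def smith_waterman(sim, gap=1.0):
    """Hard-max Smith-Waterman local alignment score on a frame-similarity
    matrix with a linear gap penalty. The differentiable LAC variant
    replaces the max with a smooth approximation and uses affine gaps."""
    n, m = len(sim), len(sim[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                  # restart the local alignment
                H[i - 1][j - 1] + sim[i - 1][j - 1],  # align frame i with frame j
                H[i - 1][j] - gap,                    # skip a frame in sequence 1
                H[i][j - 1] - gap,                    # skip a frame in sequence 2
            )
            best = max(best, H[i][j])
    return best
```

The 0.0 branch is what makes the alignment local: a poorly matching prefix can be dropped, so the loss rewards the best-scoring subsequence alignment rather than a full global ordering.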

[CV-151] Multi-scale Feature Fusion with Point Pyramid for 3D Object Detection

链接: https://arxiv.org/abs/2409.04601
作者: Weihao Lu,Dezong Zhao,Cristiano Premebida,Li Zhang,Wenjing Zhao,Daxin Tian
关键词-EN: autonomous driving systems, LiDARbased autonomous driving, Effective point cloud, Effective point, point cloud processing
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 12 pages

点击查看摘要

Abstract:Effective point cloud processing is crucial to LiDAR-based autonomous driving systems. The capability to understand features at multiple scales is required for object detection of intelligent vehicles, where road users may appear in different sizes. Recent methods focus on the design of the feature aggregation operators, which collect features at different scales from the encoder backbone and assign them to the points of interest. While efforts are made into the aggregation modules, the importance of how to fuse these multi-scale features has been overlooked. This leads to insufficient feature communication across scales. To address this issue, this paper proposes the Point Pyramid RCNN (POP-RCNN), a feature pyramid-based framework for 3D object detection on point clouds. POP-RCNN consists of a Point Pyramid Feature Enhancement (PPFE) module to establish connections across spatial scales and semantic depths for information exchange. The PPFE module effectively fuses multi-scale features for rich information without the increased complexity in feature aggregation. To remedy the impact of inconsistent point densities, a point density confidence module is deployed. This design integration enables the use of a lightweight feature aggregator, and the emphasis on both shallow and deep semantics, realising a detection framework for 3D object detection. With great adaptability, the proposed method can be applied to a variety of existing frameworks to increase feature richness, especially for long-distance detection. By adopting the PPFE in the voxel-based and point-voxel-based baselines, experimental results on KITTI and Waymo Open Dataset show that the proposed method achieves remarkable performance even with limited computational headroom.

[CV-152] A Novel Dataset for Video-Based Autism Classification Leveraging Extra-Stimulatory Behavior

链接: https://arxiv.org/abs/2409.04598
作者: Manuel Serna-Aguilera,Xuan Bac Nguyen,Han-Seok Seo,Khoa Luu
关键词-EN: Autism Spectrum Disorder, Autism Spectrum, Spectrum Disorder, degrees of intensity, sensory processing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autism Spectrum Disorder (ASD) can affect individuals at varying degrees of intensity, from challenges in overall health, communication, and sensory processing, and this often begins at a young age. Thus, it is critical for medical professionals to be able to accurately diagnose ASD in young children, but doing so is difficult. Deep learning can be responsibly leveraged to improve productivity in addressing this task. The availability of data, however, remains a considerable obstacle. Hence, in this work, we introduce the Video ASD dataset–a dataset that contains video frame convolutional and attention map feature data–to foster further progress in the task of ASD classification. The original videos showcase children reacting to chemo-sensory stimuli, among auditory, touch, and vision stimuli. This dataset contains the features of the frames spanning 2,467 videos, for a total of approximately 1.4 million frames. Additionally, head pose angles are included to account for head movement noise, as well as full-sentence text labels for the taste and smell videos that describe how the facial expression changes before, immediately after, and long after interaction with the stimuli. In addition to providing features, we also test foundation models on this data to showcase how movement noise affects performance and the need for more data and more complex labels.

[CV-153] Influence of Early through Late Fusion on Pancreas Segmentation from Imperfectly Registered Multimodal MRI

链接: https://arxiv.org/abs/2409.04563
作者: Lucas W. Remedios,Han Liu,Samuel W. Remedios,Lianrui Zuo,Adam M. Saunders,Shunxing Bao,Yuankai Huo,Alvin C. Powers,John Virostko,Bennett A. Landman
关键词-EN: Multimodal fusion promises, fusion, Multimodal fusion, Multimodal, Dice score
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13.5 pages of manuscript content

点击查看摘要

Abstract:Multimodal fusion promises better pancreas segmentation. However, where to perform fusion in models is still an open question. It is unclear if there is a best location to fuse information when analyzing pairs of imperfectly aligned images. Two main alignment challenges in this pancreas segmentation study are 1) the pancreas is deformable and 2) breathing deforms the abdomen. Even after image registration, relevant deformations are often not corrected. We examine how early through late fusion impacts pancreas segmentation. We used 353 pairs of T2-weighted (T2w) and T1-weighted (T1w) abdominal MR images from 163 subjects with accompanying pancreas labels. We used image registration (deeds) to align the image pairs. We trained a collection of basic UNets with different fusion points, spanning from early to late, to assess how early through late fusion influenced segmentation performance on imperfectly aligned images. We assessed generalization of fusion points on nnUNet. The single-modality T2w baseline using a basic UNet model had a Dice score of 0.73, while the same baseline on the nnUNet model achieved 0.80. For the basic UNet, the best fusion approach occurred in the middle of the encoder (early/mid fusion), which led to a statistically significant improvement of 0.0125 on Dice score compared to the baseline. For the nnUNet, the best fusion approach was naïve image concatenation before the model (early fusion), which resulted in a statistically significant Dice score increase of 0.0021 compared to baseline. Fusion in specific blocks can improve performance, but the best blocks for fusion are model specific, and the gains are small. In imperfectly registered datasets, fusion is a nuanced problem, with the art of design remaining vital for uncovering potential insights. Future innovation is needed to better address fusion in cases of imperfect alignment of abdominal image pairs.
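
The early-versus-late fusion contrast studied here can be illustrated with toy functions (the "encoder" and all names below are invented for the sketch; the study itself uses basic UNet and nnUNet models on MR volumes):

```python
def encode(xs):
    # Toy "encoder": cumulative nonlinear pass over a flat feature list.
    out = 0.0
    for x in xs:
        out = max(out + x, 0.0) * 0.9
    return out

def early_fusion(t1w, t2w):
    # Early fusion: interleave the two modalities so one shared
    # encoder sees both from the very first layer (akin to naive
    # image concatenation before the model).
    mixed = [v for pair in zip(t1w, t2w) for v in pair]
    return encode(mixed)

def late_fusion(t1w, t2w):
    # Late fusion: independent per-modality encoders, combined
    # only at the output stage.
    return 0.5 * (encode(t1w) + encode(t2w))
```

With a nonlinear encoder the two fusion points generally produce different features, which is why *where* to fuse becomes a design question at all.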

[CV-154] Dual-Level Cross-Modal Contrastive Clustering

链接: https://arxiv.org/abs/2409.04561
作者: Haixin Zhang,Yongjun Li,Dong Huang
关键词-EN: involves grouping images, clusters without labels, involves grouping, clustering, Image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages,4 figures

点击查看摘要

Abstract:Image clustering, which involves grouping images into different clusters without labels, is a key task in unsupervised learning. Although previous deep clustering methods have achieved remarkable results, they only explore the intrinsic information of the image itself but overlook external supervision knowledge to improve the semantic understanding of images. Recently, visual-language models pre-trained on large-scale datasets have been used in various downstream tasks and have achieved great results. However, there is a gap between visual representation learning and textual semantic learning, and how to properly utilize the representations of two different modalities for clustering is still a big challenge. To tackle these challenges, we propose a novel image clustering framework, named Dual-level Cross-Modal Contrastive Clustering (DXMC). Firstly, external textual information is introduced for constructing a semantic space which is adopted to generate image-text pairs. Secondly, the image-text pairs are respectively sent to pre-trained image and text encoders to obtain image and text embeddings which are subsequently fed into four well-designed networks. Thirdly, dual-level cross-modal contrastive learning is conducted between discriminative representations of different modalities at distinct levels. Extensive experimental results on five benchmark datasets demonstrate the superiority of our proposed method.
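
The cross-modal contrastive objective underlying such methods can be illustrated with a minimal InfoNCE-style loss (a generic sketch, not DXMC's actual four-network design; embeddings are plain lists):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(image_embs, text_embs, temperature=0.1):
    """Cross-modal contrastive loss: each image embedding is pulled toward
    its paired text embedding and pushed away from all other texts."""
    loss = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(image_embs)
```

Correctly matched image-text pairs yield a much lower loss than mismatched ones, which is the signal the clustering networks train on.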

[CV-155] Multi-Modal Diffusion for Hand-Object Grasp Generation

链接: https://arxiv.org/abs/2409.04560
作者: Jinkun Cao,Jingyuan Liu,Kris Kitani,Yi Zhou
关键词-EN: Multi-modal Grasp Diffusion, generating hand, generating hand grasp, generating hand poses, hand
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8-page paper, 7-page appendix and 10 pages

点击查看摘要

Abstract:In this work, we focus on generating hand grasp over objects. Compared to previous works of generating hand poses with a given object, we aim to allow the generalization of both hand and object shapes by a single model. Our proposed method Multi-modal Grasp Diffusion (MGD) learns the prior and conditional posterior distribution of both modalities from heterogeneous data sources. Therefore it relieves the limitation of hand-object grasp datasets by leveraging the large-scale 3D object datasets. According to both qualitative and quantitative experiments, both conditional and unconditional generation of hand grasp achieve good visual plausibility and diversity. The proposed method also generalizes well to unseen object shapes. The code and weights will be available at this https URL.

[CV-156] Thinking Outside the BBox: Unconstrained Generative Object Compositing

链接: https://arxiv.org/abs/2409.04559
作者: Gemma Canet Tarrés,Zhe Lin,Zhifei Zhang,Jianming Zhang,Yizhi Song,Dan Ruta,Andrew Gilbert,John Collomosse,Soo Ye Kim
关键词-EN: involves multiple non-trivial, multiple non-trivial sub-tasks, lighting harmonization, geometry adjustment, image involves multiple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Compositing an object into an image involves multiple non-trivial sub-tasks such as object placement and scaling, color/lighting harmonization, viewpoint/geometry adjustment, and shadow/reflection generation. Recent generative image compositing methods leverage diffusion models to handle multiple sub-tasks at once. However, existing models face limitations due to their reliance on masking the original object during training, which constrains their generation to the input mask. Furthermore, obtaining an accurate input mask specifying the location and scale of the object in a new image can be highly challenging. To overcome such limitations, we define a novel problem of unconstrained generative object compositing, i.e., the generation is not bounded by the mask, and train a diffusion-based model on a synthesized paired dataset. Our first-of-its-kind model is able to generate object effects such as shadows and reflections that go beyond the mask, enhancing image realism. Additionally, if an empty mask is provided, our model automatically places the object in diverse natural locations and scales, accelerating the compositing workflow. Our model outperforms existing object placement and compositing models in various quality metrics and user studies.

[CV-157] SCARF: Scalable Continual Learning Framework for Memory-efficient Multiple Neural Radiance Fields

链接: https://arxiv.org/abs/2409.04482
作者: Yuze Wang,Junyi Wang,Chen Wang,Wantong Duan,Yongtang Bao,Yue Qi
关键词-EN: Neural Radiance Fields, paper introduces, framework for synthesising, updating the network, training data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a novel continual learning framework for synthesising novel views of multiple scenes, learning multiple 3D scenes incrementally, and updating the network parameters only with the training data of the upcoming new scene. We build on Neural Radiance Fields (NeRF), which uses a multi-layer perceptron to model the density and radiance field of a scene as the implicit function. While NeRF and its extensions have shown a powerful capability of rendering photo-realistic novel views in a single 3D scene, managing these growing 3D NeRF assets efficiently is a new scientific problem. Very few works focus on the efficient representation or continuous learning capability of multiple scenes, which is crucial for the practical applications of NeRF. To achieve these goals, our key idea is to represent multiple scenes as the linear combination of a cross-scene weight matrix and a set of scene-specific weight matrices generated from a global parameter generator. Furthermore, we propose an uncertain surface knowledge distillation strategy to transfer the radiance field knowledge of previous scenes to the new model. Representing multiple 3D scenes with such weight matrices significantly reduces memory requirements. At the same time, the uncertain surface distillation strategy greatly overcomes the catastrophic forgetting problem and maintains the photo-realistic rendering quality of previous scenes. Experiments show that the proposed approach achieves state-of-the-art rendering quality of continual learning NeRF on NeRF-Synthetic, LLFF, and TanksAndTemples datasets while keeping the extra storage cost low.
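
The key idea of the weight representation — a shared cross-scene matrix plus a combination of scene-specific matrices — can be sketched as a plain linear combination (matrix names, shapes, and coefficients below are invented for illustration):

```python
def compose_scene_weights(cross_scene, scene_specific, coeffs):
    """Per-scene parameters = shared cross-scene matrix plus a linear
    combination of scene-specific matrices (matrices as lists of rows)."""
    rows, cols = len(cross_scene), len(cross_scene[0])
    out = [row[:] for row in cross_scene]  # start from the shared part
    for c, mat in zip(coeffs, scene_specific):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += c * mat[i][j]
    return out
```

Because every scene shares `cross_scene` and only the small scene-specific terms grow with the number of scenes, memory scales far better than storing one full NeRF per scene.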

[CV-158] A Flexible Framework for Universal Computational Aberration Correction via Automatic Lens Library Generation and Domain Adaptation

链接: https://arxiv.org/abs/2409.05809
作者: Qi Jiang,Yao Gao,Shaohua Gao,Zhonghua Yi,Lei Sun,Hao Shi,Kailun Yang,Kaiwei Wang,Jian Bai
关键词-EN: Computational Aberration Correction, Emerging universal Computational, universal Computational Aberration, CAC, repeated data preparation
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Emerging universal Computational Aberration Correction (CAC) paradigms provide an inspiring solution to light-weight and high-quality imaging without repeated data preparation and model training to accommodate new lens designs. However, the training databases in these approaches, i.e., the lens libraries (LensLibs), suffer from their limited coverage of real-world aberration behaviors. In this work, we set up an OmniLens framework for universal CAC, considering both the generalization ability and flexibility. OmniLens extends the idea of universal CAC to a broader concept, where a base model is trained for three cases, including zero-shot CAC with the pre-trained model, few-shot CAC with a little lens-specific data for fine-tuning, and domain adaptive CAC using domain adaptation for lens-descriptions-unknown lens. In terms of OmniLens’s data foundation, we first propose an Evolution-based Automatic Optical Design (EAOD) pipeline to construct LensLib automatically, coined AODLib, whose diversity is enriched by an evolution framework, with comprehensive constraints and a hybrid optimization strategy for achieving realistic aberration behaviors. For network design, we introduce the guidance of high-quality codebook priors to facilitate zero-shot CAC and few-shot CAC, which enhances the model’s generalization ability, while also boosting its convergence in a few-shot case. Furthermore, based on the statistical observation of dark channel priors in optical degradation, we design an unsupervised regularization term to adapt the base model to the target descriptions-unknown lens using its aberration images without ground truth. We validate OmniLens on 4 manually designed low-end lenses with various structures and aberration behaviors. Remarkably, the base model trained on AODLib exhibits strong generalization capabilities, achieving 97% of the lens-specific performance in a zero-shot setting.
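
The dark channel prior that the unsupervised regularization term builds on can be computed as follows (a textbook sketch; the patch size and image format are illustrative, not taken from the paper):

```python
def dark_channel(image, patch=3):
    """Dark channel: per-pixel minimum over RGB, then a minimum over a
    local window. `image` is a list of rows of (r, g, b) tuples in [0, 1]."""
    h, w = len(image), len(image[0])
    min_rgb = [[min(px) for px in row] for row in image]
    half = patch // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = min(
                min_rgb[yy][xx]
                for yy in range(max(0, y - half), min(h, y + half + 1))
                for xx in range(max(0, x - half), min(w, x + half + 1))
            )
    return out
```

Statistics of this channel over aberrated images can then drive a regularizer without any ground-truth sharp images.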

[CV-159] Consensus-based Distributed Quantum Kernel Learning for Speech Recognition

链接: https://arxiv.org/abs/2409.05770
作者: Kuan-Cheng Chen,Wenxuan Ma,Xiaotian Xu
关键词-EN: Quantum Kernel Learning, presents a Consensus-based, Consensus-based Distributed Quantum, quantum computing.CDQKL addresses, Kernel Learning
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a Consensus-based Distributed Quantum Kernel Learning (CDQKL) framework aimed at improving speech recognition through distributed quantum computing. CDQKL addresses the challenges of scalability and data privacy in centralized quantum kernel learning. It does this by distributing computational tasks across quantum terminals, which are connected through classical channels. This approach enables the exchange of model parameters without sharing local training data, thereby maintaining data privacy and enhancing computational efficiency. Experimental evaluations on benchmark speech emotion recognition datasets demonstrate that CDQKL achieves competitive classification accuracy and scalability compared to centralized and local quantum kernel learning models. The distributed nature of CDQKL offers advantages in privacy preservation and computational efficiency, making it suitable for data-sensitive fields such as telecommunications, automotive, and finance. The findings suggest that CDQKL can effectively leverage distributed quantum computing for large-scale machine-learning tasks.
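
The consensus mechanism — exchanging model parameters rather than training data — can be caricatured as an averaging step (a heavily simplified classical sketch; the actual framework operates on quantum kernel models linked by classical channels):

```python
def consensus_step(node_params, mixing=0.5):
    """One consensus round: each node moves toward the mean of all nodes'
    parameter vectors. Only parameters are exchanged, never local data."""
    n = len(node_params)
    dim = len(node_params[0])
    mean = [sum(p[k] for p in node_params) / n for k in range(dim)]
    return [[(1 - mixing) * p[k] + mixing * mean[k] for k in range(dim)]
            for p in node_params]
```

Iterating this step drives all terminals toward a common model while each node's raw training data stays local.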

[CV-160] Cherenkov Imaged Bio-morphological Features Verify Patient Positioning with Deformable Tissue Translocation in Breast Radiotherapy

链接: https://arxiv.org/abs/2409.05680
作者: Yao Chen,Savannah M. Decker,Petr Bruza,David J. Gladstone,Lesley A. Jarvis,Brian W. Pogue,Kimberley S. Samkoe,Rongxiao Zhang
关键词-EN:
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
*备注: 25 pages, 4 figures, 1 table, journal under review

点击查看摘要

[CV-161] Robust Real-time Segmentation of Bio-Morphological Features in Human Cherenkov Imaging during Radiotherapy via Deep Learning

链接: https://arxiv.org/abs/2409.05666
作者: Shiru Wang,Yao Chen,Lesley A. Jarvis,Yucheng Tang,David J. Gladstone,Kimberley S. Samkoe,Brian W. Pogue,Petr Bruza,Rongxiao Zhang
关键词-EN: Radiation Therapy, electron beam delivery, Cherenkov imaging enables, megavoltage X-ray, X-ray or electron
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 9 pages, 7 figures, 1 table, journal under review

点击查看摘要

Abstract:Cherenkov imaging enables real-time visualization of megavoltage X-ray or electron beam delivery to the patient during Radiation Therapy (RT). Bio-morphological features, such as vasculature, seen in these images are patient-specific signatures that can be used for verification of positioning and motion management that are essential to precise RT treatment. However, until now, no concerted analysis based on these biological features had been applied, owing to the slow speed and limited accuracy of conventional image processing for feature segmentation. This study demonstrated the first deep learning framework for such an application, achieving video frame rate processing. To address the challenge of limited annotation of these features in Cherenkov images, a transfer learning strategy was applied. A fundus photography dataset including 20,529 patch retina images with ground-truth vessel annotation was used to pre-train a ResNet segmentation framework. Subsequently, a small Cherenkov dataset (1,483 images from 212 treatment fractions of 19 breast cancer patients) with known annotated vasculature masks was used to fine-tune the model for accurate segmentation prediction. This deep learning framework achieved consistent and rapid segmentation of Cherenkov-imaged bio-morphological features on another 19 patients, including subcutaneous veins, scars, and pigmented skin. Average segmentation by the model achieved a Dice score of 0.85 and required less than 0.7 milliseconds of processing time per instance. The model demonstrated outstanding consistency against input image variance and superior speed compared to conventional manual segmentation methods, laying the foundation for online segmentation in real-time monitoring in a prospective setting.
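
The Dice score used to report segmentation quality is the standard overlap metric, shown here on flat binary masks:

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as flat 0/1 lists:
    2 * |intersection| / (|pred| + |truth|)."""
    inter = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2.0 * inter / total if total else 1.0
```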

[CV-162] Rethinking the Atmospheric Scattering-driven Attention via Channel and Gamma Correction Priors for Low-Light Image Enhancement

链接: https://arxiv.org/abs/2409.05274
作者: Shyang-En Weng,Cheng-Yen Hsiao,Shaou-Gang Miaou
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-163] Label-free evaluation of lung and heart transplant biopsies using virtual staining

链接: https://arxiv.org/abs/2409.05255
作者: Yuzhu Li,Nir Pillar,Tairan Liu,Guangdong Ma,Yuxuan Qi,Kevin de Haan,Yijie Zhang,Xilin Yang,Adrian J. Correa,Guangqian Xiao,Kuang-Yu Jen,Kenneth A. Iczkowski,Yulun Wu,William Dean Wallace,Aydogan Ozcan
关键词-EN: end-stage organ failures, primary therapeutic strategy, Organ transplantation serves, Organ transplantation, organ failures
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 Pages, 5 Figures

点击查看摘要

Abstract:Organ transplantation serves as the primary therapeutic strategy for end-stage organ failures. However, allograft rejection is a common complication of organ transplantation. Histological assessment is essential for the timely detection and diagnosis of transplant rejection and remains the gold standard. Nevertheless, the traditional histochemical staining process is time-consuming, costly, and labor-intensive. Here, we present a panel of virtual staining neural networks for lung and heart transplant biopsies, which digitally convert autofluorescence microscopic images of label-free tissue sections into their brightfield histologically stained counterparts, bypassing the traditional histochemical staining process. Specifically, we virtually generated Hematoxylin and Eosin (HE), Masson’s Trichrome (MT), and Elastic Verhoeff-Van Gieson (EVG) stains for label-free transplant lung tissue, along with HE and MT stains for label-free transplant heart tissue. Subsequent blind evaluations conducted by three board-certified pathologists have confirmed that the virtual staining networks consistently produce high-quality histology images with high color uniformity, closely resembling their well-stained histochemical counterparts across various tissue features. The use of virtually stained images for the evaluation of transplant biopsies achieved comparable diagnostic outcomes to those obtained via traditional histochemical staining, with a concordance rate of 82.4% for lung samples and 91.7% for heart samples. Moreover, virtual staining models create multiple stains from the same autofluorescence input, eliminating structural mismatches observed between adjacent sections stained in the traditional workflow, while also saving tissue, expert time, and staining costs.

[CV-164] Zero-Shot Whole Slide Image Retrieval in Histopathology Using Embeddings of Foundation Models DATE

链接: https://arxiv.org/abs/2409.04631
作者: Saghir Alfasly,Peyman Nejat,Ghazal Alabtah,Sobhan Hemati,Krishna Rani Kalari,H.R. Tizhoosh
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper will be updated with more results

点击查看摘要

[CV-165] A Short Survey on Set-Based Aggregation Techniques for Single-Vector WSI Representation in Digital Pathology

链接: https://arxiv.org/abs/2409.04615
作者: S. Hemati,Krishna R. Kalari,H.R. Tizhoosh
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-166] NeCA: 3D Coronary Artery Tree Reconstruction from Two 2D Projections by Neural Implicit Representation

链接: https://arxiv.org/abs/2409.04596
作者: Yiying Wang,Abhirup Banerjee,Vicente Grau
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 10 figures, 6 tables

点击查看摘要

[CV-167] Diff-INR: Generative Regularization for Electrical Impedance Tomography

链接: https://arxiv.org/abs/2409.04494
作者: Bowen Tong,Junwu Wang,Dong Liu
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-168] Pattern based learning and optimisation through pricing for bin packing problem

链接: https://arxiv.org/abs/2409.04456
作者: Huayan Zhang,Ruibin Bai,Tie-Yan Liu,Jiawei Li,Bingchen Lin,Jianfeng Ren
关键词-EN:
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

机器学习

[LG-0] A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making

链接: https://arxiv.org/abs/2409.05866
作者: Renato Berlinghieri,David R. Burt,Paolo Giani,Arlene M. Fiore,Tamara Broderick
关键词-EN: poses health risks, Wildfire frequency, pollution poses health, health risks, resulting air pollution
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 22 pages, 3 figures

点击查看摘要

Abstract:Wildfire frequency is increasing as the climate changes, and the resulting air pollution poses health risks. Just as people routinely use weather forecasts to plan their activities around precipitation, reliable air quality forecasts could help individuals reduce their exposure to air pollution. In the present work, we evaluate several existing forecasts of fine particulate matter (PM2.5) within the continental United States in the context of individual decision-making. Our comparison suggests there is meaningful room for improvement in air pollution forecasting, which might be realized by incorporating more data sources and using machine learning tools. To facilitate future machine learning development and benchmarking, we set up a framework to evaluate and compare air pollution forecasts for individual decision making. We introduce a new loss to capture decisions about when to use mitigation measures. We highlight the importance of visualizations when comparing forecasts. Finally, we provide code to download and compare archived forecast predictions.
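
A decision-centred loss of the kind described — scoring a forecast by the mitigation decisions it triggers — might be sketched as follows (the threshold and cost values are invented placeholders, not the paper's actual loss):

```python
def mitigation_decision_loss(forecast, observed, threshold,
                             cost_mitigate, cost_exposure):
    """Asymmetric decision loss: act (e.g., stay indoors, run a purifier)
    whenever the forecast exceeds the threshold; pay the mitigation cost
    if you act, and the exposure cost if you fail to act on a bad day."""
    loss = 0.0
    for f, o in zip(forecast, observed):
        if f >= threshold:
            loss += cost_mitigate
        elif o >= threshold:
            loss += cost_exposure
    return loss / len(forecast)
```

Unlike RMSE, this loss only penalizes errors that change what an individual would actually do.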

[LG-1] Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments

链接: https://arxiv.org/abs/2409.05865
作者: Haritheja Etukuru,Norihito Naka,Zijin Hu,Seungjae Lee,Julian Mehu,Aaron Edsinger,Chris Paxton,Soumith Chintala,Lerrel Pinto,Nur Muhammad Mahi Shafiullah
关键词-EN: Robot, navigation capabilities, trained with large, large amounts, plethora of real-world
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project website this https URL

点击查看摘要

Abstract:Robot models, particularly those trained with large amounts of data, have recently shown a plethora of real-world manipulation and navigation capabilities. Several independent efforts have shown that given sufficient training data in an environment, robot policies can generalize to demonstrated variations in that environment. However, needing to finetune robot models to every new environment stands in stark contrast to models in language or vision that can be deployed zero-shot for open-world problems. In this work, we present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies that can directly generalize to new environments without any finetuning. To create RUMs efficiently, we develop new tools to quickly collect data for mobile manipulation tasks, integrate such data into a policy with multi-modal imitation learning, and deploy policies on-device on Hello Robot Stretch, a cheap commodity robot, with an external mLLM verifier for retrying. We train five such utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects. Our system, on average, achieves 90% success rate in unseen, novel environments interacting with unseen objects. Moreover, the utility models can also succeed in different robot and camera set-ups with no further data, training, or fine-tuning. Primary among our lessons are the importance of training data over training algorithm and policy class, guidance about data scaling, necessity for diverse yet high-quality demonstrations, and a recipe for robot introspection and retrying to improve performance on individual environments. Our code, data, models, hardware designs, as well as our experiment and deployment videos are open sourced and can be found on our project website: this https URL

[LG-2] Neural MP: A Generalist Neural Motion Planner

链接: https://arxiv.org/abs/2409.05864
作者: Murtaza Dalal,Jiahui Yang,Russell Mendonca,Youssef Khaky,Ruslan Salakhutdinov,Deepak Pathak
关键词-EN: consumes significant amounts, planning generates solutions, computational resources, motion planning, current paradigm
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Website at this http URL . Main paper: 7 pages, 4 figures, 2 tables. Appendix: 9 pages, 5 figures, 6 tables

点击查看摘要

Abstract:The current paradigm for motion planning generates solutions from scratch for every new problem, which consumes significant amounts of time and computational resources. For complex, cluttered scenes, motion planning approaches can often take minutes to produce a solution, while humans are able to accurately and safely reach any goal in seconds by leveraging their prior experience. We seek to do the same by applying data-driven learning at scale to the problem of motion planning. Our approach builds a large number of complex scenes in simulation, collects expert data from a motion planner, then distills it into a reactive generalist policy. We then combine this with lightweight optimization to obtain a safe path for real world deployment. We perform a thorough evaluation of our method on 64 motion planning tasks across four diverse environments with randomized poses, scenes and obstacles, in the real world, demonstrating improvements of 23%, 17%, and 79% in motion planning success rate over state-of-the-art sampling-based, optimization-based, and learning-based planning methods. Video results available at this http URL

[LG-3] Improving Pretraining Data Using Perplexity Correlations

链接: https://arxiv.org/abs/2409.05816
作者: Tristan Thrush,Christopher Potts,Tatsunori Hashimoto
关键词-EN: high-performance language models, Quality pretraining data, Quality pretraining, language models, key to high-performance
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.
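
The core selection step — ranking data sources by how well per-model losses on that source track benchmark performance — can be sketched as follows (a simplification of the paper's statistical framework; domain names and numbers are invented):

```python
def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def select_domains(domain_losses, bench_scores, k=2):
    """domain_losses maps each domain to one loss value per LLM;
    bench_scores holds each LLM's benchmark score. Keep the k domains
    whose losses correlate most strongly (in magnitude) with scores."""
    ranked = sorted(domain_losses,
                    key=lambda d: abs(pearson(domain_losses[d], bench_scores)),
                    reverse=True)
    return ranked[:k]
```

Intuitively, a domain where lower loss reliably predicts higher benchmark scores is a domain worth pretraining on.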

[LG-4] Benchmarking Chinese Knowledge Rectification in Large Language Models

链接: https://arxiv.org/abs/2409.05806
作者: Tianhe Lu,Jizhan Fang,Yunzhi Yao,Xin Xu,Ningyu Zhang,Huajun Chen
关键词-EN: Large Language Models, exhibit remarkable generative, remarkable generative capabilities, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Ongoing work; code and dataset are available at this https URL

点击查看摘要

Abstract:While Large Language Models (LLMs) exhibit remarkable generative capabilities, they are not without flaws, particularly in the form of hallucinations. This issue is even more pronounced when LLMs are applied to specific languages and domains. For example, LLMs may generate nonsense information when handling Chinese ancient poetry, proverbs, or idioms, owing to the lack of specific knowledge. To this end, this paper introduces a benchmark for rectifying Chinese knowledge in LLMs via knowledge editing. Specifically, we introduce a new Chinese dataset, CKnowEdit, by collecting seven types of knowledge from various sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, thereby accounting for the unique polyphony, antithesis, and logical constructs inherent in the Chinese language. Through the analysis of this dataset, we uncover the challenges faced by current LLMs in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques on this dataset unveils the substantial scope for advancement in the rectification of Chinese knowledge. Code and dataset are available at this https URL.

[LG-5] Celcomen: spatial causal disentanglement for single-cell and tissue perturbation modeling

链接: https://arxiv.org/abs/2409.05804
作者: Stathis Megas,Daniel G. Chen,Krzysztof Polanski,Moshe Eliasof,Carola-Bibiane Schonlieb,Sarah A. Teichmann
关键词-EN: cellular gene regulation, graph neural network, mathematical causality framework, gene regulation programs, generative graph neural
类目: Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
*备注:

点击查看摘要

Abstract:Celcomen leverages a mathematical causality framework to disentangle intra- and inter-cellular gene regulation programs in spatial transcriptomics and single-cell data through a generative graph neural network. It can learn gene-gene interactions, as well as generate post-perturbation counterfactual spatial transcriptomics, thereby offering access to experimentally inaccessible samples. We validated its disentanglement, identifiability, and counterfactual prediction capabilities through simulations and in clinically relevant human glioblastoma, human fetal spleen, and mouse lung cancer samples. Celcomen provides the means to model disease and therapy induced changes allowing for new insights into single-cell spatially resolved tissue responses relevant to human health.

[LG-6] Input Space Mode Connectivity in Deep Neural Networks

链接: https://arxiv.org/abs/2409.05800
作者: Jakub Vrabel,Ori Shem-Ur,Yaron Oz,David Krueger
关键词-EN: deep neural networks, loss landscape mode, landscape mode connectivity, mode connectivity, extend the concept
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We extend the concept of loss landscape mode connectivity to the input space of deep neural networks. Mode connectivity was originally studied within parameter space, where it describes the existence of low-loss paths between different solutions (loss minimizers) obtained through gradient descent. We present theoretical and empirical evidence of its presence in the input space of deep networks, thereby highlighting the broader nature of the phenomenon. We observe that different input images with similar predictions are generally connected, and for trained models, the path tends to be simple, with only a small deviation from being a linear path. Our methodology utilizes real, interpolated, and synthetic inputs created using the input optimization technique for feature visualization. We conjecture that input space mode connectivity in high-dimensional spaces is a geometric effect that takes place even in untrained models and can be explained through percolation theory. We exploit mode connectivity to obtain new insights about adversarial examples and demonstrate its potential for adversarial detection. Additionally, we discuss applications for the interpretability of deep networks.
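
Probing connectivity along a straight input-space path can be sketched as follows (generic; the paper also studies optimized, near-linear paths, and `loss_fn` here is any scalar loss over an input vector):

```python
def linear_path(x0, x1, steps=50):
    """Points along the straight line between two inputs in input space."""
    return [[a + (b - a) * t / (steps - 1) for a, b in zip(x0, x1)]
            for t in range(steps)]

def path_barrier(loss_fn, x0, x1, steps=50):
    """Barrier height: how far the maximum loss along the linear path
    rises above the worse of the two endpoints. A value near zero
    suggests the inputs are (approximately) linearly mode-connected."""
    endpoints = max(loss_fn(x0), loss_fn(x1))
    return max(loss_fn(p) for p in linear_path(x0, x1, steps)) - endpoints
```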

[LG-7] Enhancing Preference-based Linear Bandits via Human Response Time

Link: https://arxiv.org/abs/2409.05798
Authors: Shen Li, Yuyang Zhang, Zhaolin Ren, Claire Liang, Na Li, Julie A. Shah
Keywords-EN: Binary human choice, Binary human, preference strength, response times, feedback is widely
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Econometrics (econ.EM); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Binary human choice feedback is widely used in interactive preference learning for its simplicity, but it provides limited information about preference strength. To overcome this limitation, we leverage human response times, which inversely correlate with preference strength, as complementary information. Our work integrates the EZ-diffusion model, which jointly models human choices and response times, into preference-based linear bandits. We introduce a computationally efficient utility estimator that reformulates the utility estimation problem using both choices and response times as a linear regression problem. Theoretical and empirical comparisons with traditional choice-only estimators reveal that for queries with strong preferences (“easy” queries), choices alone provide limited information, while response times offer valuable complementary information about preference strength. As a result, incorporating response times makes easy queries more useful. We demonstrate this advantage in the fixed-budget best-arm identification problem, with simulations based on three real-world datasets, consistently showing accelerated learning when response times are incorporated.
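The idea of turning choices plus response times into a linear regression can be illustrated with a deliberately simplified proxy. This is not the paper's EZ-diffusion-based estimator; it is a hypothetical construction in which the signed inverse response time, choice/RT, serves as the regression target, exploiting the assumed inverse relation between response time and preference strength:

```python
import random

random.seed(1)
u_true = [2.0, -1.0]  # hidden utility weights we try to recover

def simulate_query():
    x = [random.gauss(0, 1), random.gauss(0, 1)]  # feature difference of two options
    s = x[0] * u_true[0] + x[1] * u_true[1]       # preference strength
    choice = 1.0 if s >= 0 else -1.0              # binary choice (noiseless for brevity)
    rt = 1.0 / (abs(s) + 0.5)                     # stronger preference -> faster response
    return x, choice, rt

# Illustrative estimator: regress choice/RT on the features. Here
# choice/rt = s + 0.5*sign(s), which is nearly linear in the strength s.
X, y = [], []
for _ in range(2000):
    x, c, t = simulate_query()
    X.append(x)
    y.append(c / t)

# Solve the 2x2 normal equations (X^T X) w = X^T y by hand.
sxx = sum(x[0] * x[0] for x in X)
sxy = sum(x[0] * x[1] for x in X)
syy = sum(x[1] * x[1] for x in X)
b0 = sum(x[0] * yi for x, yi in zip(X, y))
b1 = sum(x[1] * yi for x, yi in zip(X, y))
det = sxx * syy - sxy * sxy
w_hat = [(syy * b0 - sxy * b1) / det, (sxx * b1 - sxy * b0) / det]
```

The recovered direction of `w_hat` matches `u_true`, whereas choices alone would only reveal the sign of each comparison.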

[LG-8] Predicting Critical Heat Flux with Uncertainty Quantification and Domain Generalization Using Conditional Variational Autoencoders and Deep Neural Networks

Link: https://arxiv.org/abs/2409.05790
Authors: Farah Alsafadi, Aidan Furlong, Xu Wu
Keywords-EN: CVAE model, generating realistic data, realistic data samples, Deep generative models, CVAE
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Deep generative models (DGMs) have proven to be powerful in generating realistic data samples. Their capability to learn the underlying distribution of a dataset enables them to generate synthetic data samples that closely resemble the original training dataset, thus addressing the challenge of data scarcity. In this work, we investigated the capabilities of DGMs by developing a conditional variational autoencoder (CVAE) model to augment the critical heat flux (CHF) measurement data that was used to generate the 2006 Groeneveld lookup table. To determine how this approach compared to traditional methods, a fine-tuned deep neural network (DNN) regression model was created and evaluated with the same dataset. Both the CVAE and DNN models achieved small mean absolute relative errors, with the CVAE model maintaining more favorable results. To quantify the uncertainty in the model’s predictions, uncertainty quantification (UQ) was performed with repeated sampling of the CVAE model and ensembling of the DNN model. Following UQ, the DNN ensemble notably improved performance when compared to the baseline DNN model, while the CVAE model achieved results similar to its non-UQ results. The CVAE model was shown to have significantly less variability and higher confidence after assessment of the prediction-wise relative standard deviations. Evaluating domain generalization, both models achieved small mean error values when predicting both inside and outside the training domain, with predictions outside the training domain showing slightly larger errors. Overall, the CVAE model was comparable to the DNN regression model in predicting CHF values but with better uncertainty behavior.
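The UQ step described above (repeated sampling of the CVAE, ensembling of the DNN) reduces to summarizing repeated predictions per test point; the prediction-wise relative standard deviation mentioned at the end can be computed as below. The draw values are invented for illustration:

```python
import math

def prediction_stats(draws):
    # Mean and prediction-wise relative standard deviation (std / |mean|)
    # over repeated model draws, a simple UQ summary.
    n = len(draws)
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)
    return mean, math.sqrt(var) / abs(mean)

# Hypothetical repeated CVAE draws for one CHF test point (arbitrary units).
draws = [3.05, 2.98, 3.10, 3.02, 2.95]
mean, rsd = prediction_stats(draws)
```

A small relative standard deviation across draws corresponds to the "less variability and higher confidence" the abstract reports for the CVAE model.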

[LG-9] Leveraging Object Priors for Point Tracking ECCV2024

Link: https://arxiv.org/abs/2409.05786
Authors: Bikram Boote, Anh Thai, Wenqi Jia, Ozgur Kara, Stefan Stojanov, James M. Rehg, Sangmin Lee
Keywords-EN: Point tracking, fundamental problem, problem in computer, computer vision, vision with numerous
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ECCV 2024 ILR Workshop

Click to view abstract

Abstract:Point tracking is a fundamental problem in computer vision with numerous applications in AR and robotics. A common failure mode in long-term point tracking occurs when the predicted point leaves the object it belongs to and lands on the background or another object. We identify this as the failure to correctly capture objectness properties in learning to track. To address this limitation of prior work, we propose a novel objectness regularization approach that guides points to be aware of object priors by forcing them to stay inside the boundaries of object instances. By capturing objectness cues at training time, we avoid the need to compute object masks during testing. In addition, we leverage contextual attention to enhance the feature representation for capturing objectness at the feature level more effectively. As a result, our approach achieves state-of-the-art performance on three point tracking benchmarks, and we further validate the effectiveness of our components via ablation studies. The source code is available at: this https URL

[LG-10] Unified Neural Network Scaling Laws and Scale-time Equivalence

Link: https://arxiv.org/abs/2409.05782
Authors: Akhilan Boopathy, Ila Fiete
Keywords-EN: data volume, continue to grow, vital to understand, neural networks continue, neural networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:As neural networks continue to grow in size but datasets might not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws, which characterize how test error varies with network size and data volume, have become increasingly important. However, existing scaling laws are often applicable only in limited regimes and often do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors – model size, training time, and data volume – interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. Scale-time equivalence challenges the current practice, wherein large models are trained for small durations, and suggests that smaller models trained over extended periods could match their efficacy. It also leads to a novel method for predicting the performance of large-scale networks from small-scale networks trained for extended epochs, and vice versa. We next combine scale-time equivalence with a linear model analysis of double descent to obtain a unified theoretical scaling law, which we confirm with experiments across vision benchmarks and network architectures. These laws explain several previously unexplained phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not necessarily enhance performance. Our findings hold significant implications for the practical deployment of neural networks, offering a more accessible and efficient path to training and fine-tuning large models.

[LG-11] Breaking Neural Network Scaling Laws with Modularity

Link: https://arxiv.org/abs/2409.05780
Authors: Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete
Keywords-EN: visual question answering, answering to robotics, ranging from visual, visual question, question answering
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks’ superior ability to model the compositional and combinatorial structure of real-world problems. However, a theoretical explanation of how modularity improves generalizability, and how to leverage task modularity while training networks remains elusive. Using recent theoretical progress in explaining neural network generalization, we investigate how the amount of training data required to generalize on a task varies with the intrinsic dimensionality of a task’s input. We show theoretically that when applied to modularly structured tasks, while nonmodular networks require an exponential number of samples with task dimensionality, modular networks’ sample complexity is independent of task dimensionality: modular networks can generalize in high dimensions. We then develop a novel learning rule for modular networks to exploit this advantage and empirically show the improved generalization of the rule, both in- and out-of-distribution, on high-dimensional, modular tasks.

[LG-12] Advanced LSTM Neural Networks for Predicting Directional Changes in Sector-Specific ETFs Using Machine Learning Techniques

Link: https://arxiv.org/abs/2409.05778
Authors: Rifa Gowani, Zaryab Kanjiani
Keywords-EN: supplementary income stream, Trading and investing, full-time career, income stream, simply a supplementary
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Trading and investing in stocks for some is their full-time career, while for others, it’s simply a supplementary income stream. Universal among all investors is the desire to turn a profit. The key to achieving this goal is diversification. Spreading investments across sectors is critical to profitability and maximizing returns. This study aims to gauge the viability of machine learning methods in practicing the principle of diversification to maximize portfolio returns. To test this, the study evaluates the Long-Short Term Memory (LSTM) model across nine different sectors and over 2,200 stocks using Vanguard’s sector-based ETFs. The R-squared value across all sectors showed promising results, with an average of 0.8651 and a high of 0.942 for the VNQ ETF. These findings suggest that the LSTM model is a capable and viable model for accurately predicting directional changes across various industry sectors, helping investors diversify and grow their portfolios.
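The prediction target here, directional change, is straightforward to construct from a price series. A minimal sketch, using made-up closing prices and a naive persistence baseline in place of the paper's LSTM, is:

```python
def directional_labels(prices):
    # 1 if the next close is higher than the current one, else 0.
    return [1 if b > a else 0 for a, b in zip(prices, prices[1:])]

def directional_accuracy(y_true, y_pred):
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

closes = [100.0, 101.2, 100.8, 102.5, 102.9, 101.7]  # hypothetical ETF closes
labels = directional_labels(closes)                   # [1, 0, 1, 1, 0]
# Naive persistence baseline: predict that yesterday's direction repeats.
preds = [labels[0]] + labels[:-1]
acc = directional_accuracy(labels, preds)
```

An LSTM would replace the persistence baseline with a model over windows of past features; the labeling and scoring stay the same.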

[LG-13] Are Heterophily-Specific GNNs and Homophily Metrics Really Effective? Evaluation Pitfalls and New Benchmarks

Link: https://arxiv.org/abs/2409.05755
Authors: Sitao Luan, Qincheng Lu, Chenqing Hua, Xinyu Wang, Jiaqi Zhu, Xiao-Wen Chang, Guy Wolf, Jian Tang
Keywords-EN: Graph Neural Networks, Neural Networks, achieved great success, machine learning tasks, Graph Neural
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2407.09618

Click to view abstract

Abstract:Over the past decade, Graph Neural Networks (GNNs) have achieved great success on machine learning tasks with relational data. However, recent studies have found that heterophily can cause significant performance degradation of GNNs, especially on node-level tasks. Numerous heterophilic benchmark datasets have been put forward to validate the efficacy of heterophily-specific GNNs and various homophily metrics have been designed to help people recognize these malignant datasets. Nevertheless, there still exist multiple pitfalls that severely hinder the proper evaluation of new models and metrics. In this paper, we point out the three most serious pitfalls: 1) a lack of hyperparameter tuning; 2) insufficient model evaluation on the real challenging heterophilic datasets; 3) missing quantitative evaluation benchmark for homophily metrics on synthetic graphs. To overcome these challenges, we first train and fine-tune baseline models on the 27 most widely used benchmark datasets, categorize them into three distinct groups: malignant, benign and ambiguous heterophilic datasets, and identify the real challenging subsets of tasks. To the best of our knowledge, we are the first to propose such a taxonomy. Then, we re-evaluate 10 heterophily-specific state-of-the-art (SOTA) GNNs with fine-tuned hyperparameters on different groups of heterophilic datasets. Based on the model performance, we reassess their effectiveness on addressing the heterophily challenge. Finally, we evaluate 11 popular homophily metrics on synthetic graphs with three different generation approaches. To compare the metrics strictly, we propose the first quantitative evaluation method based on Fréchet distance.
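The quantitative comparison of homophily metrics rests on the Fréchet distance; a standard discrete variant, the Eiter-Mannila dynamic program, can be sketched as follows. The two example curves are hypothetical, not taken from the paper:

```python
def discrete_frechet(p, q):
    # Discrete Fréchet distance between two polylines (Eiter & Mannila DP).
    n, m = len(p), len(q)
    d = lambda a, b: abs(a - b)  # 1-D points; use a Euclidean norm for vectors
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = d(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = cost
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], cost)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], cost)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), cost)
    return ca[-1][-1]

# E.g. compare a homophily metric's curve against model accuracy across graphs.
metric_curve = [0.1, 0.4, 0.5, 0.9]
accuracy_curve = [0.2, 0.4, 0.6, 0.8]
dist = discrete_frechet(metric_curve, accuracy_curve)
```

A smaller distance means the metric's curve tracks the accuracy curve more closely, which is the sense in which a homophily metric can be scored quantitatively.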

[LG-14] pFedGPA: Diffusion-based Generative Parameter Aggregation for Personalized Federated Learning

Link: https://arxiv.org/abs/2409.05701
Authors: Jiahao Lai, Jiaqi Li, Jian Xu, Yanru Wu, Boshi Tang, Siqi Chen, Yongfeng Huang, Wenbo Ding, Yang Li
Keywords-EN: Federated Learning, Federated Averaging, data remains local, offers a decentralized, model
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) offers a decentralized approach to model training, where data remains local and only model parameters are shared between the clients and the central server. Traditional methods, such as Federated Averaging (FedAvg), linearly aggregate these parameters which are usually trained on heterogeneous data distributions, potentially overlooking the complex, high-dimensional nature of the parameter space. This can result in degraded performance of the aggregated model. While personalized FL approaches can mitigate the heterogeneous data issue to some extent, the limitation of linear aggregation remains unresolved. To alleviate this issue, we investigate the generative approach of diffusion model and propose a novel generative parameter aggregation framework for personalized FL, pFedGPA. In this framework, we deploy a diffusion model on the server to integrate the diverse parameter distributions and propose a parameter inversion method to efficiently generate a set of personalized parameters for each client. This inversion method transforms the uploaded parameters into a latent code, which is then aggregated through denoising sampling to produce the final personalized parameters. By encoding the dependence of a client’s model parameters on the specific data distribution using the high-capacity diffusion model, pFedGPA can effectively decouple the complexity of the overall distribution of all clients’ model parameters from the complexity of each individual client’s parameter distribution. Our experimental results consistently demonstrate the superior performance of the proposed method across multiple datasets, surpassing baseline approaches.
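For contrast with the generative aggregation proposed here, the linear-aggregation baseline the abstract criticizes (FedAvg) is just a data-size-weighted average of client parameter vectors, sketched below with made-up numbers:

```python
def fedavg(client_params, client_sizes):
    # FedAvg baseline: aggregate client parameter vectors by a
    # data-size-weighted linear average (the step pFedGPA replaces
    # with diffusion-based generative aggregation).
    total = sum(client_sizes)
    dim = len(client_params[0])
    agg = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        w = n / total
        for k in range(dim):
            agg[k] += w * params[k]
    return agg

clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical 2-parameter models
sizes = [10, 10, 20]                             # local dataset sizes
global_params = fedavg(clients, sizes)           # [3.5, 4.5]
```

Averaging coordinate-wise like this ignores any structure in the parameter space, which is exactly the limitation the diffusion-based aggregation targets.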

[LG-15] MANA-Net: Mitigating Aggregated Sentiment Homogenization with News Weighting for Enhanced Market Prediction CIKM24

Link: https://arxiv.org/abs/2409.05698
Authors: Mengyu Wang, Tiejun Ma
Keywords-EN: widely acknowledged, acknowledged that extracting, Aggregated Sentiment Homogenization, extracting market sentiments, data benefits market
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
Comments: Accepted by CIKM 24

Click to view abstract

Abstract:It is widely acknowledged that extracting market sentiments from news data benefits market predictions. However, existing methods of using financial sentiments remain simplistic, relying on equal-weight and static aggregation to manage sentiments from multiple news items. This leads to a critical issue termed "Aggregated Sentiment Homogenization", which has been explored through our analysis of a large financial news dataset from industry practice. This phenomenon occurs when aggregating numerous sentiments, causing representations to converge towards the mean values of sentiment distributions and thereby smoothing out unique and important information. Consequently, the aggregated sentiment representations lose much predictive value of news data. To address this problem, we introduce the Market Attention-weighted News Aggregation Network (MANA-Net), a novel method that leverages a dynamic market-news attention mechanism to aggregate news sentiments for market prediction. MANA-Net learns the relevance of news sentiments to price changes and assigns varying weights to individual news items. By integrating the news aggregation step into the networks for market prediction, MANA-Net allows for trainable sentiment representations that are optimized directly for prediction. We evaluate MANA-Net using the S&P 500 and NASDAQ 100 indices, along with financial news spanning from 2003 to 2018. Experimental results demonstrate that MANA-Net outperforms various recent market prediction methods, enhancing Profit & Loss by 1.1% and the daily Sharpe ratio by 0.252.
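The homogenization effect and the attention-weighted remedy can be illustrated numerically. The relevance scores below are hypothetical stand-ins for MANA-Net's learned market-news attention, not values from the paper:

```python
import math

def equal_weight(sentiments):
    # The criticized baseline: a flat mean over all news items.
    return sum(sentiments) / len(sentiments)

def attention_weight(sentiments, relevance):
    # Softmax over (hypothetical) learned relevance scores, so salient
    # news dominates instead of being averaged away.
    m = max(relevance)
    exps = [math.exp(r - m) for r in relevance]
    z = sum(exps)
    return sum(s * e / z for s, e in zip(sentiments, exps))

# One strongly negative, market-moving item among mildly positive noise.
sents = [-0.9, 0.2, 0.1, 0.15, 0.2]
relev = [4.0, 0.5, 0.3, 0.2, 0.4]
flat = equal_weight(sents)          # homogenized toward ~0
attn = attention_weight(sents, relev)
```

The flat mean nearly cancels the market-moving signal, while the attention-weighted aggregate preserves it, which is the intuition behind weighting news by learned relevance.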

[LG-16] Segmentation by Factorization: Unsupervised Semantic Segmentation for Pathology by Factorizing Foundation Model Features

Link: https://arxiv.org/abs/2409.05697
Authors: Jacob Gildenblat, Ofir Hadar
Keywords-EN: deep learning models, deep learning, pre-trained deep learning, Segmentation, Genome Atlas Program
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We introduce Segmentation by Factorization (F-SEG), an unsupervised segmentation method for pathology that generates segmentation masks from pre-trained deep learning models. F-SEG allows the use of pre-trained deep neural networks, including recently developed pathology foundation models, for semantic segmentation. It achieves this without requiring additional training or finetuning, by factorizing the spatial features extracted by the models into segmentation masks and their associated concept features. We create generic tissue phenotypes for H&E images by training clustering models for multiple numbers of clusters on features extracted from several deep learning models on The Cancer Genome Atlas Program (TCGA), and then show how the clusters can be used for factorizing corresponding segmentation masks using off-the-shelf deep learning models. Our results show that F-SEG provides robust unsupervised segmentation capabilities for H&E pathology images, and that the segmentation quality is greatly improved by utilizing pathology foundation models. We discuss and propose methods for evaluating the performance of unsupervised segmentation in pathology.

[LG-17] Extracting the U.S. building types from OpenStreetMap data

Link: https://arxiv.org/abs/2409.05692
Authors: Henrique F. de Arruda, Sandro M. Reia, Shiyang Ruan, Kuldip S. Atwal, Hamdi Kavak, Taylor Anderson, Dieter Pfoser
Keywords-EN: emergency response applications, traffic planning, population estimation, response applications, crucial for population
Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Building type information is crucial for population estimation, traffic planning, urban planning, and emergency response applications. Although essential, such data is often not readily available. To alleviate this problem, this work creates a comprehensive dataset by providing residential/non-residential building classification covering the entire United States. We propose and utilize an unsupervised machine learning method to classify building types based on building footprints and available OpenStreetMap information. The classification result is validated using authoritative ground truth data for select counties in the U.S. The validation shows a high precision for non-residential building classification and a high recall for residential buildings. We identified various approaches to improving the quality of the classification, such as removing sheds and garages from the dataset. Furthermore, analyzing the misclassifications revealed that they are mainly due to missing and scarce metadata in OSM. A major result of this work is the resulting dataset, which classifies 67,705,475 buildings. We hope that this data is of value to the scientific community, including urban and transportation planners.
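While the paper uses an unsupervised method, the flavor of tag-driven building classification in OSM can be sketched with a simple hand-written heuristic. The tag sets below are chosen for illustration and are not the paper's model:

```python
# Common OSM `building=*` values (illustrative, not exhaustive).
RESIDENTIAL = {"house", "apartments", "residential", "detached", "terrace", "bungalow"}
NON_RESIDENTIAL = {"retail", "industrial", "commercial", "office", "warehouse", "church", "school"}

def classify_building(tags):
    # Classify an OSM building as residential / non-residential / unknown
    # from its tags. A toy heuristic, not the paper's unsupervised pipeline.
    value = tags.get("building", "").lower()
    if value in RESIDENTIAL:
        return "residential"
    if value in NON_RESIDENTIAL:
        return "non-residential"
    if "shop" in tags or "amenity" in tags:
        return "non-residential"
    return "unknown"

labels = [classify_building(t) for t in (
    {"building": "house"},
    {"building": "yes", "shop": "bakery"},
    {"building": "yes"},
)]
```

The third case, `building=yes` with no other metadata, is exactly the "missing and scarce metadata" situation the abstract identifies as the main source of misclassification.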

[LG-18] Zero-shot Outlier Detection via Prior-data Fitted Networks: Model Selection Bygone!

Link: https://arxiv.org/abs/2409.05672
Authors: Yuchen Shen, Haomin Wen, Leman Akoglu
Keywords-EN: finds numerous applications, environmental monitoring, finds numerous, numerous applications, applications in environmental
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint

Click to view abstract

Abstract:Outlier detection (OD) has a vast literature as it finds numerous applications in environmental monitoring, cybersecurity, finance, and medicine to name a few. Being an inherently unsupervised task, model selection is a key bottleneck for OD (both algorithm and hyperparameter selection) without label supervision. There is a long list of techniques to choose from – both classical algorithms and deep neural architectures – and while several studies report their hyperparameter sensitivity, the literature is quite slim on unsupervised model selection – limiting the effective use of OD in practice. In this paper we present FoMo-0D, for zero/0-shot OD, exploring a transformative new direction that bypasses the hurdle of model selection altogether (!), thus breaking new ground. The fundamental idea behind FoMo-0D is the Prior-data Fitted Networks, recently introduced by Muller et al. (2022), which trains a Transformer model on a large body of synthetically generated data from a prior data distribution. In essence, FoMo-0D is a pretrained Foundation Model for zero/0-shot OD on tabular data, which can directly predict the (outlier/inlier) label of any test data at inference time, by merely a single forward pass – making obsolete the need for choosing an algorithm/architecture, tuning its associated hyperparameters, and even training any model parameters when given a new OD dataset. Extensive experiments on 57 public benchmark datasets against 26 baseline methods show that FoMo-0D performs statistically no different from the top 2nd baseline, while significantly outperforming the majority of the baselines, with an average inference time of 7.7 ms per test sample.

[LG-19] Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models

Link: https://arxiv.org/abs/2409.05668
Authors: Aakash Sen Sharma, Niladri Sarkar, Vikram Chundawat, Ankur A Mali, Murari Mandal
Keywords-EN: Recent research, diffusion models, model unlearning methods, existing unlearning methods, unlearning methods
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent research has seen significant interest in methods for concept removal and targeted forgetting in diffusion models. In this paper, we conduct a comprehensive white-box analysis to expose significant vulnerabilities in existing diffusion model unlearning methods. We show that the objective functions used for unlearning in the existing methods lead to decoupling of the targeted concepts (meant to be forgotten) for the corresponding prompts. This is concealment and not actual unlearning, which was the original goal. The ineffectiveness of current methods stems primarily from their narrow focus on reducing generation probabilities for specific prompt sets, neglecting the diverse modalities of intermediate guidance employed during the inference process. The paper presents a rigorous theoretical and empirical examination of four commonly used techniques for unlearning in diffusion models. We introduce two new evaluation metrics: Concept Retrieval Score (CRS) and Concept Confidence Score (CCS). These metrics are based on a successful adversarial attack setup that can recover forgotten concepts from unlearned diffusion models. The CRS measures the similarity between the latent representations of the unlearned and fully trained models after unlearning. It reports the extent of retrieval of the forgotten concepts with increasing amount of guidance. The CCS quantifies the confidence of the model in assigning the target concept to the manipulated data. It reports the probability of the unlearned model’s generations to be aligned with the original domain knowledge with increasing amount of guidance. Evaluating existing unlearning methods with our proposed stringent metrics for diffusion models reveals significant shortcomings in their ability to truly unlearn concepts. Source Code: this https URL

[LG-20] Real-Time Human Action Recognition on Embedded Platforms

Link: https://arxiv.org/abs/2409.05662
Authors: Ruiqi Wang, Zichen Wang, Peiqi Gao, Mingzhen Li, Jaehwan Jeong, Yihang Xu, Yejin Lee, Lisa Connor, Chenyang Lu
Keywords-EN: video-based human action, human action recognition, motion feature extractor, video-based human, Integrated Motion Feature
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With advancements in computer vision and deep learning, video-based human action recognition (HAR) has become practical. However, due to the complexity of the computation pipeline, running HAR on live video streams incurs excessive delays on embedded platforms. This work tackles the real-time performance challenges of HAR with four contributions: 1) an experimental study identifying a standard Optical Flow (OF) extraction technique as the latency bottleneck in a state-of-the-art HAR pipeline, 2) an exploration of the latency-accuracy tradeoff between the standard and deep learning approaches to OF extraction, which highlights the need for a novel, efficient motion feature extractor, 3) the design of Integrated Motion Feature Extractor (IMFE), a novel single-shot neural network architecture for motion feature extraction with drastic improvement in latency, 4) the development of RT-HARE, a real-time HAR system tailored for embedded platforms. Experimental results on an Nvidia Jetson Xavier NX platform demonstrated that RT-HARE realizes real-time HAR at a video frame rate of 30 frames per second while delivering high levels of recognition accuracy.

[LG-21] Adversarial Attacks on Data Attribution

Link: https://arxiv.org/abs/2409.05657
Authors: Xinhe Wang, Pingbang Hu, Junwei Deng, Jiaqi W. Ma
Keywords-EN: compensate data providers, Shadow Attack, Data attribution, Outlier Attack, data attribution methods
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Data attribution aims to quantify the contribution of individual training data points to the outputs of an AI model, which has been used to measure the value of training data and compensate data providers. Given the impact on financial decisions and compensation mechanisms, a critical question arises concerning the adversarial robustness of data attribution methods. However, there has been little to no systematic research addressing this issue. In this work, we aim to bridge this gap by detailing a threat model with clear assumptions about the adversary’s goal and capabilities, and by proposing principled adversarial attack methods on data attribution. We present two such methods, Shadow Attack and Outlier Attack, both of which generate manipulated datasets to adversarially inflate the compensation. The Shadow Attack leverages knowledge about the data distribution in the AI applications, and derives adversarial perturbations through “shadow training”, a technique commonly used in membership inference attacks. In contrast, the Outlier Attack does not assume any knowledge about the data distribution and relies solely on black-box queries to the target model’s predictions. It exploits an inductive bias present in many data attribution methods - outlier data points are more likely to be influential - and employs adversarial examples to generate manipulated datasets. Empirically, in image classification and text generation tasks, the Shadow Attack can inflate the data-attribution-based compensation by at least 200%, while the Outlier Attack achieves compensation inflation ranging from 185% to as much as 643%.

[LG-22] Interactive incremental learning of generalizable skills with local trajectory modulation

Link: https://arxiv.org/abs/2409.05655
Authors: Markus Knauer, Alin Albu-Schäffer, Freek Stulp, João Silvério
Keywords-EN: received considerable attention, received considerable, movement primitives, approaches have emerged, considerable attention
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 21 pages, 16 figures

Click to view abstract

Abstract:The problem of generalization in learning from demonstration (LfD) has received considerable attention over the years, particularly within the context of movement primitives, where a number of approaches have emerged. Recently, two important approaches have gained recognition. While one leverages via-points to adapt skills locally by modulating demonstrated trajectories, another relies on so-called task-parameterized models that encode movements with respect to different coordinate systems, using a product of probabilities for generalization. While the former are well-suited to precise, local modulations, the latter aim at generalizing over large regions of the workspace and often involve multiple objects. Addressing the quality of generalization by leveraging both approaches simultaneously has received little attention. In this work, we propose an interactive imitation learning framework that simultaneously leverages local and global modulations of trajectory distributions. Building on the kernelized movement primitives (KMP) framework, we introduce novel mechanisms for skill modulation from direct human corrective feedback. Our approach particularly exploits the concept of via-points to incrementally and interactively 1) improve the model accuracy locally, 2) add new objects to the task during execution and 3) extend the skill into regions where demonstrations were not provided. We evaluate our method on a bearing ring-loading task using a torque-controlled, 7-DoF, DLR SARA robot.

[LG-23] Forward KL Regularized Preference Optimization for Aligning Diffusion Policies

Link: https://arxiv.org/abs/2409.05622
Authors: Zhao Shan, Chenyou Fan, Shuang Qiu, Jiyuan Shi, Chenjia Bai
Keywords-EN: expressive model capabilities, highly expressive model, achieved remarkable success, expressive model, model capabilities
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Diffusion models have achieved remarkable success in sequential decision-making by leveraging the highly expressive model capabilities in policy learning. A central problem for learning diffusion policies is to align the policy output with human intents in various tasks. To achieve this, previous methods conduct return-conditioned policy generation or Reinforcement Learning (RL)-based policy optimization, while they both rely on pre-defined reward functions. In this work, we propose a novel framework, Forward KL regularized Preference optimization for aligning Diffusion policies, to align the diffusion policy with preferences directly. We first train a diffusion policy from the offline dataset without considering the preference, and then align the policy to the preference data via direct preference optimization. During the alignment phase, we formulate direct preference learning in a diffusion policy, where the forward KL regularization is employed in preference optimization to avoid generating out-of-distribution actions. We conduct extensive experiments for MetaWorld manipulation and D4RL tasks. The results show our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.

[LG-24] Joint Input and Output Coordination for Class-Incremental Learning IJCAI2024

Link: https://arxiv.org/abs/2409.05620
Authors: Shuai Wang, Yibing Zhan, Yong Luo, Han Hu, Wei Yu, Yonggang Wen, Dacheng Tao
Keywords-EN: severe catastrophic forgetting, Incremental learning, catastrophic forgetting, nontrivial due, due to severe
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures. Accepted by IJCAI 2024

Click to view abstract

Abstract:Incremental learning is nontrivial due to severe catastrophic forgetting. Although storing a small amount of data on old tasks during incremental learning is a feasible solution, current strategies still do not 1) adequately address the class bias problem, 2) alleviate the mutual interference between new and old tasks, or 3) consider the problem of class bias within tasks. This motivates us to propose a joint input and output coordination (JIOC) mechanism to address these issues. This mechanism assigns different weights to different categories of data according to the gradient of the output score, and uses knowledge distillation (KD) to reduce the mutual interference between the outputs of old and new tasks. The proposed mechanism is general and flexible, and can be incorporated into different incremental learning approaches that use memory storage. Extensive experiments show that our mechanism can significantly improve their performance.
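The knowledge-distillation component mentioned above typically takes the form of a temperature-softened KL divergence between old-task (teacher) and current (student) outputs. A generic sketch, not JIOC's exact weighting scheme, is:

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits / temperature.
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # Hinton-style distillation: KL(teacher || student) on softened
    # distributions, scaled by T^2; used here to keep old-task outputs stable.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * temperature ** 2

same = kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])   # identical outputs -> 0 loss
drift = kd_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])  # reversed outputs -> large loss
```

Minimizing this term penalizes the new model for drifting away from the old model's outputs on replayed data, which is how KD curbs interference between old and new tasks.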

[LG-25] Normalizing Energy Consumption for Hardware-Independent Evaluation

链接: https://arxiv.org/abs/2409.05602
作者: Constance Douwes,Romain Serizel
关键词-EN: resource-intensive training phases, machine learning, models in signal, environmental impact, training phases
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing use of machine learning (ML) models in signal processing has raised concerns about their environmental impact, particularly during resource-intensive training phases. In this study, we present a novel methodology for normalizing energy consumption across different hardware platforms to facilitate fair and consistent comparisons. We evaluate different normalization strategies by measuring the energy used to train different ML architectures on different GPUs, focusing on audio tagging tasks. Our approach shows that the number of reference points, the type of regression and the inclusion of computational metrics significantly influence the normalization process. We find that the appropriate selection of two reference points provides robust normalization, while incorporating the number of floating-point operations and parameters improves the accuracy of energy consumption predictions. By supporting more accurate energy consumption evaluation, our methodology promotes the development of environmentally sustainable ML practices.
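A minimal sketch of the two-reference-point idea: fit a least-squares line from energies measured on one GPU to a reference GPU using two reference workloads, then map any new measurement onto the reference scale. All energy figures below are hypothetical:

```python
def fit_linear(xs, ys):
    # least-squares line through the reference measurements
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# energy (kWh) of two reference trainings measured on GPU A and on a
# reference GPU R (numbers hypothetical)
ref_a = [1.0, 4.0]
ref_r = [0.8, 3.5]
a, b = fit_linear(ref_a, ref_r)

def normalize(energy_on_a):
    # express a GPU-A measurement on the reference GPU's scale
    return a + b * energy_on_a

print(round(normalize(2.0), 6))  # → 1.7
```

With only two reference points the line is exact through both, which matches the paper's finding that two well-chosen points already provide robust normalization.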

[LG-26] SynMorph: Generating Synthetic Face Morphing Dataset with Mated Samples

链接: https://arxiv.org/abs/2409.05595
作者: Haoyu Zhang,Raghavendra Ramachandra,Kiran Raja,Christoph Busch
关键词-EN: synthetic face morphing, morphing attack detection, face morphing dataset, face recognition systems, proposed synthetic face
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Face morphing attack detection (MAD) algorithms have become essential to overcome the vulnerability of face recognition systems. To solve the lack of large-scale and public-available datasets due to privacy concerns and restrictions, in this work we propose a new method to generate a synthetic face morphing dataset with 2450 identities and more than 100k morphs. The proposed synthetic face morphing dataset is unique for its high-quality samples, different types of morphing algorithms, and the generalization for both single and differential morphing attack detection algorithms. For experiments, we apply face image quality assessment and vulnerability analysis to evaluate the proposed synthetic face morphing dataset from the perspective of biometric sample quality and morphing attack potential on face recognition systems. The results are benchmarked with an existing SOTA synthetic dataset and a representative non-synthetic and indicate improvement compared with the SOTA. Additionally, we design different protocols and study the applicability of using the proposed synthetic dataset on training morphing attack detection algorithms.

[LG-27] Learning to Model Graph Structural Information on MLPs via Graph Structure Self-Contrasting

链接: https://arxiv.org/abs/2409.05573
作者: Lirong Wu,Haitao Lin,Guojiang Zhao,Cheng Tan,Stan Z. Li
关键词-EN: Graph Neural Networks, Neural Networks, witnessed great success, handling graph-related tasks, Recent years
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent years have witnessed great success in handling graph-related tasks with Graph Neural Networks (GNNs). However, most existing GNNs are based on message passing to perform feature aggregation and transformation, where the structural information is explicitly involved in the forward propagation by coupling with node features through graph convolution at each layer. As a result, subtle feature noise or structure perturbation may cause severe error propagation, resulting in extremely poor robustness. In this paper, we rethink the roles played by graph structural information in graph data training and identify that message passing is not the only path to modeling structural information. Inspired by this, we propose a simple but effective Graph Structure Self-Contrasting (GSSC) framework that learns graph structural information without message passing. The proposed framework is based purely on Multi-Layer Perceptrons (MLPs), where the structural information is only implicitly incorporated as prior knowledge to guide the computation of supervision signals, substituting the explicit message propagation as in GNNs. Specifically, it first applies structural sparsification to remove potentially uninformative or noisy edges in the neighborhood, and then performs structural self-contrasting in the sparsified neighborhood to learn robust node representations. Finally, structural sparsification and self-contrasting are formulated as a bi-level optimization problem and solved in a unified framework. Extensive experiments have qualitatively and quantitatively demonstrated that the GSSC framework can produce truly encouraging performance with better generalization and robustness than other leading competitors.

[LG-28] SciAgents : Automating scientific discovery through multi-agent intelligent graph reasoning

链接: https://arxiv.org/abs/2409.05556
作者: Alireza Ghafarollahi,Markus J. Buehler
关键词-EN: identifying complex patterns, advancing scientific understanding, uncovering previously unseen, previously unseen connections, vast scientific data
类目: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A key challenge in artificial intelligence is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. In this work, we present SciAgents, an approach that leverages three core concepts: (1) the use of large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi-agent systems with in-situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses traditional human-driven research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the intelligent system yields material discoveries, critiques and improves existing hypotheses, retrieves up-to-date data about existing research, and highlights their strengths and limitations. Our case studies demonstrate scalable capabilities to combine generative AI, ontological representations, and multi-agent modeling, harnessing a "swarm of intelligence" similar to biological systems. This provides new avenues for materials discovery and accelerates the development of advanced materials by unlocking Nature's design principles.

[LG-29] CoBo: Collaborative Learning via Bilevel Optimization

链接: https://arxiv.org/abs/2409.05539
作者: Diba Hashemi,Lie He,Martin Jaggi
关键词-EN: train multiple clients, important tool, tool to train, train multiple, effectively by enabling
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Collaborative learning is an important tool to train multiple clients more effectively by enabling communication among clients. Identifying helpful clients, however, presents challenges and often introduces significant overhead. In this paper, we model client-selection and model-training as two interconnected optimization problems, proposing a novel bilevel optimization problem for collaborative learning. We introduce CoBo, a scalable and elastic SGD-type alternating optimization algorithm that efficiently addresses these problems with theoretical convergence guarantees. Empirically, CoBo achieves superior performance, surpassing popular personalization algorithms by 9.3% in accuracy on a task with high heterogeneity, involving datasets distributed among 80 clients.
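The alternating SGD-type scheme can be caricatured on a toy scalar problem: the inner step fits the model on a client-weighted loss, the outer step reweights clients, down-weighting those that fit poorly. The softmax reweighting and all numbers below are illustrative assumptions, not CoBo's actual update rules:

```python
import math

clients = [1.0, 1.1, 5.0]        # per-client targets; the third is an outlier
w = 0.0                          # shared scalar "model"
alpha = [1 / 3] * 3              # collaboration weights
lr_w, temp = 0.1, 0.5

for _ in range(200):
    # inner problem: one SGD step on the model under the current weights
    grad_w = sum(a * (w - t) for a, t in zip(alpha, clients))
    w -= lr_w * grad_w
    # outer problem: reweight clients so high-loss (unhelpful) ones fade
    scores = [math.exp(-temp * (w - t) ** 2) for t in clients]
    s = sum(scores)
    alpha = [x / s for x in scores]

print(round(w, 2), round(alpha[2], 4))
```

The model settles near the two compatible clients (around 1.05) while the outlier's collaboration weight collapses, which is the kind of client-selection behavior the bilevel formulation is after.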

[LG-30] Interpolation Extrapolation Hyperpolation: Generalising into new dimensions

链接: https://arxiv.org/abs/2409.05513
作者: Toby Ord
关键词-EN: familiar concepts, concepts of interpolation, interpolation and extrapolation, paper introduces, data points
类目: Machine Learning (cs.LG)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:This paper introduces the concept of hyperpolation: a way of generalising from a limited set of data points that is a peer to the more familiar concepts of interpolation and extrapolation. Hyperpolation is the task of estimating the value of a function at new locations that lie outside the subspace (or manifold) of the existing data. We shall see that hyperpolation is possible and explore its links to creativity in the arts and sciences. We will also examine the role of hyperpolation in machine learning and suggest that the lack of fundamental creativity in current AI systems is deeply connected to their limited ability to hyperpolate.

[LG-31] A general reduced-order neural operator for spatio-temporal predictive learning on complex spatial domains

链接: https://arxiv.org/abs/2409.05508
作者: Qinglu Meng,Yingguang Li,Zhiliang Deng,Xu Liu,Gengxiang Chen,Qiutong Wu,Changqing Liu,Xiaozhong Hao
关键词-EN: reduced-order neural operator, infinite-dimensional function spaces, plays a critical, critical role, complex spatial domains
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictive learning for spatio-temporal processes (PL-STP) on complex spatial domains plays a critical role in various scientific and engineering fields, with its essence being the construction of operators between infinite-dimensional function spaces. This paper focuses on the unequal-domain mappings in PL-STP, categorising them into increase-domain and decrease-domain mappings. Recent advances in deep learning have revealed the great potential of neural operators (NOs) to learn operators directly from observational data. However, existing NOs require input space and output space to be the same domain, which poses challenges in ensuring predictive accuracy and stability for unequal-domain mappings. To this end, this study presents a general reduced-order neural operator named Reduced-Order Neural Operator on Riemannian Manifolds (RO-NORM), which consists of two parts: the unequal-domain encoder/decoder and the same-domain approximator. Motivated by the variable separation in classical modal decomposition, the unequal-domain encoder/decoder uses the pre-computed bases to reformulate the spatio-temporal function as a sum of products between spatial (or temporal) bases and corresponding temporally (or spatially) distributed weight functions, thus the original unequal-domain mapping can be converted into a same-domain mapping. Consequently, the same-domain approximator NORM is applied to model the transformed mapping. The performance of our proposed method has been evaluated on six benchmark cases, including parametric PDEs, engineering and biomedical applications, and compared with four baseline algorithms: DeepONet, POD-DeepONet, PCA-Net, and vanilla NORM. The experimental results demonstrate the superiority of RO-NORM in prediction accuracy and training efficiency for PL-STP.

[LG-32] Optimizing VarLiNGAM for Scalable and Efficient Time Series Causal Discovery

链接: https://arxiv.org/abs/2409.05500
作者: Ziyang Jiao,Ce Guo,Wayne Luk
关键词-EN: combines Vector Autoregressive, Linear Non-Gaussian Acyclic, Vector Autoregressive Model, Non-Gaussian Acyclic Model, Vector Autoregressive
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Causal discovery is designed to identify causal relationships in data, a task that has become increasingly complex due to the computational demands of traditional methods such as VarLiNGAM, which combines the Vector Autoregressive Model with the Linear Non-Gaussian Acyclic Model for time series data. This study is dedicated to optimising causal discovery specifically for time series data, which is common in practical applications. Time series causal discovery is particularly challenging due to the need to account for temporal dependencies and potential time lag effects. By designing a specialised dataset generator and reducing the computational complexity of the VarLiNGAM model from $O(m^3 \cdot n)$ to $O(m^3 + m^2 \cdot n)$, this study significantly improves the feasibility of processing large datasets. The proposed methods have been validated on advanced computational platforms and tested across simulated, real-world, and large-scale datasets, showcasing enhanced efficiency and performance. The optimised algorithm achieved a 7 to 13 times speedup compared with the original algorithm and around 4.5 times speedup compared with the GPU-accelerated version on large-scale datasets with feature sizes between 200 and 400. Our methods aim to push the boundaries of current causal discovery capabilities, making them more robust, scalable, and applicable to real-world scenarios, thus facilitating breakthroughs in various fields such as healthcare and finance.

[LG-33] Using machine learning for fault detection in lighthouse light sensors

链接: https://arxiv.org/abs/2409.05495
作者: Michael Kampouridis,Nikolaos Vastardis,George Rayment
关键词-EN: ensuring maritime safety, signaling hazardous areas, aiding harbor entries, dangerous coastlines, aerial navigation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Lighthouses play a crucial role in ensuring maritime safety by signaling hazardous areas such as dangerous coastlines, shoals, reefs, and rocks, along with aiding harbor entries and aerial navigation. This is achieved through the use of photoresistor sensors that activate or deactivate based on the time of day. However, a significant issue is the potential malfunction of these sensors, leading to the gradual misalignment of the light’s operational timing. This paper introduces an innovative machine learning-based approach for automatically detecting such malfunctions. We evaluate four distinct algorithms: decision trees, random forest, extreme gradient boosting, and multi-layer perceptron. Our findings indicate that the multi-layer perceptron is the most effective, capable of detecting timing discrepancies as small as 10-15 minutes. This accuracy makes it a highly efficient tool for automating the detection of faults in lighthouse light sensors.
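As a hedged sketch of the classification task (the timing data is hypothetical, and the paper's best model is a multi-layer perceptron, while this uses a single logistic unit for brevity), a fault can be classified from the deviation between expected and observed switching time, in minutes:

```python
import math

# feature: timing deviation / 60 (normalized); label: fault if drift >= 15 min
data = [(d / 60.0, 1 if d >= 15 else 0) for d in range(0, 60, 3)]

# train a single logistic unit by stochastic gradient ascent
w, b, lr = 0.0, 0.0, 1.0
for _ in range(5000):
    for x, y in data:
        p = 1 / (1 + math.exp(-(w * x + b)))
        w += lr * (y - p) * x
        b += lr * (y - p)

def is_fault(minutes):
    return 1 / (1 + math.exp(-(w * minutes / 60.0 + b))) > 0.5

print(is_fault(5), is_fault(30))  # → False True
```

A 5-minute deviation is classified as normal and a 30-minute one as a fault; the paper's MLP reportedly resolves discrepancies down to the 10-15 minute range.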

[LG-34] CRADLE-VAE: Enhancing Single-Cell Gene Perturbation Modeling with Counterfactual Reasoning-based Artifact Disentanglement

链接: https://arxiv.org/abs/2409.05484
作者: Seungheun Baek,Soyon Park,Yan Ting Chok,Junhyun Lee,Jueon Park,Mogan Gim,Jaewoo Kang
关键词-EN: Predicting cellular responses, deep learning models, learning models playing, Predicting cellular, personalized therapeutics
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Predicting cellular responses to various perturbations is a critical focus in drug discovery and personalized therapeutics, with deep learning models playing a significant role in this endeavor. Single-cell datasets contain technical artifacts that may hinder the predictability of such models, which poses quality control issues highly regarded in this area. To address this, we propose CRADLE-VAE, a causal generative framework tailored for single-cell gene perturbation modeling, enhanced with counterfactual reasoning-based artifact disentanglement. Throughout training, CRADLE-VAE models the underlying latent distribution of technical artifacts and perturbation effects present in single-cell datasets. It employs counterfactual reasoning to effectively disentangle such artifacts by modulating the latent basal spaces and learns robust features for generating cellular response data with improved quality. Experimental results demonstrate that this approach improves not only treatment effect estimation performance but also generative quality as well. The CRADLE-VAE codebase is publicly available at this https URL.

[LG-35] Retrofitting Temporal Graph Neural Networks with Transformer

链接: https://arxiv.org/abs/2409.05477
作者: Qiang Huang,Xiao Yan,Xin Wang,Susie Xi Rao,Zhichao Han,Fangcheng Fu,Wentao Zhang,Jiawei Jiang
关键词-EN: outperform regular GNNs, incorporating time information, graph neural networks, neural networks, outperform regular
类目: Machine Learning (cs.LG)
*备注: conference Under review

点击查看摘要

Abstract:Temporal graph neural networks (TGNNs) outperform regular GNNs by incorporating time information into graph-based operations. However, TGNNs adopt specialized models (e.g., TGN, TGAT, and APAN) and require tailored training frameworks (e.g., TGL and ETC). In this paper, we propose TF-TGN, which uses Transformer decoder as the backbone model for TGNN to enjoy Transformer's codebase for efficient training. In particular, Transformer achieves tremendous success for language modeling, and thus the community developed high-performance kernels (e.g., flash-attention and memory-efficient attention) and efficient distributed training schemes (e.g., PyTorch FSDP, DeepSpeed, and Megatron-LM). We observe that TGNN resembles language modeling, i.e., the message aggregation operation between chronologically occurring nodes and their temporal neighbors in TGNNs can be structured as sequence modeling. Besides this similarity, we also incorporate a series of algorithm designs including suffix infilling, temporal graph attention with self-loop, and causal masking self-attention to make TF-TGN work. During training, existing systems are slow in transforming the graph topology and conducting graph sampling. As such, we propose methods to parallelize the CSR format conversion and graph sampling. We also adapt the Transformer codebase to train TF-TGN efficiently with multiple GPUs. We experiment with 9 graphs and compare with 2 state-of-the-art TGNN training frameworks. The results show that TF-TGN can accelerate training by over 2.20× while providing comparable or even superior accuracy to existing SOTA TGNNs. TF-TGN is available at this https URL.

[LG-36] Beyond Flatland: A Geometric Take on Matching Methods for Treatment Effect Estimation

链接: https://arxiv.org/abs/2409.05459
作者: Melanie F. Pradier,Javier González
关键词-EN: estimate treatment effects, estimate treatment, popular approach, pairing treated, treated and control
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Matching is a popular approach in causal inference to estimate treatment effects by pairing treated and control units that are most similar in terms of their covariate information. However, classic matching methods completely ignore the geometry of the data manifold, which is crucial to define a meaningful distance for matching, and struggle when covariates are noisy and high-dimensional. In this work, we propose GeoMatching, a matching method to estimate treatment effects that takes into account the intrinsic data geometry induced by existing causal mechanisms among the confounding variables. First, we learn a low-dimensional, latent Riemannian manifold that accounts for uncertainty and geometry of the original input data. Second, we estimate treatment effects via matching in the latent space based on the learned latent Riemannian metric. We provide theoretical insights and empirical results in synthetic and real-world scenarios, demonstrating that GeoMatching yields more effective treatment effect estimators, even as we increase input dimensionality, in the presence of outliers, or in semi-supervised scenarios.
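For contrast, the classic 1-nearest-neighbour matching estimator of the average treatment effect on the treated (ATT), which GeoMatching refines with a learned latent Riemannian metric, can be sketched on toy data (all numbers hypothetical):

```python
# (covariates, outcome) pairs; numbers hypothetical
treated = [([0.1], 5.0), ([0.9], 7.0)]
control = [([0.0], 4.0), ([1.0], 5.5)]

def att(treated, control):
    effects = []
    for x, y in treated:
        # match each treated unit to its nearest control in covariate space
        _, y_matched = min(
            control,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c[0])),
        )
        effects.append(y - y_matched)
    return sum(effects) / len(effects)

print(att(treated, control))  # → 1.25
```

GeoMatching's point is that the Euclidean `min` above is a poor distance when covariates are noisy and high-dimensional; replacing it with a metric on a learned low-dimensional manifold is the paper's contribution.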

[LG-37] State-Novelty Guided Action Persistence in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.05433
作者: Jianshu Hu,Paul Weng,Yutong Ban
关键词-EN: deep reinforcement learning, promising approach, deep reinforcement, reinforcement learning, exploration-exploitation dilemma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:While a powerful and promising approach, deep reinforcement learning (DRL) still suffers from sample inefficiency, which can be notably improved by resorting to more sophisticated techniques to address the exploration-exploitation dilemma. One such technique relies on action persistence (i.e., repeating an action over multiple steps). However, previous work exploiting action persistence either applies a fixed strategy or learns additional value functions (or policy) for selecting the repetition number. In this paper, we propose a novel method to dynamically adjust the action persistence based on the current exploration status of the state space. In such a way, our method does not require training of additional value functions or policy. Moreover, the use of a smooth scheduling of the repeat probability allows a more effective balance between exploration and exploitation. Furthermore, our method can be seamlessly integrated into various basic exploration strategies to incorporate temporal persistence. Finally, extensive experiments on different DMControl tasks demonstrate that our state-novelty guided action persistence method significantly improves the sample efficiency.
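One plausible reading of novelty-guided persistence can be sketched as follows; the visit-count novelty measure and the decay form are our illustrative assumptions, not the paper's actual schedule:

```python
import math

counts = {}

def repeat_prob(state, p_max=0.8):
    # novelty decays with visit count: persist more in rarely-seen regions
    counts[state] = counts.get(state, 0) + 1
    novelty = 1.0 / math.sqrt(counts[state])
    return p_max * novelty

p_first = repeat_prob("s0")          # first visit: maximally novel
for _ in range(98):
    repeat_prob("s0")
p_late = repeat_prob("s0")           # 100th visit
print(p_first, round(p_late, 3))     # → 0.8 0.08
```

A smooth schedule like this needs no extra value function or policy, which matches the abstract's claim; only the specific functional form here is hypothetical.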

[LG-38] HyperSMOTE: A Hypergraph-based Oversampling Approach for Imbalanced Node Classifications

链接: https://arxiv.org/abs/2409.05402
作者: Ziming Zhao,Tiehua Zhang,Zijian Yi,Zhishu Shen
关键词-EN: extract higher-order relationships, data scenarios due, compared to traditional, multimodal data scenarios, increasingly utilized
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hypergraphs are increasingly utilized in both unimodal and multimodal data scenarios due to their superior ability to model and extract higher-order relationships among nodes, compared to traditional graphs. However, current hypergraph models are encountering challenges related to imbalanced data, as this imbalance can lead to biases in the model towards the more prevalent classes. While the existing techniques, such as GraphSMOTE, have improved classification accuracy for minority samples in graph data, they still fall short when addressing the unique structure of hypergraphs. Inspired by the SMOTE concept, we propose HyperSMOTE as a solution to alleviate the class imbalance issue in hypergraph learning. This method involves a two-step process: initially synthesizing minority class nodes, followed by integrating the new nodes into the original hypergraph. We synthesize new nodes based on samples from minority classes and their neighbors. At the same time, to integrate the new nodes into the hypergraph, we train a decoder based on the original hypergraph incidence matrix to adaptively associate the augmented nodes with hyperedges. We conduct extensive evaluation on multiple single-modality datasets, such as Cora, Cora-CA and Citeseer, as well as the multimodal conversation dataset MELD to verify the effectiveness of HyperSMOTE, showing average performance gains of 3.38% and 2.97% on accuracy, respectively.
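The first stage's SMOTE-style synthesis step can be sketched as interpolation between a minority sample and one of its neighbours (the feature vectors below are hypothetical; the hyperedge-association decoder of the second stage is not shown):

```python
import random

random.seed(42)

def synthesize(x, neighbor):
    # SMOTE rule: new node = x + u * (neighbor - x), u ~ Uniform(0, 1)
    u = random.random()
    return [a + u * (b - a) for a, b in zip(x, neighbor)]

minority = [0.0, 1.0]
neighbor = [1.0, 3.0]
new_node = synthesize(minority, neighbor)

# the synthetic node lies on the segment between the two samples
print(all(min(a, b) <= c <= max(a, b)
          for a, b, c in zip(minority, neighbor, new_node)))  # → True
```

HyperSMOTE's novelty is less this interpolation than the follow-up step: a decoder trained on the incidence matrix decides which hyperedges the synthetic node should join.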

[LG-39] Sequential Posterior Sampling with Diffusion Models

链接: https://arxiv.org/abs/2409.05399
作者: Tristan S.W. Stevens,Oisín Nolan,Jean-Luc Robert,Ruud J.G. van Sloun
关键词-EN: perform effective posterior, model complex distributions, effective posterior sampling, quickly risen, risen in popularity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures, preprint

点击查看摘要

Abstract:Diffusion models have quickly risen in popularity for their ability to model complex distributions and perform effective posterior sampling. Unfortunately, the iterative nature of these generative models makes them computationally expensive and unsuitable for real-time sequential inverse problems such as ultrasound imaging. Considering the strong temporal structure across sequences of frames, we propose a novel approach that models the transition dynamics to improve the efficiency of sequential diffusion posterior sampling in conditional image synthesis. Through modeling sequence data using a video vision transformer (ViViT) transition model based on previous diffusion outputs, we can initialize the reverse diffusion trajectory at a lower noise scale, greatly reducing the number of iterations required for convergence. We demonstrate the effectiveness of our approach on a real-world dataset of high frame rate cardiac ultrasound images and show that it achieves the same performance as a full diffusion trajectory while accelerating inference 25×, enabling real-time posterior sampling. Furthermore, we show that the addition of a transition model improves the PSNR up to 8% in cases with severe motion. Our method opens up new possibilities for real-time applications of diffusion models in imaging and other domains requiring real-time inference.

[LG-40] Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision Language Modeling

链接: https://arxiv.org/abs/2409.05395
作者: Georgios Pantazopoulos,Malvina Nikandrou,Alessandro Suglia,Oliver Lemon,Arash Eshghi
关键词-EN: Visual Language Models, Language Models, study explores replacing, recent structured state, structured state space
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve greater performance in visual grounding and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that a task-aware encoding yields minimal performance gains on grounding; however, Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba shows promising performance on tasks where the correct output relies on a summary of the image but struggles when retrieval of explicit information from the context is required.

[LG-41] A Novel Representation of Periodic Pattern and Its Application to Untrained Anomaly Detection

链接: https://arxiv.org/abs/2409.05389
作者: Peng Ye,Chengyu Tao,Juan Du
关键词-EN: carbon fiber textiles, possess periodic textures, textures or surfaces, display panels, periodic pattern
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There are a variety of industrial products that possess periodic textures or surfaces, such as carbon fiber textiles and display panels. Traditional image-based quality inspection methods for these products require identifying the periodic patterns from normal images (without anomaly and noise) and subsequently detecting anomaly pixels with inconsistent appearances. However, it remains challenging to accurately extract the periodic pattern from a single image in the presence of unknown anomalies and measurement noise. To deal with this challenge, this paper proposes a novel self-representation of the periodic image defined on a set of continuous parameters. In this way, periodic pattern learning can be embedded into a joint optimization framework, which is named periodic-sparse decomposition, with simultaneously modeling the sparse anomalies and Gaussian noise. Finally, for the real-world industrial images that may not strictly satisfy the periodic assumption, we propose a novel pixel-level anomaly scoring strategy to enhance the performance of anomaly detection. Both simulated and real-world case studies demonstrate the effectiveness of the proposed methodology for periodic pattern learning and anomaly detection.

[LG-42] BAMDP Shaping: a Unified Theoretical Framework for Intrinsic Motivation and Reward Shaping

链接: https://arxiv.org/abs/2409.05358
作者: Aly Lidayan,Michael Dennis,Stuart Russell
关键词-EN: Intrinsic motivation, Markov Decision Processes, reinforcement learning, agents by adding, common methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Intrinsic motivation (IM) and reward shaping are common methods for guiding the exploration of reinforcement learning (RL) agents by adding pseudo-rewards. Designing these rewards is challenging, however, and they can counter-intuitively harm performance. To address this, we characterize them as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formalizes the value of exploration by formulating the RL process as updating a prior over possible MDPs through experience. RL algorithms can be viewed as BAMDP policies; instead of attempting to find optimal algorithms by solving BAMDPs directly, we use it as a theoretical framework for understanding how pseudo-rewards guide suboptimal algorithms. By decomposing BAMDP state value into the value of the information collected plus the prior value of the physical state, we show how pseudo-rewards can help by compensating for RL algorithms' misestimation of these two terms, yielding a new typology of IM and reward shaping approaches. We carefully extend the potential-based shaping theorem to BAMDPs to prove that when pseudo-rewards are BAMDP Potential-based shaping Functions (BAMPFs), they preserve optimal, or approximately optimal, behavior of RL algorithms; otherwise, they can corrupt even optimal learners. We finally give guidance on how to design or convert existing pseudo-rewards to BAMPFs by expressing assumptions about the environment as potential functions on BAMDP states.
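The classic potential-based shaping result that the paper extends to BAMDPs can be verified numerically: with F(s, s') = γΦ(s') − Φ(s), the shaped return differs from the original by a telescoping constant that is independent of the policy. The potentials and trajectory below are hypothetical:

```python
gamma = 0.9
Phi = {0: 1.0, 1: 2.5, 2: 0.0}           # hypothetical potential function
traj = [(0, 1.0, 1), (1, -0.5, 2)]       # (state, reward, next_state) steps

def ret(rewards):
    # discounted return of a reward sequence
    return sum(r * gamma ** t for t, r in enumerate(rewards))

base = ret([r for _, r, _ in traj])
shaped = ret([r + gamma * Phi[sn] - Phi[s] for s, r, sn in traj])

# telescoping: difference = gamma^T * Phi(s_T) - Phi(s_0), whatever the policy
print(round(shaped - base, 6))  # → -1.0
```

Because the offset depends only on the start and end potentials, optimal behavior is preserved; BAMPFs are the analogous construction with potentials defined on BAMDP (belief) states.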

[LG-43] Attention Based Machine Learning Methods for Data Reduction with Guaranteed Error Bounds

链接: https://arxiv.org/abs/2409.05357
作者: Xiao Li,Jaemoon Lee,Anand Rangarajan,Sanjay Ranka
关键词-EN: high energy physics, computational fluid dynamics, climate science generate, science generate vast, generate vast amounts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scientific applications in fields such as high energy physics, computational fluid dynamics, and climate science generate vast amounts of data at high velocities. This exponential growth in data production is surpassing the advancements in computing power, network capabilities, and storage capacities. To address this challenge, data compression or reduction techniques are crucial. These scientific datasets have underlying data structures that consist of structured and block structured multidimensional meshes where each grid point corresponds to a tensor. It is important that data reduction techniques leverage the strong spatial and temporal correlations that are ubiquitous in these applications. Additionally, applications such as CFD process tensors comprising a hundred-plus species and their attributes at each grid point, so reduction techniques should be able to leverage interrelationships between the elements in each tensor. In this paper, we propose an attention-based hierarchical compression method utilizing a block-wise compression setup. We introduce an attention-based hyper-block autoencoder to capture inter-block correlations, followed by a block-wise encoder to capture block-specific information. A PCA-based post-processing step is employed to guarantee error bounds for each data block. Our method effectively captures both spatiotemporal and inter-variable correlations within and between data blocks. Compared to the state-of-the-art SZ3, our method achieves up to 8 times higher compression ratio on the multi-variable S3D dataset. When evaluated on single-variable setups using the E3SM and XGC datasets, our method still achieves up to 3 times and 2 times higher compression ratio, respectively.
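The guaranteed-error-bound idea (a lossy stage plus stored corrections) can be sketched independently of the paper's autoencoder; coarse quantization stands in for the learned compressor, and the step size and bound below are illustrative assumptions:

```python
def compress_block(block, bound=0.1, step=0.5):
    # lossy stage: coarse quantization (stand-in for the learned compressor)
    coarse = [round(v / step) * step for v in block]
    # guarantee stage: store exact corrections wherever the bound is violated
    fixes = {i: v for i, (v, c) in enumerate(zip(block, coarse))
             if abs(v - c) > bound}
    return coarse, fixes

def decompress(coarse, fixes):
    return [fixes.get(i, c) for i, c in enumerate(coarse)]

block = [0.04, 0.30, 0.74, 1.26]
coarse, fixes = compress_block(block)
recon = decompress(coarse, fixes)
print(recon)  # → [0.0, 0.3, 0.74, 1.26]
```

Every reconstructed value is within the user-set bound of the original by construction; the paper achieves the same guarantee per block with a PCA-based residual correction rather than stored raw values.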

[LG-44] IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS

链接: https://arxiv.org/abs/2409.05356
作者: Ashwin Sankar,Srija Anand,Praveen Srinivasa Varadhan,Sherry Thomas,Mehak Singal,Shridhar Kumar,Deovrat Mehendale,Aditi Krishana,Giri Raju,Mitesh Khapra
关键词-EN: highly natural-sounding output, Recent advancements, produce highly natural-sounding, Indian, Indian languages
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data containing a diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.

[LG-45] On the Convergence Analysis of Over-Parameterized Variational Autoencoders: A Neural Tangent Kernel Perspective

链接: https://arxiv.org/abs/2409.05349
作者: Li Wang,Wei Huang
关键词-EN: Variational Auto-Encoders, powerful probabilistic models, Stochastic Neural Network, emerged as powerful, powerful probabilistic
类目: Machine Learning (cs.LG)
*备注: Accepted by Machine Learning journal

点击查看摘要

Abstract:Variational Auto-Encoders (VAEs) have emerged as powerful probabilistic models for generative tasks. However, their convergence properties have not been rigorously proven. The challenge of proving convergence is inherently difficult due to the highly non-convex nature of the training objective and the implementation of a Stochastic Neural Network (SNN) within VAE architectures. This paper addresses these challenges by characterizing the optimization trajectory of SNNs utilized in VAEs through the lens of Neural Tangent Kernel (NTK) techniques. These techniques govern the optimization and generalization behaviors of ultra-wide neural networks. We provide a mathematical proof of VAE convergence under mild assumptions, thus advancing the theoretical understanding of VAE optimization dynamics. Furthermore, we establish a novel connection between the optimization problem faced by over-parameterized SNNs and the Kernel Ridge Regression (KRR) problem. Our findings not only contribute to the theoretical foundation of VAEs but also open new avenues for investigating the optimization of generative models using advanced kernel methods. Our theoretical claims are verified by experimental simulations.
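The Kernel Ridge Regression (KRR) problem that the paper connects over-parameterized SNN optimization to can be written down compactly. The RBF kernel, data, and hyperparameters below are our illustrative choices, not anything specified in the paper:

```python
import numpy as np

# Minimal kernel ridge regression: solve (K + lam*I) alpha = y, then predict
# with the cross-kernel. Kernel choice (RBF) and settings are assumptions.
def krr_fit_predict(X_tr, y_tr, X_te, gamma=1.0, lam=1e-3):
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    alpha = np.linalg.solve(rbf(X_tr, X_tr) + lam * np.eye(len(X_tr)), y_tr)
    return rbf(X_te, X_tr) @ alpha

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0])
pred = krr_fit_predict(X, y, X)          # in-sample predictions
mse = float(np.mean((pred - y) ** 2))
```

In the NTK regime the kernel would be the fixed tangent kernel of the network rather than an RBF; the algebra of the regression problem is the same.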

[LG-46] TriplePlay: Enhancing Federated Learning with CLIP for Non-IID Data and Resource Efficiency

链接: https://arxiv.org/abs/2409.05347
作者: Ahmed Imteaj,Md Zarif Hossain,Saika Zaman,Abdur R. Shahid
关键词-EN: offer significant opportunities, privacy-preserving artificial intelligence, Federated Learning, offer significant, artificial intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid advancement and increasing complexity of pretrained models, exemplified by CLIP, offer significant opportunities as well as challenges for Federated Learning (FL), a critical component of privacy-preserving artificial intelligence. This research delves into the intricacies of integrating large foundation models like CLIP within FL frameworks to enhance privacy, efficiency, and adaptability across heterogeneous data landscapes. It specifically addresses the challenges posed by non-IID data distributions, the computational and communication overheads of leveraging such complex models, and the skewed representation of classes within datasets. We propose TriplePlay, a framework that integrates CLIP as an adapter to enhance FL’s adaptability and performance across diverse data distributions. This approach addresses the long-tail distribution challenge to ensure fairness while reducing resource demands through quantization and low-rank adaptation techniques. Our simulation results demonstrate that TriplePlay effectively decreases GPU usage costs and speeds up the learning process, achieving convergence with reduced communication overhead.
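Of the resource-reduction techniques mentioned (quantization and low-rank adaptation), the low-rank part can be sketched as follows. The shapes, rank, and zero-initialization convention are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

# Low-rank adaptation sketch: the large weight W stays frozen; only the small
# factors A and B are trained, adding rank*(d_in + d_out) parameters.
class LowRankAdapter:
    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
        self.A = rng.normal(size=(rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))              # zero init: no change at start

    def forward(self, x):
        return self.W @ x + self.B @ (self.A @ x)     # only A, B are updated

adapter = LowRankAdapter(d_in=16, d_out=8)
x = np.ones(16)
out = adapter.forward(x)
```

With B zero-initialized, the adapted model initially reproduces the frozen model exactly, which is the usual convention for such adapters.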

[LG-47] GDFlow: Anomaly Detection with NCDE-based Normalizing Flow for Advanced Driver Assistance System

链接: https://arxiv.org/abs/2409.05346
作者: Kangjun Lee,Minha Kim,Youngho Jun,Simon S. Woo
关键词-EN: Adaptive Cruise Control, Driver Assistance Systems, Advanced Driver Assistance, Cruise Control, Assistance Systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:For electric vehicles, the Adaptive Cruise Control (ACC) in Advanced Driver Assistance Systems (ADAS) is designed to assist braking based on driving conditions, road inclines, predefined deceleration strengths, and user braking patterns. However, the driving data collected during the development of ADAS are generally limited and lack diversity. This deficiency leads to late or aggressive braking for different users. Crucially, it is necessary to effectively identify anomalies, such as unexpected or inconsistent braking patterns in ADAS, especially given the challenge of working with unlabelled, limited, and noisy datasets from real-world electric vehicles. In order to tackle the aforementioned challenges in ADAS, we propose Graph Neural Controlled Differential Equation Normalizing Flow (GDFlow), a model that leverages Normalizing Flow (NF) with Neural Controlled Differential Equations (NCDE) to learn the distribution of normal driving patterns continuously. Compared to the traditional clustering or anomaly detection algorithms, our approach effectively captures the spatio-temporal information from different sensor data and more accurately models continuous changes in driving patterns. Additionally, we introduce a quantile-based maximum likelihood objective to improve the likelihood estimate of the normal data near the boundary of the distribution, enhancing the model’s ability to distinguish between normal and anomalous patterns. We validate GDFlow using real-world electric vehicle driving data that we collected from Hyundai IONIQ5 and GV80EV, achieving state-of-the-art performance compared to six baselines across four dataset configurations of different vehicle types and drivers. Furthermore, our model outperforms the latest anomaly detection methods across four time series benchmark datasets. Our approach demonstrates superior efficiency in inference time compared to existing methods.

[LG-48] Graffin: Stand for Tails in Imbalanced Node Classification

链接: https://arxiv.org/abs/2409.05339
作者: Xiaorui Qi,Yanlong Wen,Xiaojie Yuan
关键词-EN: tail data, tail, GRL, data, Graph representation learning
类目: Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Graph representation learning (GRL) models have succeeded in many scenarios. Real-world graphs have imbalanced distributions, such as node labels and degrees, which leaves a critical challenge to GRL. Imbalanced inputs can lead to imbalanced outputs. However, most existing works ignore this and assume that the distribution of input graphs is balanced, which does not align with real situations and results in worse model performance on tail data. The domination of head data makes tail data underrepresented when training graph neural networks (GNNs). Thus, we propose Graffin, a pluggable tail data augmentation module, to address the above issues. Inspired by recurrent neural networks (RNNs), Graffin flows head features into tail data through graph serialization techniques to alleviate the imbalance of tail representation. The local and global structures are fused to form the node representation under the combined effect of neighborhood and sequence information, which enriches the semantics of tail data. We validate the performance of Graffin on four real-world datasets in node classification tasks. Results show that Graffin can improve the adaptation to tail data without significantly degrading the overall model performance.

[LG-49] A Multi-Modal Deep Learning Based Approach for House Price Prediction

链接: https://arxiv.org/abs/2409.05335
作者: Md Hasebul Hasan,Md Abid Jahan,Mohammed Eunus Ali,Yuan-Fang Li,Timos Sellis
关键词-EN: house, house price, house price prediction, real estate sector, residential real estate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Accurate prediction of house price, a vital aspect of the residential real estate sector, is of substantial interest for a wide range of stakeholders. However, predicting house prices is a complex task due to the significant variability influenced by factors such as house features, location, neighborhood, and many others. Despite numerous attempts utilizing a wide array of algorithms, including recent deep learning techniques, to predict house prices accurately, existing approaches have fallen short of considering a wide range of factors such as textual and visual features. This paper addresses this gap by comprehensively incorporating attributes, such as features, textual descriptions, geo-spatial neighborhood, and house images, typically showcased in real estate listings in a house price prediction system. Specifically, we propose a multi-modal deep learning approach that leverages different types of data to learn more accurate representation of the house. In particular, we learn a joint embedding of raw house attributes, geo-spatial neighborhood, and most importantly from textual description and images representing the house; and finally use a downstream regression model to predict the house price from this jointly learned embedding vector. Our experimental results with a real-world dataset show that the text embedding of the house advertisement description and image embedding of the house pictures in addition to raw attributes and geo-spatial embedding, can significantly improve the house price prediction accuracy. The relevant source code and dataset are publicly accessible at the following URL: this https URL
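The joint-embedding idea above, concatenating per-modality embeddings before a downstream regressor, can be sketched minimally. All dimensions and the linear head are hypothetical stand-ins for the learned encoders and regression model:

```python
import numpy as np

# Conceptual fusion step: one vector per modality (raw attributes, geo-spatial
# neighborhood, text description, house images), concatenated into a joint
# embedding that feeds a downstream regressor.
def joint_embedding(attr_vec, geo_vec, text_vec, img_vec):
    return np.concatenate([attr_vec, geo_vec, text_vec, img_vec])

rng = np.random.default_rng(0)
attrs, geo = rng.normal(size=8), rng.normal(size=4)
text, img = rng.normal(size=16), rng.normal(size=16)
z = joint_embedding(attrs, geo, text, img)
w = rng.normal(size=z.shape[0])     # stand-in for a trained linear head
price = float(w @ z)                # predicted price (illustrative only)
```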

[LG-50] ICPR 2024 Competition on Safe Segmentation of Drive Scenes in Unstructured Traffic and Adverse Weather Conditions ICPR

链接: https://arxiv.org/abs/2409.05327
作者: Furqan Ahmed Shaik,Sandeep Nagar,Aiswarya Maturi,Harshit Kumar Sankhla,Dibyendu Ghosh,Anshuman Majumdar,Srikanth Vidapanakal,Kunal Chaudhary,Sunny Manchanda,Girish Varma
关键词-EN: Adverse Weather Conditions, Weather Conditions served, Adverse Weather, Weather Conditions, Scenes in Unstructured
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures, ICPR Competition Paper

点击查看摘要

Abstract:The ICPR 2024 Competition on Safe Segmentation of Drive Scenes in Unstructured Traffic and Adverse Weather Conditions served as a rigorous platform to evaluate and benchmark state-of-the-art semantic segmentation models under challenging conditions for autonomous driving. Over several months, participants were provided with the IDD-AW dataset, consisting of 5000 high-quality RGB-NIR image pairs, each annotated at the pixel level and captured under adverse weather conditions such as rain, fog, low light, and snow. A key aspect of the competition was the use and improvement of the Safe mean Intersection over Union (Safe mIoU) metric, designed to penalize unsafe incorrect predictions that could be overlooked by traditional mIoU. This innovative metric emphasized the importance of safety in developing autonomous driving systems. The competition showed significant advancements in the field, with participants demonstrating models that excelled in semantic segmentation and prioritized safety and robustness in unstructured and adverse conditions. The results of the competition set new benchmarks in the domain, highlighting the critical role of safety in deploying autonomous vehicles in real-world scenarios. The contributions from this competition are expected to drive further innovation in autonomous driving technology, addressing the critical challenges of operating in diverse and unpredictable environments.
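For reference, the standard mean IoU that Safe mIoU builds on is shown below. The competition's safety weighting is not specified in the summary, so it is deliberately omitted; labels are flat lists and the data is illustrative:

```python
# Plain mean Intersection over Union. Safe mIoU additionally penalizes unsafe
# mispredictions; only the standard metric is sketched here.
def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1]
gt = [0, 1, 1, 1]
miou = mean_iou(pred, gt, 2)   # class 0: IoU 1/2, class 1: IoU 2/3
```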

[LG-51] Sample-Efficient Bayesian Optimization with Transfer Learning for Heterogeneous Search Spaces

链接: https://arxiv.org/abs/2409.05325
作者: Aryan Deshwal,Sait Cakmak,Yuhou Xia,David Eriksson
关键词-EN: Bayesian optimization, sample-efficient optimization, search spaces, Bayesian, black-box functions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) is a powerful approach to sample-efficient optimization of black-box functions. However, in settings with very few function evaluations, a successful application of BO may require transferring information from historical experiments. These related experiments may not have exactly the same tunable parameters (search spaces), motivating the need for BO with transfer learning for heterogeneous search spaces. In this paper, we propose two methods for this setting. The first approach leverages a Gaussian process (GP) model with a conditional kernel to transfer information between different search spaces. Our second approach treats the missing parameters as hyperparameters of the GP model that can be inferred jointly with the other GP hyperparameters or set to fixed values. We show that these two methods perform well on several benchmark problems.
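The second approach, treating parameters missing from a historical search space as values that can be fixed, amounts to lifting old observations into the joint space before fitting the GP. A toy version of that lifting step (the parameter names and fill value are our assumptions):

```python
# Lift historical observations into the current (joint) search space by
# imputing missing parameters with a fixed value; the GP is then fit on the
# lifted points. Imputation could instead be inferred jointly, per the paper.
def lift_to_joint_space(points, param_names, joint_names, fill=0.5):
    lifted = []
    for p in points:
        row = dict(zip(param_names, p))
        lifted.append([row.get(name, fill) for name in joint_names])
    return lifted

hist = [(0.2, 0.9), (0.7, 0.1)]               # old space: (lr, momentum)
joint = ["lr", "momentum", "weight_decay"]    # new space adds weight_decay
lifted = lift_to_joint_space(hist, ["lr", "momentum"], joint)
```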

[LG-52] Tele-LLMs: A Series of Specialized Large Language Models for Telecommunications

链接: https://arxiv.org/abs/2409.05314
作者: Ali Maatouk,Kenny Chirino Ampudia,Rex Ying,Leandros Tassiulas
关键词-EN: natural language processing, impacted various fields, medicine and finance, emergence of large, significantly impacted
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The emergence of large language models (LLMs) has significantly impacted various fields, from natural language processing to sectors like medicine and finance. However, despite their rapid proliferation, the applications of LLMs in telecommunications remain limited, often relying on general-purpose models that lack domain-specific specialization. This lack of specialization results in underperformance, particularly when dealing with telecommunications-specific technical terminology and their associated mathematical representations. This paper addresses this gap by first creating and disseminating Tele-Data, a comprehensive dataset of telecommunications material curated from relevant sources, and Tele-Eval, a large-scale question-and-answer dataset tailored to the domain. Through extensive experiments, we explore the most effective training techniques for adapting LLMs to the telecommunications domain, ranging from examining the division of expertise across various telecommunications aspects to employing parameter-efficient techniques. We also investigate how models of different sizes behave during adaptation and analyze the impact of their training data on this behavior. Leveraging these findings, we develop and open-source Tele-LLMs, the first series of language models ranging from 1B to 8B parameters, specifically tailored for telecommunications. Our evaluations demonstrate that these models outperform their general-purpose counterparts on Tele-Eval while retaining their previously acquired capabilities, thus avoiding the catastrophic forgetting phenomenon.

[LG-53] Closed-Form Interpretation of Neural Network Latent Spaces with Symbolic Gradients

链接: https://arxiv.org/abs/2409.05305
作者: Zakaria Patel,Sebastian J. Wetzel
关键词-EN: neural networks, scientific fields, artificial neural networks, latent spaces, neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:It has been demonstrated in many scientific fields that artificial neural networks like autoencoders or Siamese networks encode meaningful concepts in their latent spaces. However, there does not exist a comprehensive framework for retrieving this information in a human-readable form without prior knowledge. In order to extract these concepts, we introduce a framework for finding closed-form interpretations of neurons in latent spaces of artificial neural networks. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. We interpret these neural networks by finding an intersection between the equivalence class and human-readable equations defined by a symbolic search space. The approach is demonstrated by retrieving invariants of matrices and conserved quantities of dynamical systems from latent spaces of Siamese neural networks.

[LG-54] Resource-Efficient Generative AI Model Deployment in Mobile Edge Networks

链接: https://arxiv.org/abs/2409.05303
作者: Yuxin Liang,Peng Yang,Yuanyuan He,Feng Lyu
关键词-EN: Artificial Intelligence-Generated Content, development of Artificial, Artificial Intelligence-Generated, Intelligence-Generated Content, content creation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The surging development of Artificial Intelligence-Generated Content (AIGC) marks a transformative era of content creation and production. Edge servers promise attractive benefits, e.g., reduced service delay and backhaul traffic load, for hosting AIGC services compared to cloud-based solutions. However, the scarcity of available resources on the edge poses significant challenges in deploying generative AI models. In this paper, by characterizing the resource and delay demands of typical generative AI models, we find that the consumption of storage and GPU memory, as well as the model switching delay represented by I/O delay during the preloading phase, are significant and vary across models. These multidimensional coupling factors render it difficult to make efficient edge model deployment decisions. Hence, we present a collaborative edge-cloud framework aiming to properly manage generative AI model deployment on the edge. Specifically, we formulate the edge model deployment problem, considering the heterogeneous features of models, as an optimization problem, and propose a model-level decision selection algorithm to solve it. It enables pooled resource sharing and optimizes the trade-off between resource consumption and delay in edge generative AI model deployment. Simulation results validate the efficacy of the proposed algorithm compared with baselines, demonstrating its potential to reduce overall costs by providing feature-aware model deployment decisions.
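The model-level deployment decision can be pictured as a budgeted selection problem. The greedy benefit-per-GB rule and all numbers below are purely illustrative, not the paper's algorithm:

```python
# Toy deployment decision: keep generative models resident on the edge under a
# GPU-memory budget, greedily by benefit per GB of memory. Model names, sizes,
# and benefit scores are made up for illustration.
def choose_models(models, budget_gb):
    chosen, used = [], 0.0
    for name, mem, benefit in sorted(models, key=lambda m: m[2] / m[1], reverse=True):
        if used + mem <= budget_gb:
            chosen.append(name)
            used += mem
    return chosen

models = [("sd-xl", 10.0, 8.0), ("sd-1.5", 4.0, 5.0), ("llm-7b", 14.0, 9.0)]
resident = choose_models(models, budget_gb=16.0)
```

A real solver would also weigh the I/O (model-switching) delay the paper measures, which this sketch ignores.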

[LG-55] TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors

链接: https://arxiv.org/abs/2409.05294
作者: Yichuan Mo,Hui Huang,Mingjie Li,Ang Li,Yisen Wang
关键词-EN: achieved notable success, remain highly vulnerable, producing specific undesirable, specific undesirable outputs, image generation
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have achieved notable success in image generation, but they remain highly vulnerable to backdoor attacks, which compromise their integrity by producing specific undesirable outputs when presented with a pre-defined trigger. In this paper, we investigate how to protect diffusion models from this dangerous threat. Specifically, we propose TERD, a backdoor defense framework that builds unified modeling for current attacks, which enables us to derive an accessible reversed loss. A trigger reversion strategy is further employed: an initial approximation of the trigger through noise sampled from a prior distribution, followed by refinement through differential multi-step samplers. Additionally, with the reversed trigger, we propose backdoor detection from the noise space, introducing the first backdoor input detection approach for diffusion models and a novel model detection algorithm that calculates the KL divergence between reversed and benign distributions. Extensive evaluations demonstrate that TERD secures a 100% True Positive Rate (TPR) and True Negative Rate (TNR) across datasets of varying resolutions. TERD also demonstrates nice adaptability to other Stochastic Differential Equation (SDE)-based models. Our code is available at this https URL.
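The model-detection step compares reversed and benign noise distributions via KL divergence. Assuming, purely for illustration, that both are modeled as 1-D Gaussians (the paper works with full noise distributions), the closed form is:

```python
import math

# KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)) in closed form. A backdoored model's
# reversed trigger distribution should diverge from the benign prior (KL > 0).
def kl_gaussian(mu0, sigma0, mu1, sigma1):
    return (math.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2 * sigma1**2)
            - 0.5)

same = kl_gaussian(0.0, 1.0, 0.0, 1.0)      # identical distributions
shifted = kl_gaussian(0.5, 1.0, 0.0, 1.0)   # trigger-shifted mean
```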

[LG-56] Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis

链接: https://arxiv.org/abs/2409.05292
作者: Nirmalya Thakur
关键词-EN: Public Health Emergency, Public Health, Health Emergency, Emergency of International, International Concern
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. No prior work related to social media mining has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper aims to address this research gap and makes two scientific contributions to this field. First, it presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. The dataset, available at this https URL, contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were performed. This process included classifying each post into (i) one of the sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral, (ii) hate or not hate, and (iii) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset. Second, this paper presents the results of performing sentiment analysis, hate speech analysis, and anxiety or stress analysis. The distribution across the sentiment classes - fear, surprise, joy, sadness, anger, disgust, and neutral - was observed to be 27.95%, 2.57%, 8.69%, 5.94%, 2.69%, 1.53%, and 50.64%, respectively. In terms of hate speech detection, 95.75% of the posts did not contain hate and the remaining 4.25% of the posts contained hate. Finally, 72.05% of the posts did not indicate any anxiety/stress, and the remaining 27.95% of the posts represented some form of anxiety/stress.

[LG-57] Towards Fast Rates for Federated and Multi-Task Reinforcement Learning

链接: https://arxiv.org/abs/2409.05291
作者: Feng Zhu,Robert W. Heath Jr.,Aritra Mitra
关键词-EN: Markov Decision Process, Decision Process, Markov Decision, setting involving, agents’ MDPs differ
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: Accepted to the Decision and Control Conference (CDC), 2024

点击查看摘要

Abstract:We consider a setting involving N agents, where each agent interacts with an environment modeled as a Markov Decision Process (MDP). The agents’ MDPs differ in their reward functions, capturing heterogeneous objectives/tasks. The collective goal of the agents is to communicate intermittently via a central server to find a policy that maximizes the average of long-term cumulative rewards across environments. The limited existing work on this topic either only provide asymptotic rates, or generate biased policies, or fail to establish any benefits of collaboration. In response, we propose Fast-FedPG - a novel federated policy gradient algorithm with a carefully designed bias-correction mechanism. Under a gradient-domination condition, we prove that our algorithm guarantees (i) fast linear convergence with exact gradients, and (ii) sub-linear rates that enjoy a linear speedup w.r.t. the number of agents with noisy, truncated policy gradients. Notably, in each case, the convergence is to a globally optimal policy with no heterogeneity-induced bias. In the absence of gradient-domination, we establish convergence to a first-order stationary point at a rate that continues to benefit from collaboration.
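The communication pattern described, intermittent averaging of local policy gradients at a central server, can be sketched in toy form. Fast-FedPG's bias-correction mechanism itself is not reproduced here; this only shows the naive averaged-ascent round it builds on:

```python
# One federated round: each agent computes a local policy gradient on its own
# reward, and the server ascends the average. Gradients and dimensions are toy.
def server_round(theta, local_grads, lr=0.1):
    n = len(local_grads)
    avg = [sum(g[i] for g in local_grads) / n for i in range(len(theta))]
    return [t + lr * a for t, a in zip(theta, avg)]   # ascent on average reward

theta = [0.0, 0.0]
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # three heterogeneous agents
new_theta = server_round(theta, grads)
```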

[LG-58] Efficiently Learning Markov Random Fields from Dynamics

链接: https://arxiv.org/abs/2409.05284
作者: Jason Gaitonde,Ankur Moitra,Elchanan Mossel
关键词-EN: Markov random field, Markov random, undirected graphical model, random field, important task
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 40 pages, 3 figures

点击查看摘要

Abstract:An important task in high-dimensional statistics is learning the parameters or dependency structure of an undirected graphical model, or Markov random field (MRF). Much of the prior work on this problem assumes access to i.i.d. samples from the MRF distribution and state-of-the-art algorithms succeed using n^\Theta(k) runtime, where n is the dimension and k is the order of the interactions. However, well-known reductions from the sparse parity with noise problem imply that given i.i.d. samples from a sparse, order-k MRF, any learning algorithm likely requires n^\Omega(k) time, impeding the potential for significant computational improvements. In this work, we demonstrate that these fundamental barriers for learning MRFs can surprisingly be completely circumvented when learning from natural, dynamical samples. We show that in bounded-degree MRFs, the dependency structure and parameters can be recovered using a trajectory of Glauber dynamics of length O(n \log n) with runtime O(n^2 \log n). The implicit constants depend only on the degree and non-degeneracy parameters of the model, but not the dimension n. In particular, learning MRFs from dynamics is provably computationally easier than learning from i.i.d. samples under standard hardness assumptions.
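The Glauber dynamics that supply the dynamical samples resample one randomly chosen node at a time from its conditional distribution given its neighbors. A minimal Ising-model version (the graph, inverse temperature, and step count are our choices, not the paper's setup):

```python
import math
import random

# One Glauber step on an Ising MRF: pick a node uniformly, then resample its
# spin from P(spin_i = +1 | neighbors) = 1 / (1 + exp(-2*beta*local_field)).
def glauber_step(spins, adj, beta, rng):
    i = rng.randrange(len(spins))
    field = sum(spins[j] for j in adj[i])
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
    spins[i] = 1 if rng.random() < p_plus else -1
    return spins

rng = random.Random(0)
adj = {0: [1], 1: [0, 2], 2: [1]}     # a path on 3 nodes
spins = [1, -1, 1]
for _ in range(1000):
    glauber_step(spins, adj, beta=0.5, rng=rng)
```

The paper's algorithm observes such a trajectory (which node flipped, and to what) rather than i.i.d. draws from the stationary distribution.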

[LG-59] Learning Submodular Sequencing from Samples

链接: https://arxiv.org/abs/2409.05265
作者: Jing Yuan,Shaojie Tang
关键词-EN: sequential submodular maximization, paper addresses, addresses the problem, problem of sequential, optimize some composite
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of sequential submodular maximization: selecting and ranking items in a sequence to optimize some composite submodular function. In contrast to most of the previous works, which assume access to the utility function, we assume that we are given only a set of samples. Each sample includes a random sequence of items and its associated utility. We present an algorithm that, given polynomially many samples drawn from a two-stage uniform distribution, achieves an approximation ratio dependent on the curvature of individual submodular functions. Our results apply in a wide variety of real-world scenarios, such as ranking products in online retail platforms, where complete knowledge of the utility function is often impossible to obtain. Our algorithm gives an empirically useful solution in such contexts, thus proving that limited data can be of great use in sequencing tasks. From a technical perspective, our results extend prior work on "optimization from samples" by generalizing from optimizing a set function to a sequence-dependent function.

[LG-60] Towards Automated Machine Learning Research

链接: https://arxiv.org/abs/2409.05258
作者: Shervin Ardeshir
关键词-EN: Large Language Models, Large Language, automating incremental advances, machine learning research, facilitated by Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper explores a top-down approach to automating incremental advances in machine learning research through component-level innovation, facilitated by Large Language Models (LLMs). Our framework systematically generates novel components, validates their feasibility, and evaluates their performance against existing baselines. A key distinction of this approach lies in how these novel components are generated. Unlike traditional AutoML and NAS methods, which often rely on a bottom-up combinatorial search over predefined, hardcoded base components, our method leverages the cross-domain knowledge embedded in LLMs to propose new components that may not be confined to any hard-coded predefined set. By incorporating a reward model to prioritize promising hypotheses, we aim to improve the efficiency of the hypothesis generation and evaluation process. We hope this approach offers a new avenue for exploration and contributes to the ongoing dialogue in the field.

[LG-61] BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration MICRO2024

链接: https://arxiv.org/abs/2409.05227
作者: Yuzong Chen,Jian Meng,Jae-sun Seo,Mohamed S. Abdelfattah
关键词-EN: ineffectual zero-bit operations, deep learning accelerators, bit-serial deep learning, typically applicable, BBS
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted by IEEE/ACM MICRO 2024

点击查看摘要

Abstract:Bit-level sparsity methods skip ineffectual zero-bit operations and are typically applicable within bit-serial deep learning accelerators. This type of sparsity at the bit-level is especially interesting because it is both orthogonal and compatible with other deep neural network (DNN) efficiency methods such as quantization and pruning. In this work, we improve the practicality and efficiency of bit-level sparsity through a novel algorithmic bit-pruning, averaging, and compression method, and a co-designed efficient bit-serial hardware accelerator. On the algorithmic side, we introduce bidirectional bit sparsity (BBS). The key insight of BBS is that we can leverage bit sparsity in a symmetrical way to prune either zero-bits or one-bits. This significantly improves the load balance of bit-serial computing and guarantees the level of sparsity to be more than 50%. On top of BBS, we further propose two bit-level binary pruning methods that require no retraining, and can be seamlessly applied to quantized DNNs. Combining binary pruning with a new tensor encoding scheme, BBS can both skip computation and reduce the memory footprint associated with bi-directional sparse bit columns. On the hardware side, we demonstrate the potential of BBS through BitVert, a bit-serial architecture with an efficient PE design to accelerate DNNs with low overhead, exploiting our proposed binary pruning. Evaluation on seven representative DNN models shows that our approach achieves: (1) on average 1.66 \times reduction in model size with negligible accuracy loss of 0.5%; (2) up to 3.03 \times speedup and 2.44 \times energy saving compared to prior DNN accelerators.
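The key BBS insight, processing whichever bit polarity is sparser, directly implies that per-weight serial work never exceeds half the bit-width, which is the "more than 50% sparsity" guarantee above. A small model of that counting argument (conceptual only, not the BitVert hardware scheme):

```python
# Per weight, a bit-serial PE can process either the one-bits or, by
# complementing, the zero-bits, so the number of serial terms is
# min(popcount, bits - popcount) <= bits / 2.
def serial_terms(value, bits=8):
    ones = bin(value & ((1 << bits) - 1)).count("1")
    return min(ones, bits - ones)   # process the sparser bit polarity

a = serial_terms(0b00000001)        # one-bit sparse: 1 term
b = serial_terms(0b11111110)        # zero-bit sparse: complement gives 1 term
worst = max(serial_terms(v) for v in range(256))
```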

[LG-62] Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study ECML KDD2024

链接: https://arxiv.org/abs/2409.05215
作者: Emmanouil Panagiotou,Arjun Roy,Eirini Ntoutsi
关键词-EN: Machine Learning, data-driven nature, group imbalances, class and group, group
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at the ECML PKDD 2024, 4th Workshop on Bias and Fairness in AI

点击查看摘要

Abstract:Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness. Although class and group imbalances commonly coincide in real-world tabular datasets, limited methods address this scenario. While most methods use oversampling techniques, like interpolation, to mitigate imbalances, recent advancements in synthetic tabular data generation offer promise but have not been adequately explored for this purpose. To this end, this paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models for synthetic tabular data generation and various sampling strategies. Experimental results on four datasets demonstrate the effectiveness of generative models for bias mitigation, creating opportunities for further exploration in this direction.
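The interpolation-style oversampling that the paper compares against can be illustrated with a generic SMOTE-like sampler: draw convex combinations of random pairs from one under-represented (class, protected-group) cell. This helper is a simplified stand-in, not the generative models studied in the paper:

```python
import numpy as np

def interpolate_oversample(X_cell, n_new, rng):
    """Draw SMOTE-style synthetic rows as convex combinations of random
    pairs from one (class, protected-group) cell of the training data."""
    i = rng.integers(0, len(X_cell), size=n_new)
    j = rng.integers(0, len(X_cell), size=n_new)
    lam = rng.random((n_new, 1))          # per-row mixing coefficient
    return lam * X_cell[i] + (1 - lam) * X_cell[j]

rng = np.random.default_rng(0)
cell = rng.normal(size=(20, 3))           # a minority (class, group) cell
synth = interpolate_oversample(cell, 50, rng)
```

To address class and group imbalance jointly, one would top up every under-represented (class, group) cell this way until all cells match the largest one.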

[LG-63] ICML Topological Deep Learning Challenge 2024: Beyond the Graph Domain ICML2024

链接: https://arxiv.org/abs/2409.05211
作者: Guillermo Bernárdez,Lev Telyatnikov,Marco Montagna,Federica Baccini,Mathilde Papillon,Miquel Ferriol-Galmés,Mustafa Hajij,Theodore Papamarkou,Maria Sofia Bucarelli,Olga Zaghen,Johan Mathe,Audun Myers,Scott Mahan,Hansen Lillemark,Sharvaree Vadgama,Erik Bekkers,Tim Doster,Tegan Emerson,Henry Kvinge,Katrina Agate,Nesreen K Ahmed,Pengfei Bai,Michael Banf,Claudio Battiloro,Maxim Beketov,Paul Bogdan,Martin Carrasco,Andrea Cavallo,Yun Young Choi,George Dasoulas,Matouš Elphick,Giordan Escalona,Dominik Filipiak,Halley Fritze,Thomas Gebhart,Manel Gil-Sorribes,Salvish Goomanee,Victor Guallar,Liliya Imasheva,Andrei Irimia,Hongwei Jin,Graham Johnson,Nikos Kanakaris,Boshko Koloski,Veljko Kovač,Manuel Lecha,Minho Lee,Pierrick Leroy,Theodore Long,German Magai,Alvaro Martinez,Marissa Masden,Sebastian Mežnar,Bertran Miquel-Oliver,Alexis Molina,Alexander Nikitin,Marco Nurisso,Matt Piekenbrock,Yu Qin,Patryk Rygiel,Alessandro Salatiello,Max Schattauer,Pavel Snopov,Julian Suk,Valentina Sánchez,Mauricio Tec,Francesco Vaccarino,Jonas Verhellen,Frederic Wantiez,Alexander Weers,Patrik Zajec,Blaž Škrlj,Nina Miolane
关键词-EN: Geometry-grounded Representation Learning, ICML Topological Deep, Topological Deep Learning, Deep Learning Challenge, ELLIS Workshop
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Proceedings of the Geometry-grounded Representation Learning and Generative Modeling Workshop (GRaM) at ICML 2024

点击查看摘要

Abstract:This paper describes the 2nd edition of the ICML Topological Deep Learning Challenge that was hosted within the ICML 2024 ELLIS Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM). The challenge focused on the problem of representing data in different discrete topological domains in order to bridge the gap between Topological Deep Learning (TDL) and other types of structured datasets (e.g. point clouds, graphs). Specifically, participants were asked to design and implement topological liftings, i.e., mappings between different data structures and topological domains, like hypergraphs or simplicial/cell/combinatorial complexes. The challenge received 52 submissions satisfying all the requirements. This paper introduces the main scope of the challenge, and summarizes the main results and findings.

[LG-64] Influence-based Attributions can be Manipulated

链接: https://arxiv.org/abs/2409.05208
作者: Chhavi Yadav,Ruihan Wu,Kamalika Chaudhuri
关键词-EN: Influence Functions, valuation and fairness, standard tool, tool for attributing, attributing predictions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be systematically tampered with by an adversary. We show that this is indeed possible and provide efficient attacks with backward-friendly implementations. Our work raises questions on the reliability of influence-based attributions under adversarial circumstances.
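For intuition, the influence of up-weighting a training point on a test loss has a closed form when the loss is convex, e.g. ridge regression: I_i = -∇L_test(θ)ᵀ H⁻¹ ∇L_i(θ). The toy sketch below is our own illustration of this standard formula, not the attack code from the paper:

```python
import numpy as np

def influence_scores(X, y, x_test, y_test, lam=1e-2):
    """Influence functions for ridge regression with squared loss:
    I_i = -grad_test(theta)^T H^{-1} grad_i(theta), approximating how
    up-weighting training point i would change the test loss."""
    d = X.shape[1]
    H = X.T @ X + lam * np.eye(d)               # Hessian of the regularized loss
    theta = np.linalg.solve(H, X.T @ y)         # exact ridge solution
    g_test = 2 * (x_test @ theta - y_test) * x_test
    g_train = 2 * (X @ theta - y)[:, None] * X  # per-sample loss gradients
    return -(g_train @ np.linalg.solve(H, g_test))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
scores = influence_scores(X, y, X[0], y[0])     # influence on test point = train point 0
```

Since H is positive definite, a point's influence on its own loss, -g₀ᵀH⁻¹g₀, is always non-positive: up-weighting a point can only help its own fit.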

[LG-65] Low Latency Transformer Inference on FPGAs for Physics Applications with hls4ml

链接: https://arxiv.org/abs/2409.05207
作者: Zhixing Jiang,Dennis Yin,Yihui Chen,Elham E Khoda,Scott Hauck,Shih-Chieh Hsu,Ekaterina Govorkova,Philip Harris,Vladimir Loncar,Eric A. Moreno
关键词-EN: Field-Programmable Gate Arrays, Gate Arrays, Field-Programmable Gate, study presents, presents an efficient
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study presents an efficient implementation of transformer architectures in Field-Programmable Gate Arrays (FPGAs) using hls4ml. We demonstrate the strategy for implementing the multi-head attention, softmax, and normalization layers and evaluate three distinct models. Their deployment on a VU13P FPGA chip achieved latencies of less than 2 µs, demonstrating the potential for real-time applications. hls4ml's compatibility with any TensorFlow-built transformer model further enhances the scalability and applicability of this work. Index Terms: FPGAs, machine learning, transformers, high energy physics, LIGO

[LG-66] SEF: A Method for Computing Prediction Intervals by Shifting the Error Function in Neural Networks

链接: https://arxiv.org/abs/2409.05206
作者: E. V. Aretos,D. G. Sotiropoulos
关键词-EN: Neural Networks, neural network predictions, today era, scientific fields, Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: The paper has been accepted at the 2024 International Conference on Computer and Applications (ICCA24), Cairo, Egypt, December 17-19, 2024. this https URL

点击查看摘要

Abstract:In today’s era, Neural Networks (NN) are applied in various scientific fields such as robotics, medicine, engineering, etc. However, the predictions of neural networks themselves contain a degree of uncertainty that must always be taken into account before any decision is made. This is why many researchers have focused on developing different ways to quantify the uncertainty of neural network predictions. Some of these methods are based on generating prediction intervals (PI) via neural networks for the requested target values. The SEF (Shifting the Error Function) method presented in this paper is a new method that belongs to this category of methods. The proposed approach involves training a single neural network three times, thus generating an estimate along with the corresponding upper and lower bounds for a given problem. A pivotal aspect of the method is the calculation of a parameter from the initial network’s estimates, which is then integrated into the loss functions of the other two networks. This innovative process effectively produces PIs, resulting in a robust and efficient technique for uncertainty quantification. To evaluate the effectiveness of our method, a comparison in terms of successful PI generation between the SEF, PI3NN and PIVEN methods was made using two synthetic datasets.
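The three-pass recipe (fit, derive a shift from the residuals, refit toward shifted targets) can be mimicked with any regressor. The sketch below uses a plain least-squares line in place of a neural network and shifts the targets rather than integrating the parameter into the loss function as SEF does, so it is only a rough illustration; the 0.95 quantile is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (200, 1))
y = 2 * x[:, 0] + 0.1 * rng.standard_normal(200)

def fit_linear(x_in, t):
    """Least-squares line fit, standing in for one trained network."""
    A = np.hstack([x_in, np.ones((len(x_in), 1))])
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return lambda q: np.hstack([q, np.ones((len(q), 1))]) @ w

f_mid = fit_linear(x, y)                 # pass 1: point estimate
resid = y - f_mid(x)
s = np.quantile(np.abs(resid), 0.95)     # shift parameter from pass 1 residuals
f_up = fit_linear(x, y + s)              # pass 2: upper-bound model
f_lo = fit_linear(x, y - s)              # pass 3: lower-bound model
coverage = np.mean((f_lo(x) <= y) & (y <= f_up(x)))
```

For a linear least-squares fit, shifting the targets by a constant shifts the whole fit by exactly that constant, so the interval width is 2s and the empirical coverage matches the quantile level by construction.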

[LG-67] A Survey on Mixup Augmentations and Beyond

链接: https://arxiv.org/abs/2409.05202
作者: Xin Jin,Hongyu Zhu,Siyuan Li,Zedong Wang,Zicheng Liu,Chang Yu,Huafeng Qin,Stan Z. Li
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, achieved thrilling breakthroughs, garnered increasing attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint V1 with 27 pages main text. Online project at this https URL

点击查看摘要

Abstract:As Deep Neural Networks have achieved thrilling breakthroughs in the past decade, data augmentations have garnered increasing attention as regularization techniques when massive labeled data are unavailable. Among existing augmentations, Mixup and relevant data-mixing methods that convexly combine selected samples and the corresponding labels are widely adopted because they yield high performances by generating data-dependent virtual data while easily migrating to various domains. This survey presents a comprehensive review of foundational mixup methods and their applications. We first elaborate on the training pipeline with mixup augmentations as a unified framework containing modules. A reformulated framework could contain various mixup methods and give intuitive operational procedures. Then, we systematically investigate the applications of mixup augmentations on vision downstream tasks, various data modalities, and some analysis & theorems of mixup. Meanwhile, we conclude the current status and limitations of mixup research and point out further work for effective and efficient mixup augmentations. This survey can provide researchers with the current state of the art in mixup methods and provide some insights and guidance roles in the mixup arena. An online project with this survey is available at this https URL.
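The foundational operation the survey builds on is nearly a one-liner: sample a Beta-distributed coefficient and convexly combine two samples and their one-hot labels. A minimal sketch of vanilla input mixup only, none of the surveyed variants:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Vanilla mixup: lambda ~ Beta(alpha, alpha), then convexly combine
    both the inputs and the one-hot labels with the same lambda."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
x1, x2 = np.zeros(4), np.ones(4)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
xm, ym = mixup(x1, y1, x2, y2, rng=rng)   # virtual sample with soft label
```

With a small alpha the Beta distribution is U-shaped, so most virtual samples stay close to one of the two originals, which is what keeps mixup's regularization mild.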

[LG-68] Lung-DETR: Deformable Detection Transformer for Sparse Lung Nodule Anomaly Detection

链接: https://arxiv.org/abs/2409.05200
作者: Hooman Ramezani,Dionne Aleman,Daniel Létourneau
关键词-EN: Accurate lung nodule, real-world settings due, Accurate lung, computed tomography, scan imagery
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate lung nodule detection for computed tomography (CT) scan imagery is challenging in real-world settings due to the sparse occurrence of nodules and similarity to other anatomical structures. In a typical positive case, nodules may appear in as few as 3% of CT slices, complicating detection. To address this, we reframe the problem as an anomaly detection task, targeting rare nodule occurrences in a predominantly normal dataset. We introduce a novel solution leveraging custom data preprocessing and Deformable Detection Transformer (Deformable-DETR). A 7.5mm Maximum Intensity Projection (MIP) is utilized to combine adjacent lung slices into single images, reducing the slice count and decreasing nodule sparsity. This enhances spatial context, allowing for better differentiation between nodules and other structures such as complex vascular structures and bronchioles. Deformable-DETR is employed to detect nodules, with a custom focal loss function to better handle the imbalanced dataset. Our model achieves state-of-the-art performance on the LUNA16 dataset with an F1 score of 94.2% (95.2% recall, 93.3% precision) on a dataset sparsely populated with lung nodules that is reflective of real-world clinical data.
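The MIP preprocessing step is simple to sketch: collapse each group of adjacent axial CT slices into one image with a per-pixel maximum. The slab size and stride below are illustrative parameters; the paper specifies a 7.5 mm slab in physical units rather than a fixed slice count:

```python
import numpy as np

def max_intensity_projection(volume, slab, stride=None):
    """Collapse each slab of adjacent axial slices into a single image by
    taking the per-pixel maximum, reducing slice count and nodule sparsity."""
    stride = stride or slab
    return np.stack([volume[i:i + slab].max(axis=0)
                     for i in range(0, len(volume) - slab + 1, stride)])

vol = np.zeros((10, 4, 4))
vol[7, 2, 3] = 5.0            # a bright "nodule" voxel in slice 7
mip = max_intensity_projection(vol, slab=5, stride=5)
```

A bright voxel anywhere in a slab survives into the projected image, which is why MIP makes small nodules visible in a much shorter stack of slices.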

[LG-69] Exploring Fungal Morphology Simulation and Dynamic Light Containment from a Graphics Generation Perspective SIGGRAPH

链接: https://arxiv.org/abs/2409.05171
作者: Kexin Wang,Ivy He,Jinke Li,Ali Asadipour,Yitong Sun
关键词-EN: considered crucial techniques, Bio-Art creation, control are considered, considered crucial, crucial techniques
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Siggraph Asia 2024 Art Paper

点击查看摘要

Abstract:Fungal simulation and control are considered crucial techniques in Bio-Art creation. However, coding algorithms for reliable fungal simulations have posed significant challenges for artists. This study equates fungal morphology simulation to a two-dimensional graphic time-series generation problem. We propose a zero-coding, neural network-driven cellular automaton. Fungal spread patterns are learned through an image segmentation model and a time-series prediction model, which then supervise the training of neural network cells, enabling them to replicate real-world spreading behaviors. We further implemented dynamic containment of fungal boundaries with lasers. Synchronized with the automaton, the fungus successfully spreads into pre-designed complex shapes in reality.

[LG-70] Can OOD Object Detectors Learn from Foundation Models?

链接: https://arxiv.org/abs/2409.05162
作者: Jiahui Liu,Xin Wen,Shizhen Zhao,Yingxian Chen,Xiaojuan Qi
关键词-EN: challenging task due, OOD object detection, OOD, open-set OOD data, enhancing OOD object
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:Out-of-distribution (OOD) object detection is a challenging task due to the absence of open-set OOD data. Inspired by recent advancements in text-to-image generative models, such as Stable Diffusion, we study the potential of generative models trained on large-scale open-set data to synthesize OOD samples, thereby enhancing OOD object detection. We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundation models to automatically extract meaningful OOD data from text-to-image generative models. This offers the model access to open-world knowledge encapsulated within off-the-shelf foundation models. The synthetic OOD samples are then employed to augment the training of a lightweight, plug-and-play OOD detector, thus effectively optimizing the in-distribution (ID)/OOD decision boundaries. Extensive experiments across multiple benchmarks demonstrate that SyncOOD significantly outperforms existing methods, establishing new state-of-the-art performance with minimal synthetic data usage.

[LG-71] OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs

链接: https://arxiv.org/abs/2409.05152
作者: Jintian Zhang,Cheng Peng,Mengshu Sun,Xiang Chen,Lei Liang,Zhiqiang Zhang,Jun Zhou,Huajun Chen,Ningyu Zhang
关键词-EN: Large Language Models, Language Models, Large Language, advancements in Large, directly handling retrieval
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Work in progress; code is available at this https URL

点击查看摘要

Abstract:Despite the recent advancements in Large Language Models (LLMs), which have significantly enhanced the generative capabilities for various NLP tasks, LLMs still face limitations in directly handling retrieval tasks. However, many practical applications demand the seamless integration of both retrieval and generation. This paper introduces a novel and efficient One-pass Generation and retrieval framework (OneGen), designed to improve LLMs’ performance on tasks that require both generation and retrieval. The proposed framework bridges the traditionally separate training approaches for generation and retrieval by incorporating retrieval tokens generated autoregressively. This enables a single LLM to handle both tasks simultaneously in a unified forward pass. We conduct experiments on two distinct types of composite tasks, RAG and Entity Linking, to validate the pluggability, effectiveness, and efficiency of OneGen in training and inference. Furthermore, our results show that integrating generation and retrieval within the same context preserves the generative capabilities of LLMs while improving retrieval performance. To the best of our knowledge, OneGen is the first to enable LLMs to conduct vector retrieval during the generation.

[LG-72] Imputation of Time-varying Edge Flows in Graphs by Multilinear Kernel Regression and Manifold Learning

链接: https://arxiv.org/abs/2409.05135
作者: Duc Thien Nguyen,Konstantinos Slavakis,Dimitris Pados
关键词-EN: recently developed framework, multilinear kernel regression, paper extends, extends the recently, recently developed
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper extends the recently developed framework of multilinear kernel regression and imputation via manifold learning (MultiL-KRIM) to impute time-varying edge flows in a graph. MultiL-KRIM uses simplicial-complex arguments and Hodge Laplacians to incorporate the graph topology, and exploits manifold-learning arguments to identify latent geometries within features which are modeled as a point-cloud around a smooth manifold embedded in a reproducing kernel Hilbert space (RKHS). Following the concept of tangent spaces to smooth manifolds, linear approximating patches are used to add a collaborative-filtering flavor to the point-cloud approximations. Together with matrix factorizations, MultiL-KRIM effects dimensionality reduction, and enables efficient computations, without any training data or additional information. Numerical tests on real-network time-varying edge flows demonstrate noticeable improvements of MultiL-KRIM over several state-of-the-art schemes.

[LG-73] MaxCutPool: differentiable feature-aware Maxcut for pooling in graph neural networks

链接: https://arxiv.org/abs/2409.05100
作者: Carlo Abate,Filippo Maria Bianchi
关键词-EN: nodes and edges, MAXCUT, Graph Neural Networks, texttt, attributed graphs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel approach to compute the MAXCUT in attributed graphs, i.e., graphs with features associated with nodes and edges. Our approach is robust to the underlying graph topology and is fully differentiable, making it possible to find solutions that jointly optimize the MAXCUT along with other objectives. Based on the obtained MAXCUT partition, we implement a hierarchical graph pooling layer for Graph Neural Networks, which is sparse, differentiable, and particularly suitable for downstream tasks on heterophilic graphs.
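The differentiable-relaxation idea can be illustrated by replacing hard ±1 node assignments with tanh scores and running gradient ascent on the soft cut value. This tiny NumPy sketch is our own illustration of the general technique, not the paper's feature-aware pooling layer:

```python
import numpy as np

def relaxed_cut(theta, W):
    """Soft MAXCUT objective: nodes get scores s = tanh(theta) in (-1, 1),
    and 0.25 * sum_ij W_ij (1 - s_i s_j) is differentiable in theta."""
    s = np.tanh(theta)
    return 0.25 * np.sum(W * (1 - np.outer(s, s)))

def cut_gradient(theta, W):
    s = np.tanh(theta)
    return -0.5 * (W @ s) * (1 - s ** 2)   # chain rule through tanh

# 4-cycle 0-1-3-2-0: the optimal cut separates {0, 3} from {1, 2}
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
theta = np.array([0.1, -0.1, -0.2, 0.3])
for _ in range(200):
    theta += 0.5 * cut_gradient(theta, W)  # gradient ascent on the soft cut

part = np.sign(np.tanh(theta))             # harden to a +/-1 partition
hard_cut = 0.25 * np.sum(W * (1 - np.outer(part, part)))
```

Because the objective is differentiable, extra terms (e.g. a downstream GNN loss) can be added and optimized jointly, which is the property the paper exploits for pooling.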

[LG-74] The first Cadenza challenges: using machine learning competitions to improve music for listeners with a hearing loss

链接: https://arxiv.org/abs/2409.05095
作者: Gerardo Roa Dabike,Michael A. Akeroyd,Scott Bannister,Jon P. Barker,Trevor J. Cox,Bruno Fazenda,Jennifer Firth,Simone Graetzer,Alinka Greasley,Rebecca R. Vos,William M. Whitmer
关键词-EN: universal solution, hearing loss, Hybrid Demucs, hearing, Audio Quality Index
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:It is well established that listening to music is an issue for those with hearing loss, and hearing aids are not a universal solution. How can machine learning be used to address this? This paper details the first application of the open challenge methodology to use machine learning to improve audio quality of music for those with hearing loss. The first challenge was a stand-alone competition (CAD1) and had 9 entrants. The second was a 2024 ICASSP grand challenge (ICASSP24) and attracted 17 entrants. The challenge tasks concerned demixing and remixing pop/rock music to allow a personalised rebalancing of the instruments in the mix, along with amplification to correct for raised hearing thresholds. The software baselines provided for entrants to build upon used two state-of-the-art demix algorithms: Hybrid Demucs and Open-Unmix. Evaluation of systems was done using the objective metric HAAQI, the Hearing-Aid Audio Quality Index. No entrants improved on the best baseline in CAD1 because there was insufficient room for improvement. Consequently, for ICASSP24 the scenario was made more difficult by using loudspeaker reproduction and specified gains to be applied before remixing. This also made the scenario more useful for listening through hearing aids. 9 entrants scored better than the best ICASSP24 baseline. Most entrants used a refined version of Hybrid Demucs and NAL-R amplification. The highest scoring system combined the outputs of several demixing algorithms in an ensemble approach. These challenges are now open benchmarks for future research with the software and data being freely available.

[LG-75] Adaptive k-nearest neighbor classifier based on the local estimation of the shape operator

链接: https://arxiv.org/abs/2409.05084
作者: Alexandre Luís Magalhães Levada,Frank Nielsen,Michel Ferreira Cardia Haddad
关键词-EN: nonparametric classification, local Gaussian curvature, nearest neighbor, local, popular methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:The k-nearest neighbor (k-NN) algorithm is one of the most popular methods for nonparametric classification. However, a relevant limitation concerns the definition of the number of neighbors k. This parameter exerts a direct impact on several properties of the classifier, such as the bias-variance tradeoff, smoothness of decision boundaries, robustness to noise, and class imbalance handling. In the present paper, we introduce a new adaptive k-nearest neighbours (kK-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size. The rationale is that points with low curvature could have larger neighborhoods (locally, the tangent space approximates well the underlying data shape), whereas points with high curvature could have smaller neighborhoods (locally, the tangent space is a loose approximation). We estimate the local Gaussian curvature by computing an approximation to the local shape operator in terms of the local covariance matrix as well as the local Hessian matrix. Results on many real-world datasets indicate that the new kK-NN algorithm yields superior balanced accuracy compared to the established k-NN method and also another adaptive k-NN algorithm. This is particularly evident when the number of samples in the training data is limited, suggesting that the kK-NN is capable of learning more discriminant functions with less data in many relevant cases.
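The curvature-to-neighborhood mapping can be approximated with just a local covariance spectrum: when one eigenvalue dominates, the patch is nearly flat and the neighborhood can be widened. This proxy (eigenvalue concentration instead of the paper's shape-operator estimate) is a deliberate simplification, and all thresholds are illustrative:

```python
import numpy as np

def adaptive_k(X, idx, k_min=3, k_max=15, probe=10):
    """Curvature proxy via the local covariance spectrum: if the energy
    concentrates in the top eigenvalue, the patch is nearly flat and we
    grant a larger neighborhood; otherwise we shrink it."""
    d = np.linalg.norm(X - X[idx], axis=1)
    nbr = np.argsort(d)[1:probe + 1]          # probe neighbors, excluding idx
    ev = np.sort(np.linalg.eigvalsh(np.cov(X[nbr].T)))[::-1]
    flatness = ev[0] / (ev.sum() + 1e-12)     # ~1 on a flat 1-D manifold
    return int(round(k_min + flatness * (k_max - k_min)))

t = np.linspace(0, 1, 50)
line = np.column_stack([t, 2 * t])            # zero-curvature manifold
rng = np.random.default_rng(0)
blob = rng.normal(size=(50, 2))               # isotropic cloud, no flat direction
k_line = adaptive_k(line, 25)
k_blob = adaptive_k(blob, 0)
```

Points on the line get the maximal neighborhood, while points in the isotropic blob get a smaller one, matching the paper's rationale.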

[LG-76] From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems

链接: https://arxiv.org/abs/2409.05080
作者: Constance Douwes,Romain Serizel
关键词-EN: machine learning models, environmental impact, machine learning, raised serious concerns, learning models
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:The massive use of machine learning models, particularly neural networks, has raised serious concerns about their environmental impact. Indeed, over the last few years we have seen an explosion in the computing costs associated with training and deploying these systems. It is, therefore, crucial to understand their energy requirements in order to better integrate them into the evaluation of models, which has so far focused mainly on performance. In this paper, we study several neural network architectures that are key components of sound event detection systems, using an audio tagging task as an example. We measure the energy consumption for training and testing small to large architectures and establish complex relationships between the energy consumption, the number of floating-point operations, the number of parameters, and the GPU/memory utilization.

[LG-77] A General Framework for Clustering and Distribution Matching with Bandit Feedback

链接: https://arxiv.org/abs/2409.05072
作者: Recep Can Yavas,Yuqi Huang,Vincent Y. F. Tan,Jonathan Scarlett
关键词-EN: arm pulls, arm, pulls, arms, bandit feedback
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages, submitted to Information Theory Transactions in September 2024

点击查看摘要

Abstract:We develop a general framework for clustering and distribution matching problems with bandit feedback. We consider a K-armed bandit model where some subset of the K arms is partitioned into M groups. Within each group, the random variable associated to each arm follows the same distribution on a finite alphabet. At each time step, the decision maker pulls an arm and observes its outcome from the random variable associated to that arm. Subsequent arm pulls depend on the history of arm pulls and their outcomes. The decision maker has no knowledge of the distributions of the arms or the underlying partitions. The task is to devise an online algorithm to learn the underlying partition of arms with the least number of arm pulls on average and with an error probability not exceeding a pre-determined value \delta. Several existing problems, including finding M pairs of arms, odd arm identification, and M-ary clustering of K arms, fall under our general framework. We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding \delta. Furthermore, we develop a computationally-efficient online algorithm based on the Track-and-Stop method and the Frank–Wolfe algorithm, and show that the average number of arm pulls of our algorithm asymptotically matches that of the lower bound. Our refined analysis also uncovers a novel bound on the speed at which the average number of arm pulls of our algorithm converges to the fundamental limit as \delta vanishes.

[LG-78] Lepskii Principle for Distributed Kernel Ridge Regression

链接: https://arxiv.org/abs/2409.05070
作者: Shao-Bo Lin
关键词-EN: communicating local data, distributively stored data, tackling distributively stored, Parameter selection, Lepskii principle
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter selection without communicating local data is quite challenging in distributed learning, exhibiting an inconsistency between its theoretical analysis and its practical application to distributively stored data. Motivated by the recently developed Lepskii principle and a non-privacy communication protocol for kernel learning, we propose a Lepskii principle to equip distributed kernel ridge regression (DKRR) and consequently develop an adaptive DKRR with Lepskii principle (Lep-AdaDKRR for short) by using a double weighted averaging synthesization scheme. We deduce optimal learning rates for Lep-AdaDKRR and theoretically show that Lep-AdaDKRR succeeds in adapting to the regularity of regression functions, the effective-dimension decay rate of kernels, and different metrics of generalization, which closes the aforementioned gap between theory and application.

[LG-79] Some Results on Neural Network Stability Consistency and Convergence: Insights into Non-IID Data High-Dimensional Settings and Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2409.05030
作者: Ronald Katende,Henry Kasumba,Godwin Kakuba,John M. Mango
关键词-EN: paper addresses critical, addresses critical challenges, neural networks, non-IID data, paper addresses
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper addresses critical challenges in machine learning, particularly the stability, consistency, and convergence of neural networks under non-IID data, distribution shifts, and high-dimensional settings. We provide new theoretical results on uniform stability for neural networks with dynamic learning rates in non-convex settings. Further, we establish consistency bounds for federated learning models in non-Euclidean spaces, accounting for distribution shifts and curvature effects. For Physics-Informed Neural Networks (PINNs), we derive stability, consistency, and convergence guarantees for solving Partial Differential Equations (PDEs) in noisy environments. These results fill significant gaps in understanding model behavior in complex, non-ideal conditions, paving the way for more robust and reliable machine learning applications.

[LG-80] Sequential Recommendation via Adaptive Robust Attention with Multi-dimensional Embeddings

链接: https://arxiv.org/abs/2409.05022
作者: Linsey Pang,Amir Hossein Raffiee,Wei Liu,Keld Lundgaard
关键词-EN: Abstract, Sequential recommendation, Sequential, achieved, significant accuracy boost
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequential recommendation models have achieved state-of-the-art performance using self-attention mechanism. It has since been found that moving beyond only using item ID and positional embeddings leads to a significant accuracy boost when predicting the next item. In recent literature, it was reported that a multi-dimensional kernel embedding with temporal contextual kernels to capture users’ diverse behavioral patterns results in a substantial performance improvement. In this study, we further improve the sequential recommender model’s robustness and generalization by introducing a mix-attention mechanism with a layer-wise noise injection (LNI) regularization. We refer to our proposed model as adaptive robust sequential recommendation framework (ADRRec), and demonstrate through extensive experiments that our model outperforms existing self-attention architectures.
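Layer-wise noise injection itself is straightforward to sketch: perturb each layer's activations with Gaussian noise during training only. Everything here (the toy ReLU layers, shapes, and sigma) is an illustrative stand-in for the paper's transformer setting, not the ADRRec architecture:

```python
import numpy as np

def forward_with_lni(x, layers, sigma=0.05, rng=None, train=True):
    """Feed-forward pass with layer-wise noise injection (LNI): Gaussian
    noise is added to every layer's activations at train time as a
    regularizer and disabled at inference."""
    h = x
    for W in layers:
        h = np.maximum(h @ W, 0.0)            # toy ReLU layer
        if train and rng is not None:
            h = h + sigma * rng.standard_normal(h.shape)
    return h

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in range(3)]
x = rng.normal(size=(4, 8))
clean = forward_with_lni(x, layers, train=False)
noisy = forward_with_lni(x, layers, rng=np.random.default_rng(1), train=True)
```

The train/inference switch mirrors dropout: the perturbation regularizes during training but leaves predictions deterministic at serving time.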

[LG-81] DynamicFL: Federated Learning with Dynamic Communication Resource Allocation

链接: https://arxiv.org/abs/2409.04986
作者: Qi Le,Enmao Diao,Xinran Wang,Vahid Tarokh,Jie Ding,Ali Anwar
关键词-EN: collaborative machine learning, machine learning framework, Federated Stochastic Gradient, machine learning, Federated Learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a collaborative machine learning framework that allows multiple users to train models utilizing their local data in a distributed manner. However, considerable statistical heterogeneity in local data across devices often leads to suboptimal model performance compared with independently and identically distributed (IID) data scenarios. In this paper, we introduce DynamicFL, a new FL framework that investigates the trade-offs between global model performance and communication costs for two widely adopted FL methods: Federated Stochastic Gradient Descent (FedSGD) and Federated Averaging (FedAvg). Our approach allocates diverse communication resources to clients based on their data statistical heterogeneity, considering communication resource constraints, and attains substantial performance enhancements compared to uniform communication resource allocation. Notably, our method bridges the gap between FedSGD and FedAvg, providing a flexible framework leveraging communication heterogeneity to address statistical heterogeneity in FL. Through extensive experiments, we demonstrate that DynamicFL surpasses current state-of-the-art methods with up to a 10% increase in model accuracy, demonstrating its adaptability and effectiveness in tackling data statistical heterogeneity challenges.
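The FedAvg server step that DynamicFL builds on is just a data-size-weighted average of client models; FedSGD differs only in averaging gradients every step rather than weights every round. A minimal sketch of the aggregation:

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg server step: average client parameter vectors weighted by
    each client's local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    coefs = sizes / sizes.sum()
    return sum(c * np.asarray(w) for c, w in zip(coefs, client_weights))

# Client with 30 samples pulls the average 3x harder than one with 10.
global_w = fedavg_aggregate([np.array([0.0, 0.0]),
                             np.array([2.0, 4.0])], [30, 10])
```

DynamicFL's contribution sits on top of this: deciding, per client, how often to communicate (the FedSGD end) versus how long to train locally (the FedAvg end) based on each client's statistical heterogeneity.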

[LG-82] Enhancing Convolutional Neural Networks with Higher-Order Numerical Difference Methods

链接: https://arxiv.org/abs/2409.04977
作者: Qi Wang,Zijun Gao,Mingxiu Sui,Taiyuan Mei,Xiaohan Cheng,Iris Li
关键词-EN: Convolutional Neural Networks, Convolutional Neural, deep learning technology, practical applications, real-world problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the rise of deep learning technology in practical applications, Convolutional Neural Networks (CNNs) have been able to assist humans in solving many real-world problems. To enhance the performance of CNNs, numerous network architectures have been explored. Some of these architectures are designed based on the accumulated experience of researchers over time, while others are designed through neural architecture search methods. The improvements made to CNNs by the aforementioned methods are quite significant, but most of the improvement methods are limited in reality by model size and environmental constraints, making it difficult to fully realize the improved performance. In recent years, research has found that many CNN structures can be explained by the discretization of ordinary differential equations. This implies that we can design theoretically supported deep network structures using higher-order numerical difference methods. It should be noted that most of the previous CNN model structures are based on low-order numerical methods. Therefore, considering that the accuracy of linear multi-step numerical difference methods is higher than that of the forward Euler method, this paper proposes a stacking scheme based on the linear multi-step method. This scheme enhances the performance of ResNet without increasing the model size and compares it with the Runge-Kutta scheme. The experimental results show that the performance of the stacking scheme proposed in this paper is superior to existing stacking schemes (ResNet and HO-ResNet), and it has the capability to be extended to other types of neural networks.
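The ODE view makes the proposal concrete: a ResNet block x_{n+1} = x_n + h f(x_n) is a forward-Euler step, while a two-step linear multi-step (Adams-Bashforth) update reads x_{n+1} = x_n + h (3/2 f(x_n) - 1/2 f(x_{n-1})). On a toy ODE the higher-order scheme is visibly more accurate; f below stands in for a residual branch and is not the paper's network:

```python
import numpy as np

def f(x):
    return -x   # toy dynamics dx/dt = -x, standing in for a residual branch

def euler(x0, h, steps):
    """Plain ResNet-style (forward Euler) stacking."""
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x

def ab2(x0, h, steps):
    """Two-step Adams-Bashforth stacking: each update reuses the
    previous block's residual output, bootstrapped with one Euler step."""
    xp, x = x0, x0 + h * f(x0)
    for _ in range(steps - 1):
        x, xp = x + h * (1.5 * f(x) - 0.5 * f(xp)), x
    return x

h, T = 0.1, 1.0
exact = np.exp(-T)                       # true solution of dx/dt = -x at t = 1
e_euler = abs(euler(1.0, h, 10) - exact)
e_ab2 = abs(ab2(1.0, h, 10) - exact)
```

Crucially, the multi-step scheme only reweights existing residual outputs, so it costs no extra parameters, which is the paper's argument for improving ResNet without growing the model.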

[LG-83] PatchAlign: Fair and Accurate Skin Disease Image Classification by Alignment with Clinical Labels MICCAI2024

链接: https://arxiv.org/abs/2409.04975
作者: Aayushman,Hemanth Gaddey,Vidhi Mittal,Manisha Chawla,Gagan Raj Gupta
关键词-EN: achieved great success, Graph Optimal Transport, Deep learning models, Deep learning, Masked Graph Optimal
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024. Early Accept Paper (amongst the top 11% of 2869 papers submitted)

点击查看摘要

Abstract:Deep learning models have achieved great success in automating skin lesion diagnosis. However, the ethnic disparity in these models’ predictions needs to be addressed before deploying them. We introduce a novel approach, PatchAlign, to enhance skin condition image classification accuracy and fairness by aligning with clinical text representations of skin conditions. PatchAlign uses Graph Optimal Transport (GOT) Loss as a regularizer to perform cross-domain alignment. The representations obtained are robust and generalize well across skin tones, even with limited training samples. To reduce the effect of noise and artifacts in clinical dermatology images, we propose a learnable Masked Graph Optimal Transport for cross-domain alignment that further improves fairness metrics. We compare our model to the state-of-the-art FairDisCo on two skin lesion datasets with different skin types: Fitzpatrick17k and Diverse Dermatology Images (DDI). PatchAlign enhances the accuracy of skin condition image classification by 2.8% (in-domain) and 6.2% (out-domain) on Fitzpatrick17k, and 4.2% (in-domain) on DDI compared to FairDisCo. Additionally, it consistently improves the fairness of true positive rates across skin tones. The source code for the implementation is available at the following GitHub repository: this https URL, enabling easy reproduction and further experimentation.

[LG-84] Balancing Security and Accuracy: A Novel Federated Learning Approach for Cyberattack Detection in Blockchain Networks

链接: https://arxiv.org/abs/2409.04972
作者: Tran Viet Khoa,Mohammad Abu Alsheikh,Yibeltal Alem,Dinh Thai Hoang
关键词-EN: Collaborative Cyberattack Detection, Collaborative Cyberattack, blockchain-based data-sharing networks, federated learning models, Cyberattack Detection
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:This paper presents a novel Collaborative Cyberattack Detection (CCD) system aimed at enhancing the security of blockchain-based data-sharing networks by addressing the complex challenges associated with noise addition in federated learning models. Leveraging the theoretical principles of differential privacy, our approach strategically integrates noise into trained sub-models before reconstructing the global model through transmission. We systematically explore the effects of various noise types, i.e., Gaussian, Laplace, and Moment Accountant, on key performance metrics, including attack detection accuracy, deep learning model convergence time, and the overall runtime of global model generation. Our findings reveal the intricate trade-offs between ensuring data privacy and maintaining system performance, offering valuable insights into optimizing these parameters for diverse CCD environments. Through extensive simulations, we provide actionable recommendations for achieving an optimal balance between data protection and system efficiency, contributing to the advancement of secure and reliable blockchain networks.
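The core mechanism studied above, injecting noise into trained sub-models before aggregation, can be sketched as below. This is a minimal illustration of the Gaussian and Laplace mechanisms on model weights, not the paper's CCD system; the noise scale is a stand-in for a calibrated privacy budget, and the toy client weights are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_dp_noise(weights, mechanism="gaussian", scale=0.1):
    # Perturb a client's trained sub-model before sharing, so the server
    # never sees the exact local weights. `scale` stands in for the noise
    # multiplier a real (epsilon, delta) budget would dictate.
    if mechanism == "gaussian":
        noise = rng.normal(0.0, scale, size=weights.shape)
    elif mechanism == "laplace":
        noise = rng.laplace(0.0, scale, size=weights.shape)
    else:
        raise ValueError(mechanism)
    return weights + noise

# Three clients share noised sub-models; the server reconstructs the
# global model by averaging them, as in the transmission step above.
clients = [np.full(4, 1.0), np.full(4, 1.2), np.full(4, 0.8)]
global_model = np.mean([add_dp_noise(w, "gaussian") for w in clients], axis=0)
print(global_model.shape)  # (4,)
```

The averaging partially cancels the injected noise, which is exactly why the accuracy/privacy trade-off in the paper depends on noise type and scale.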

[LG-85] Soft Actor-Critic with Beta Policy via Implicit Reparameterization Gradients

链接: https://arxiv.org/abs/2409.04971
作者: Luca Della Libera
关键词-EN: poor sample efficiency, sample efficiency remains, deep reinforcement learning, Recent advances, achieved impressive results
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Recent advances in deep reinforcement learning have achieved impressive results in a wide range of complex tasks, but poor sample efficiency remains a major obstacle to real-world deployment. Soft actor-critic (SAC) mitigates this problem by combining stochastic policy optimization and off-policy learning, but its applicability is restricted to distributions whose gradients can be computed through the reparameterization trick. This limitation excludes several important examples such as the beta distribution, which was shown to improve the convergence rate of actor-critic algorithms in high-dimensional continuous control problems thanks to its bounded support. To address this issue, we investigate the use of implicit reparameterization, a powerful technique that extends the class of reparameterizable distributions. In particular, we use implicit reparameterization gradients to train SAC with the beta policy on simulated robot locomotion environments and compare its performance with common baselines. Experimental results show that the beta policy is a viable alternative, as it outperforms the normal policy and is on par with the squashed normal policy, which is the go-to choice for SAC. The code is available at this https URL.

[LG-86] Attention-Based Efficient Breath Sound Removal in Studio Audio Recordings

链接: https://arxiv.org/abs/2409.04949
作者: Nidula Elgiriyewithana,N. D. Kodikara
关键词-EN: attention U-Net architecture, non-speech vocal sounds, specifically breath sounds, vocal recordings, non-speech vocal
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In this research, we present an innovative, parameter-efficient model that utilizes the attention U-Net architecture for the automatic detection and eradication of non-speech vocal sounds, specifically breath sounds, in vocal recordings. This task is of paramount importance in the field of sound engineering, despite being relatively under-explored. The conventional manual process for detecting and eliminating these sounds requires significant expertise and is extremely time-intensive. Existing automated detection and removal methods often fall short in terms of efficiency and precision. Our proposed model addresses these limitations by offering a streamlined process and superior accuracy, achieved through the application of advanced deep learning techniques. A unique dataset, derived from Device and Produced Speech (DAPS), was employed for this purpose. The training phase of the model emphasizes a log spectrogram and integrates an early stopping mechanism to prevent overfitting. Our model not only conserves precious time for sound engineers but also enhances the quality and consistency of audio production. This constitutes a significant breakthrough, as evidenced by its comparative efficiency, necessitating only 1.9M parameters and a training duration of 3.2 hours - markedly less than the top-performing models in this domain. The model is capable of generating identical outputs as previous models with drastically improved precision, making it an optimal choice.
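The early stopping mechanism mentioned in the training phase is a standard technique; a minimal sketch is below. The patience value and loss trace are illustrative, not the paper's settings.

```python
class EarlyStopping:
    # Stop training when validation loss has not improved for `patience`
    # consecutive epochs, the usual guard against overfitting.
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]
stopped_at = next(i for i, l in enumerate(losses) if stopper.step(l))
print(stopped_at)  # 4: two epochs in a row without improvement
```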

[LG-87] UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

链接: https://arxiv.org/abs/2409.04942
作者: Peng Xie,Minbo Ma,Bin Wang,Junbo Zhang,Tianrui Li
关键词-EN: intelligent transportation systems, urban traffic management, traffic management, development of intelligent, intelligent transportation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of metro Origin-Destination (OD) flow is essential for the development of intelligent transportation systems and effective urban traffic management. Existing approaches typically either predict passenger outflow of departure stations or inflow of destination stations. However, we argue that travelers generally have clearly defined departure and arrival stations, making these OD pairs inherently interconnected. Consequently, considering OD pairs as a unified entity more accurately reflects actual metro travel patterns and allows for analyzing potential spatio-temporal correlations between different OD pairs. To address these challenges, we propose a novel and effective urban metro OD flow prediction method (UMOD), comprising three core modules: a data embedding module, a temporal relation module, and a spatial relation module. The data embedding module projects raw OD pair inputs into hidden space representations, which are subsequently processed by the temporal and spatial relation modules to capture both inter-pair and intra-pair spatio-temporal dependencies. Experimental results on two real-world urban metro OD flow datasets demonstrate that adopting the OD pairs perspective is critical for accurate metro OD flow prediction. Our method outperforms existing approaches, delivering superior predictive performance.
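The unified OD-pair view above can be made concrete: instead of two separate station series, each OD pair is one entity that gets embedded into hidden space. A hedged sketch of such a data embedding step follows; the dimensions and linear projection are illustrative, not UMOD's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, window, hidden = 6, 4, 8   # OD pairs, history length, hidden size

# Raw input: historical flow of each OD pair over a short window,
# shape (pairs, time). Values here are synthetic.
flows = rng.poisson(lam=20.0, size=(n_pairs, window)).astype(float)

# Data embedding module (sketch): a learned linear projection that maps
# each OD pair's raw window into a hidden representation, which the
# temporal and spatial relation modules would then consume.
W = rng.normal(scale=0.1, size=(window, hidden))
b = np.zeros(hidden)
embedded = flows @ W + b            # shape (pairs, hidden)
print(embedded.shape)  # (6, 8)
```

Keeping one row per OD pair is what lets downstream modules model both intra-pair (temporal) and inter-pair (spatial) dependencies.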

[LG-88] An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

链接: https://arxiv.org/abs/2409.04940
作者: Ashkan Moradifirouzabadi,Divya Sri Dodla,Mingu Kang
关键词-EN: calculating pairwise correlations, key computing kernel, entire input sequence, calculating pairwise, pairwise correlations
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 4 pages, 9 figures, to be published at ESSERC 2024

点击查看摘要

Abstract:The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computing complexity and frequent memory access in computing self-attention put a huge burden on the system, especially when the sequence length increases. This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for transformers in 65nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for ~25% unpruned tokens selected by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiency of 14.8 and 1.65 TOPS/W, and peak area efficiency of 976.6 and 79.4 GOPS/mm², in the analog core and the system-on-chip (SoC), respectively.
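The division of labour above, coarse analog scoring followed by exact digital attention on survivors, can be emulated in software. The sketch below only illustrates score-based token pruning at a 25% keep ratio; the scoring proxy and dimensions are invented, not measured chip behaviour.

```python
import numpy as np

def prune_tokens(scores, keep_ratio=0.25):
    # Emulates the analog CIM core's role: rank tokens by a coarse
    # attention score and drop the low-scoring ~75%, so exact attention
    # is computed only for the survivors.
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]          # indices of the top-k tokens
    return np.sort(keep)

seq_len = 16
q = np.random.default_rng(1).normal(size=8)
keys = np.random.default_rng(2).normal(size=(seq_len, 8))
coarse_scores = keys @ q                    # low-precision score proxy
kept = prune_tokens(coarse_scores)
print(len(kept))  # 4 of 16 tokens survive for exact digital attention
```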

[LG-89] Collaborative Learning with Shared Linear Representations: Statistical Rates and Optimal Algorithms

链接: https://arxiv.org/abs/2409.04919
作者: Xiaochun Niu,Lili Su,Jiaming Xu,Pengkun Yang
关键词-EN: learn shared feature, improving model performance, shared feature representations, local data distributions, learning enables multiple
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Collaborative learning enables multiple clients to learn shared feature representations across local data distributions, with the goal of improving model performance and reducing overall sample complexity. While empirical evidence shows the success of collaborative learning, a theoretical understanding of the optimal statistical rate remains lacking, even in linear settings. In this paper, we identify the optimal statistical rate when clients share a common low-dimensional linear representation. Specifically, we design a spectral estimator with local averaging that approximates the optimal solution to the least squares problem. We establish a minimax lower bound to demonstrate that our estimator achieves the optimal error rate. Notably, the optimal rate reveals two distinct phases. In typical cases, our rate matches the standard rate based on the parameter counting of the linear representation. However, a statistical penalty arises in collaborative learning when there are too many clients or when local datasets are relatively small. Furthermore, our results, unlike existing ones, show that, at a system level, collaboration always reduces overall sample complexity compared to independent client learning. In addition, at an individual level, we provide a more precise characterization of when collaboration benefits a client in transfer learning and private fine-tuning.
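The spectral-estimator-with-local-averaging idea can be illustrated on synthetic data: each client contributes a local moment statistic, the server averages them, and the top-k eigenvectors of the averaged matrix estimate the shared subspace. This is a stylized sketch of the idea, not the paper's exact estimator; all dimensions, noise levels, and the moment construction are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m, n = 10, 2, 5, 2000        # ambient dim, rep. dim, clients, samples/client
B = np.linalg.qr(rng.normal(size=(d, k)))[0]   # true shared representation

# Each client observes y = x^T B theta_i + noise with its own theta_i.
M = np.zeros((d, d))
for _ in range(m):
    theta = rng.normal(size=k)
    X = rng.normal(size=(n, d))
    y = X @ B @ theta + 0.1 * rng.normal(size=n)
    v = X.T @ y / n                 # local moment vector, roughly B @ theta
    M += np.outer(v, v) / m         # local averaging across clients

eigvals, eigvecs = np.linalg.eigh(M)
B_hat = eigvecs[:, -k:]             # top-k eigenvectors span the estimate

# Distance between the estimated and true projection matrices is small.
err = np.linalg.norm(B_hat @ B_hat.T - B @ B.T)
print(err < 0.5)  # True
```

With more clients or larger local datasets the averaged moment matrix concentrates faster, which is the mechanism behind the collaboration gains the paper quantifies.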

[LG-90] Activation Function Optimization Scheme for Image Classification

链接: https://arxiv.org/abs/2409.04915
作者: Abdur Rahman,Lu He,Haifeng Wang
关键词-EN: activation functions, Activation, functions, significant impact, Error Linear Unit
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Activation function has a significant impact on the dynamics, convergence, and performance of deep neural networks. The search for a consistent and high-performing activation function has always been a pursuit during deep learning model development. Existing state-of-the-art activation functions are manually designed with human expertise except for Swish. Swish was developed using a reinforcement learning-based search strategy. In this study, we propose an evolutionary approach for optimizing activation functions specifically for image classification tasks, aiming to discover functions that outperform current state-of-the-art options. Through this optimization framework, we obtain a series of high-performing activation functions denoted as Exponential Error Linear Unit (EELU). The developed activation functions are evaluated for image classification tasks from two perspectives: (1) five state-of-the-art neural network architectures, such as ResNet50, AlexNet, VGG16, MobileNet, and Compact Convolutional Transformer which cover computationally heavy to light neural networks, and (2) eight standard datasets, including CIFAR10, Imagenette, MNIST, Fashion MNIST, Beans, Colorectal Histology, CottonWeedID15, and TinyImageNet which cover from typical machine vision benchmark, agricultural image applications to medical image applications. Finally, we statistically investigate the generalization of the resultant activation functions developed through the optimization scheme. With a Friedman test, we conclude that the optimization scheme is able to generate activation functions that outperform the existing standard ones in 92.8% cases among 28 different cases studied, and -x\cdot erf(e^-x) is found to be the best activation function for image classification generated by the optimization scheme.
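The winning activation reported above, -x·erf(e^(-x)), is simple to write down and inspect directly:

```python
import math

def eelu_best(x):
    # The top activation found by the search: f(x) = -x * erf(e^{-x}).
    return -x * math.erf(math.exp(-x))

# For large positive x, erf(e^{-x}) ~ (2/sqrt(pi)) e^{-x}, so f decays to 0;
# for large negative x, erf saturates at 1, so f(x) ~ -x grows linearly.
print(eelu_best(0.0) == 0.0)  # True
print(eelu_best(-3.0) > 2.9)  # True: 3 * erf(e^3) ~ 3
```

Note the function is negative for moderate positive inputs (e.g. f(1) ≈ -0.40), an unconventional shape compared with ReLU-style activations.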

[LG-91] NGD converges to less degenerate solutions than SGD

链接: https://arxiv.org/abs/2409.04913
作者: Moosa Saghir,N. R. Raghavendra,Zihe Liu,Evan Ryan Gunter
关键词-EN: Effective dimension, accurate measure, number of free, dimension, lambda
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 8 pages, 23 figures

点击查看摘要

Abstract:The number of free parameters, or dimension, of a model is a straightforward way to measure its complexity: a model with more parameters can encode more information. However, this is not an accurate measure of complexity: models capable of memorizing their training data often generalize well despite their high dimension. Effective dimension aims to more directly capture the complexity of a model by counting only the number of parameters required to represent the functionality of the model. Singular learning theory (SLT) proposes the learning coefficient \lambda as a more accurate measure of effective dimension. By describing the rate of increase of the volume of the region of parameter space around a local minimum with respect to loss, \lambda incorporates information from higher-order terms. We compare \lambda of models trained using natural gradient descent (NGD) and stochastic gradient descent (SGD), and find that those trained with NGD consistently have a higher effective dimension for both of our methods: the Hessian trace \text{Tr}(\mathbf{H}), and the estimate of the local learning coefficient (LLC) \hat{\lambda}(w^*).
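One of the two effective-dimension proxies above, the Hessian trace, is commonly estimated without materializing the Hessian via Hutchinson's estimator. Below is a generic sketch of that estimator on a toy quadratic, not the paper's code; the Hessian-vector-product oracle and probe count are illustrative.

```python
import numpy as np

def hessian_trace_hutchinson(hvp, dim, n_probes=100, rng=None):
    # Hutchinson's estimator: Tr(H) = E[v^T H v] for Rademacher probes v,
    # so only a Hessian-vector-product oracle `hvp` is needed.
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ hvp(v)
    return total / n_probes

# Toy quadratic loss with known Hessian H = diag(1, 2, 3): Tr(H) = 6.
H = np.diag([1.0, 2.0, 3.0])
estimate = hessian_trace_hutchinson(lambda v: H @ v, dim=3)
print(estimate)  # 6.0 exactly for a diagonal H, since v_i^2 = 1
```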

[LG-92] Unlocking the Potential of Model Calibration in Federated Learning

链接: https://arxiv.org/abs/2409.04901
作者: Yun-Wei Chu,Dong-Jun Han,Seyyedali Hosseinalipour,Christopher Brinton
关键词-EN: primary performance metric, federated learning, past several years, model, machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Over the past several years, various federated learning (FL) methodologies have been developed to improve model accuracy, a primary performance metric in machine learning. However, to utilize FL in practical decision-making scenarios, beyond considering accuracy, the trained model must also have a reliable confidence in each of its predictions, an aspect that has been largely overlooked in existing FL research. Motivated by this gap, we propose Non-Uniform Calibration for Federated Learning (NUCFL), a generic framework that integrates FL with the concept of model calibration. The inherent data heterogeneity in FL environments makes model calibration particularly difficult, as it must ensure reliability across diverse data distributions and client conditions. Our NUCFL addresses this challenge by dynamically adjusting the model calibration objectives based on statistical relationships between each client’s local model and the global model in FL. In particular, NUCFL assesses the similarity between local and global model relationships, and controls the penalty term for the calibration loss during client-side local training. By doing so, NUCFL effectively aligns calibration needs for the global model in heterogeneous FL settings while not sacrificing accuracy. Extensive experiments show that NUCFL offers flexibility and effectiveness across various FL algorithms, enhancing accuracy as well as model calibration.
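The reliability notion NUCFL targets is usually quantified with the expected calibration error (ECE). The sketch below is the standard binned ECE, not the framework's own calibration loss; the toy predictions are invented.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Standard ECE: bin predictions by confidence and average the gap
    # between each bin's mean confidence and its accuracy, weighted by
    # the fraction of samples in the bin.
    confidences, correct = np.asarray(confidences), np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# A model that is 90% confident but only 50% accurate is badly calibrated.
ece_val = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(round(ece_val, 2))  # 0.4
```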

[LG-93] Learning Joint Models of Prediction and Optimization

链接: https://arxiv.org/abs/2409.04898
作者: James Kotary,Vincenzo Di Vito,Jacob Cristopher,Pascal Van Hentenryck,Ferdinando Fioretto
关键词-EN: predict unknown parameters, machine learning models, framework uses machine, machine learning, predict unknown
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2311.13087

点击查看摘要

Abstract:The Predict-Then-Optimize framework uses machine learning models to predict unknown parameters of an optimization problem from exogenous features before solving. This setting is common to many real-world decision processes, and recently it has been shown that decision quality can be substantially improved by solving and differentiating the optimization problem within an end-to-end training loop. However, this approach requires significant computational effort in addition to handcrafted, problem-specific rules for backpropagation through the optimization step, challenging its applicability to a broad class of optimization problems. This paper proposes an alternative method, in which optimal solutions are learned directly from the observable features by joint predictive models. The approach is generic, and based on an adaptation of the Learning-to-Optimize paradigm, from which a rich variety of existing techniques can be employed. Experimental evaluations show the ability of several Learning-to-Optimize methods to provide efficient and accurate solutions to an array of challenging Predict-Then-Optimize problems.

[LG-94] Centralized Selection with Preferences in the Presence of Biases ICML2024

链接: https://arxiv.org/abs/2409.04897
作者: L. Elisa Celis,Amit Kumar,Nisheeth K. Vishnoi,Andrew Xu
关键词-EN: limited capacity, candidates, institutions, multiple institutions, true utility
类目: Data Structures and Algorithms (cs.DS); Computers and Society (cs.CY); Machine Learning (cs.LG); Theoretical Economics (econ.TH); Machine Learning (stat.ML)
*备注: The conference version of this paper appears in ICML 2024

点击查看摘要

Abstract:This paper considers the scenario in which there are multiple institutions, each with a limited capacity for candidates, and candidates, each with preferences over the institutions. A central entity evaluates the utility of each candidate to the institutions, and the goal is to select candidates for each institution in a way that maximizes utility while also considering the candidates’ preferences. The paper focuses on the setting in which candidates are divided into multiple groups and the observed utilities of candidates in some groups are biased–systematically lower than their true utilities. The first result is that, in these biased settings, prior algorithms can lead to selections with sub-optimal true utility and significant discrepancies in the fraction of candidates from each group that get their preferred choices. Subsequently, an algorithm is presented along with proof that it produces selections that achieve near-optimal group fairness with respect to preferences while also nearly maximizing the true utility under distributional assumptions. Further, extensive empirical validation of these results in real-world and synthetic settings, in which the distributional assumptions may not hold, are presented.

[LG-95] Learning to Open and Traverse Doors with a Legged Manipulator

链接: https://arxiv.org/abs/2409.04882
作者: Mike Zhang,Yuntao Ma,Takahiro Miki,Marco Hutter
关键词-EN: significant practical interest, giving robots greater, robots greater access, human-centric spaces, longstanding challenge
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Using doors is a longstanding challenge in robotics and is of significant practical interest in giving robots greater access to human-centric spaces. The task is challenging due to the need for online adaptation to varying door properties and precise control in manipulating the door panel and navigating through the confined doorway. To address this, we propose a learning-based controller for a legged manipulator to open and traverse through doors. The controller is trained using a teacher-student approach in simulation to learn robust task behaviors as well as estimate crucial door properties during the interaction. Unlike previous works, our approach is a single control policy that can handle both push and pull doors through learned behaviour which infers the opening direction during deployment without prior knowledge. The policy was deployed on the ANYmal legged robot with an arm and achieved a success rate of 95.0% in repeated trials conducted in an experimental setting. Additional experiments validate the policy’s effectiveness and robustness to various doors and disturbances. A video overview of the method and experiments can be found at this http URL.

[LG-96] Sequential Classification of Misinformation

链接: https://arxiv.org/abs/2409.04860
作者: Daniel Toma,Wasim Huleihel
关键词-EN: monitoring undesirable effects, information flow, undesirable effects, recent years, growing interest
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 42 pages

点击查看摘要

Abstract:In recent years there has been growing interest in online auditing of information flow over social networks, with the goal of monitoring undesirable effects such as misinformation and fake news. Most previous work on the subject focuses on the binary classification problem of classifying information as fake or genuine. Nonetheless, in many practical scenarios, the multi-class/label setting is of particular importance. For example, it could be the case that a social media platform may want to distinguish between "true", "partly-true", and "false" information. Accordingly, in this paper, we consider the problem of online multiclass classification of information flow. To that end, driven by empirical studies on information flow over real-world social media networks, we propose a probabilistic information flow model over graphs. Then, the learning task is to detect the label of the information flow, with the goal of minimizing a combination of the classification error and the detection time. For this problem, we propose two detection algorithms; the first is based on the well-known multiple sequential probability ratio test, while the second is a novel graph neural network based sequential decision algorithm. For both algorithms, we prove several strong statistical guarantees. We also construct a data-driven algorithm for learning the proposed probabilistic model. Finally, we test our algorithms on two real-world datasets, and show that they outperform other state-of-the-art misinformation detection algorithms in terms of detection time and classification error.
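The first of the two detectors above, the multiple sequential probability ratio test, has a simple textbook form: accumulate per-class log-likelihoods and stop as soon as one class beats all rivals by a threshold. The sketch below is that generic form with an illustrative threshold and deterministic toy observations; the paper's detector additionally relies on its graph-based flow model.

```python
import numpy as np

def msprt(log_likes, threshold=5.0):
    # log_likes: (time, classes) per-observation log-likelihoods.
    # Stop when the leading class's cumulative log-likelihood exceeds
    # every rival's by `threshold`; return (label, detection time).
    cum = np.zeros(log_likes.shape[1])
    for t, ll in enumerate(log_likes):
        cum += ll
        best = int(np.argmax(cum))
        rival = np.max(np.delete(cum, best))
        if cum[best] - rival >= threshold:
            return best, t + 1
    return int(np.argmax(cum)), len(log_likes)

# Three labels ("true", "partly-true", "false"); observations favour class 1
# by 1.5 nats per step, so the test decides quickly and deterministically.
lls = np.tile([-0.5, 1.0, -0.5], (20, 1))
label, time_taken = msprt(lls)
print(label, time_taken)  # 1 4
```

Raising the threshold trades longer detection time for lower classification error, the exact trade-off the paper's objective combines.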

[LG-97] FedModule: A Modular Federated Learning Framework

链接: https://arxiv.org/abs/2409.04849
作者: Chuyi Chen,Zhe Zhang,Yanchao Zhao
关键词-EN: smart cities, widely adopted, Federated learning, complex experimental scenarios, experimental scenarios
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has been widely adopted across various applications, such as healthcare, finance, and smart cities. However, as experimental scenarios become more complex, existing FL frameworks and benchmarks have struggled to keep pace. This paper introduces FedModule, a flexible and extensible FL experimental framework that has been open-sourced to support diverse FL paradigms and provide comprehensive benchmarks for complex experimental scenarios. FedModule adheres to the “one code, all scenarios” principle and employs a modular design that breaks the FL process into individual components, allowing for the seamless integration of different FL paradigms. The framework supports synchronous, asynchronous, and personalized federated learning, with over 20 implemented algorithms. Experiments conducted on public datasets demonstrate the flexibility and extensibility of FedModule. The framework offers multiple execution modes, including linear, threaded, process-based, and distributed, enabling users to tailor their setups to various experimental needs. Additionally, FedModule provides extensive logging and testing capabilities, which facilitate detailed performance analysis of FL algorithms. Comparative evaluations against existing FL toolkits, such as TensorFlow Federated, PySyft, Flower, and FLGo, highlight FedModule’s superior scalability, flexibility, and comprehensive benchmark support. By addressing the limitations of current FL frameworks, FedModule marks a significant advancement in FL experimentation, providing researchers and practitioners with a robust tool for developing and evaluating FL algorithms across a wide range of scenarios.

[LG-98] Sample- and Oracle-Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions

链接: https://arxiv.org/abs/2409.04840
作者: Zakaria Mhammedi
关键词-EN: feasible reinforcement learning, Markov Decision Processes, Designing sample-efficient, computationally feasible reinforcement, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are exponential in the horizon.

[LG-99] Reward-Directed Score-Based Diffusion Models via q-Learning

链接: https://arxiv.org/abs/2409.04832
作者: Xuefeng Gao,Jiale Zha,Xun Yu Zhou
关键词-EN: maximize reward functions, generated distributions close, unknown target data, training continuous-time score-based, target data distributions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose a new reinforcement learning (RL) formulation for training continuous-time score-based diffusion models for generative AI to generate samples that maximize reward functions while keeping the generated distributions close to the unknown target data distributions. Different from most existing studies, our formulation does not involve any pretrained model for the unknown score functions of the noise-perturbed data distributions. We present an entropy-regularized continuous-time RL problem and show that the optimal stochastic policy has a Gaussian distribution with a known covariance matrix. Based on this result, we parameterize the mean of Gaussian policies and develop an actor-critic type (little) q-learning algorithm to solve the RL problem. A key ingredient in our algorithm design is to obtain noisy observations from the unknown score function via a ratio estimator. Numerically, we show the effectiveness of our approach by comparing its performance with two state-of-the-art RL methods that fine-tune pretrained models. Finally, we discuss extensions of our RL formulation to probability flow ODE implementation of diffusion models and to conditional diffusion models.

[LG-100] MILE: A Mutation Testing Framework of In-Context Learning Systems

链接: https://arxiv.org/abs/2409.04831
作者: Zeming Wei,Yihao Zhang,Meng Sun
关键词-EN: achieved notable success, large language models, achieved notable, notable success, applications of large
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. Such mysterious ability of LLMs has attracted great research interests in understanding, formatting, and improving the in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity against the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at this https URL.
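The notions of mutation operators on demonstrations and a mutation score for an ICL test set can be sketched concretely. The operators and scoring below are illustrative stand-ins for the framework's ideas, not its actual implementation, and the "ICL system" is a trivial majority-label toy rather than an LLM.

```python
def mutate_reverse(demos):
    # Mutation operator: perturb the demonstration order.
    return demos[::-1]

def mutate_label_flip(demos, i):
    # Mutation operator: corrupt the label of demonstration i.
    out = list(demos)
    x, y = out[i]
    out[i] = (x, "neg" if y == "pos" else "pos")
    return out

def mutation_score(test_inputs, predict, original, mutants):
    # Fraction of mutants "killed": a mutant is killed when some test
    # input makes the system's output differ from the unmutated run.
    base = [predict(original, x) for x in test_inputs]
    killed = sum(
        any(predict(m, x) != b for x, b in zip(test_inputs, base))
        for m in mutants
    )
    return killed / len(mutants)

# Toy stand-in for an ICL system: the majority demonstration label wins.
demos = [("good", "pos"), ("great", "pos"), ("bad", "neg")]
predict = lambda ds, x: max(("pos", "neg"), key=[y for _, y in ds].count)
mutants = [mutate_reverse(demos), mutate_label_flip(demos, 0)]
score = mutation_score(["any input"], predict, demos, mutants)
print(score)  # 0.5: the label flip is killed, the reorder is not
```

A higher mutation score indicates a test suite that is more sensitive to demonstration quality, which is what the framework uses to characterize ICL test data.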

[LG-101] NASH: Neural Architecture and Accelerator Search for Multiplication-Reduced Hybrid Models

链接: https://arxiv.org/abs/2409.04829
作者: Yang Xu,Huihong Shi,Zhongfeng Wang
关键词-EN: significant computational cost, deep neural networks, uparrow, edge devices, Search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The significant computational cost of multiplications hinders the deployment of deep neural networks (DNNs) on edge devices. While multiplication-free models offer enhanced hardware efficiency, they typically sacrifice accuracy. As a solution, multiplication-reduced hybrid models have emerged to combine the benefits of both approaches. Particularly, prior works, i.e., NASA and NASA-F, leverage Neural Architecture Search (NAS) to construct such hybrid models, enhancing hardware efficiency while maintaining accuracy. However, they either entail costly retraining or encounter gradient conflicts, limiting both search efficiency and accuracy. Additionally, they overlook the acceleration opportunity introduced by accelerator search, yielding sub-optimal hardware performance. To overcome these limitations, we propose NASH, a Neural architecture and Accelerator Search framework for multiplication-reduced Hybrid models. Specifically, as for NAS, we propose a tailored zero-shot metric to pre-identify promising hybrid models before training, enhancing search efficiency while alleviating gradient conflicts. Regarding accelerator search, we innovatively introduce coarse-to-fine search to streamline the search process. Furthermore, we seamlessly integrate these two levels of searches to unveil NASH, obtaining the optimal model and accelerator pairing. Experiments validate our effectiveness, e.g., when compared with the state-of-the-art multiplication-based system, we can achieve ↑2.14× throughput and ↑2.01× FPS with ↑0.25% accuracy on CIFAR-100, and ↑1.40× throughput and ↑1.19× FPS with ↑0.56% accuracy on Tiny-ImageNet. Codes are available at this https URL.

[LG-102] FreeAugment: Data Augmentation Search Across All Degrees of Freedom ECCV2024

链接: https://arxiv.org/abs/2409.04820
作者: Tom Bekor,Niv Nayman,Lihi Zelnik-Manor
关键词-EN: automatic data augmentation, data augmentation search, Data augmentation, neural networks, integral part
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Data augmentation has become an integral part of deep learning, as it is known to improve the generalization capabilities of neural networks. Since the most effective set of image transformations differs between tasks and domains, automatic data augmentation search aims to alleviate the extreme burden of manually finding the optimal image transformations. However, current methods are not able to jointly optimize all degrees of freedom: (1) the number of transformations to be applied, their (2) types, (3) order, and (4) magnitudes. Many existing methods risk picking the same transformation more than once, limit the search to two transformations only, or search for the number of transformations exhaustively or iteratively in a myopic manner. Our approach, FreeAugment, is the first to achieve global optimization of all four degrees of freedom simultaneously, using a fully differentiable method. It efficiently learns the number of transformations and a probability distribution over their permutations, inherently refraining from redundant repetition while sampling. Our experiments demonstrate that this joint learning of all degrees of freedom significantly improves performance, achieving state-of-the-art results on various natural image benchmarks and beyond across other domains. Project page at this https URL

[LG-103] Generalized Learning of Coefficients in Spectral Graph Convolutional Networks

链接: https://arxiv.org/abs/2409.04813
作者: Mustafa Coşkun,Ananth Grama,Mehmet Koyutürk
关键词-EN: Spectral Graph Convolutional, Graph Convolutional Networks, Convolutional Networks, Spectral Graph, Graph Convolutional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spectral Graph Convolutional Networks (GCNs) have gained popularity in graph machine learning applications due, in part, to their flexibility in specification of network propagation rules. These propagation rules are often constructed as polynomial filters whose coefficients are learned using label information during training. In contrast to learned polynomial filters, explicit filter functions are useful in capturing relationships between network topology and distribution of labels across the network. A number of algorithms incorporating either approach have been proposed; however the relationship between filter functions and polynomial approximations is not fully resolved. This is largely due to the ill-conditioned nature of the linear systems that must be solved to derive polynomial approximations of filter functions. To address this challenge, we propose a novel Arnoldi orthonormalization-based algorithm, along with a unifying approach, called G-Arnoldi-GCN that can efficiently and effectively approximate a given filter function with a polynomial. We evaluate G-Arnoldi-GCN in the context of multi-class node classification across ten datasets with diverse topological characteristics. Our experiments show that G-Arnoldi-GCN consistently outperforms state-of-the-art methods when suitable filter functions are employed. Overall, G-Arnoldi-GCN opens important new directions in graph machine learning by enabling the explicit design and application of diverse filter functions. Code link: https://anonymous.4open.science/r/GArnoldi-GCN-F7E2/README.md
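The core idea — replacing an explicit filter function with a polynomial surrogate so it can be applied via powers of the Laplacian — can be sketched in a few lines. The snippet below is our own illustration, not the paper's G-Arnoldi-GCN code: it fits a Chebyshev polynomial to h(λ) = exp(−λ), a well-conditioned basis that sidesteps the ill-conditioning of a raw monomial fit, which is the problem the paper's Arnoldi orthonormalization addresses.

```python
import numpy as np

# Approximate an explicit spectral filter h(lambda) = exp(-lambda) by a
# degree-8 polynomial over the normalized-Laplacian spectrum [0, 2].
h = lambda lam: np.exp(-lam)
grid = np.linspace(0.0, 2.0, 200)
cheb = np.polynomial.chebyshev.Chebyshev.fit(grid, h(grid), deg=8, domain=[0.0, 2.0])

# Check the surrogate on the spectrum of a toy normalized Laplacian
# (3-node path graph); filtering with p(L) then matches U h(Lambda) U^T.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
L = np.eye(3) - d_inv_sqrt @ A @ d_inv_sqrt
eigvals = np.linalg.eigvalsh(L)
max_err = np.max(np.abs(cheb(eigvals) - h(eigvals)))
```

A degree-8 fit already approximates the filter to near machine precision on this spectrum, which is why learned or approximated polynomial filters can stand in for explicit filter functions.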

[LG-104] Phrase-Level Adversarial Training for Mitigating Bias in Neural Network-based Automatic Essay Scoring

链接: https://arxiv.org/abs/2409.04795
作者: Haddad Philip,Tsegaye Misikir Tashu
关键词-EN: Automatic Essay Scoring, Automatic Essay, educational purposes, candidates for educational, AES
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic Essay Scoring (AES) is widely used to evaluate candidates for educational purposes. However, due to the lack of representative data, most existing AES systems are not robust, and their scoring predictions are biased towards the most represented data samples. In this study, we propose a model-agnostic phrase-level method to generate an adversarial essay set to address the biases and robustness of AES models. Specifically, we construct an attack test set comprising samples from the original test set and adversarially generated samples using our proposed method. To evaluate the effectiveness of the attack strategy and data augmentation, we conducted a comprehensive analysis utilizing various neural network scoring models. Experimental results show that the proposed approach significantly improves AES model performance in the presence of adversarial examples and scenarios without such attacks.

[LG-105] Beyond One-Time Validation: A Framework for Adaptive Validation of Prognostic and Diagnostic AI-based Medical Devices

链接: https://arxiv.org/abs/2409.04794
作者: Florian Hellmeier,Kay Brosien,Carsten Eickhoff,Alexander Meyer
关键词-EN: diagnostic AI-based medical, hold immense promise, Prognostic and diagnostic, AI-based medical devices, medical devices hold
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 13 pages, 1 figure

点击查看摘要

Abstract:Prognostic and diagnostic AI-based medical devices hold immense promise for advancing healthcare, yet their rapid development has outpaced the establishment of appropriate validation methods. Existing approaches often fall short in addressing the complexity of practically deploying these devices and ensuring their effective, continued operation in real-world settings. Building on recent discussions around the validation of AI models in medicine and drawing from validation practices in other fields, a framework to address this gap is presented. It offers a structured, robust approach to validation that helps ensure device reliability across differing clinical environments. The primary challenges to device performance upon deployment are discussed while highlighting the impact of changes related to individual healthcare institutions and operational processes. The presented framework emphasizes the importance of repeating validation and fine-tuning during deployment, aiming to mitigate these issues while being adaptable to challenges unforeseen during device development. The framework is also positioned within the current US and EU regulatory landscapes, underscoring its practical viability and relevance considering regulatory requirements. Additionally, a practical example demonstrating potential benefits of the framework is presented. Lastly, guidance on assessing model performance is offered and the importance of involving clinical stakeholders in the validation and fine-tuning process is discussed.

[LG-106] Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn

链接: https://arxiv.org/abs/2409.04792
作者: Hongyao Tang,Glen Berseth
关键词-EN: Deep neural networks, provide Reinforcement Learning, networks provide Reinforcement, large-scale decision-making problems, address large-scale decision-making
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep neural networks provide Reinforcement Learning (RL) powerful function approximators to address large-scale decision-making problems. However, these approximators introduce challenges due to the non-stationary nature of RL training. One source of the challenges in RL is that output predictions can churn, leading to uncontrolled changes after each batch update for states not included in the batch. Although such a churn phenomenon exists in each step of network training, how churn occurs and impacts RL remains under-explored. In this work, we start by characterizing churn in a view of Generalized Policy Iteration with function approximation, and we discover a chain effect of churn that leads to a cycle where the churns in value estimation and policy improvement compound and bias the learning dynamics throughout the iteration. Further, we concretize the study and focus on the learning issues caused by the chain effect in different settings, including greedy action deviation in value-based methods, trust region violation in proximal policy optimization, and dual bias of policy value in actor-critic methods. We then propose a method to reduce the chain effect across different settings, called Churn Approximated ReductIoN (CHAIN), which can be easily plugged into most existing DRL algorithms. Our experiments demonstrate the effectiveness of our method in both reducing churn and improving learning performance across online and offline, value-based and policy-based RL settings, as well as a scaling setting.

[LG-107] forester: A Tree-Based AutoML Tool in R

链接: https://arxiv.org/abs/2409.04789
作者: Hubert Ruczyński,Anna Kozak
关键词-EN: automated machine learning, developed in Python, machine learning, majority of automated, large percentage
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The majority of automated machine learning (AutoML) solutions are developed in Python, however a large percentage of data scientists are associated with the R language. Unfortunately, there are limited R solutions available. Moreover high entry level means they are not accessible to everyone, due to required knowledge about machine learning (ML). To fill this gap, we present the forester package, which offers ease of use regardless of the user’s proficiency in the area of machine learning. The forester is an open-source AutoML package implemented in R designed for training high-quality tree-based models on tabular data. It fully supports binary and multiclass classification, regression, and partially survival analysis tasks. With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis.

[LG-108] Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models

链接: https://arxiv.org/abs/2409.04787
作者: Sonam Gupta,Yatin Nandwani,Asaf Yehudai,Mayank Mishra,Gaurav Pandey,Dinesh Raghu,Sachindra Joshi
关键词-EN: Fine-tuning Large Language, Large Language Models, Large Language, Fine-tuning Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) on specific datasets is a common practice to improve performance on target tasks. However, this performance gain often leads to overfitting, where the model becomes too specialized in either the task or the characteristics of the training data, resulting in a loss of generalization. This paper introduces Selective Self-Rehearsal (SSR), a fine-tuning approach that achieves performance comparable to the standard supervised fine-tuning (SFT) while improving generalization. SSR leverages the fact that there can be multiple valid responses to a query. By utilizing the model’s correct responses, SSR reduces model specialization during the fine-tuning stage. SSR first identifies the correct model responses from the training set by deploying an appropriate LLM as a judge. Then, it fine-tunes the model using the correct model responses and the gold response for the remaining samples. The effectiveness of SSR is demonstrated through experiments on the task of identifying unanswerable queries across various datasets. The results show that standard SFT can lead to an average performance drop of up to 16.7% on multiple benchmarks, such as MMLU and TruthfulQA. In contrast, SSR results in close to 2% drop on average, indicating better generalization capabilities compared to standard SFT.
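The data-construction step SSR describes — keep the model's own responses when an LLM judge marks them correct, otherwise fall back to the gold response — can be sketched as a tiny routine. Field names (`query`, `gold`, `model_response`, `judge_correct`) are ours for illustration, not the paper's code.

```python
# Build fine-tuning targets per the SSR recipe: prefer the model's own
# judge-approved response, fall back to the gold response otherwise.
def build_ssr_targets(examples):
    return [
        (ex["query"], ex["model_response"] if ex["judge_correct"] else ex["gold"])
        for ex in examples
    ]

demo = [
    {"query": "q1", "gold": "g1", "model_response": "m1", "judge_correct": True},
    {"query": "q2", "gold": "g2", "model_response": "m2", "judge_correct": False},
]
targets = build_ssr_targets(demo)
```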

[LG-109] Component Fourier Neural Operator for Singularly Perturbed Differential Equations

链接: https://arxiv.org/abs/2409.04779
作者: Ye Li,Ting Du,Yiwen Pang,Zhongyi Huang
关键词-EN: Singularly Perturbed Differential, Solving Singularly Perturbed, Perturbed Differential Equations, Singularly Perturbed, poses computational challenges
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this manuscript, we introduce Component Fourier Neural Operator (ComFNO), an innovative operator learning method that builds upon Fourier Neural Operator (FNO), while simultaneously incorporating valuable prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as Deep Operator Network (DeepONet), leading to potential similar SPDEs solvers. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO exhibits natural adaptability to diverse data distributions and performs well in few-shot scenarios, showcasing its excellent generalization ability in practical situations.

[LG-110] LoCa: Logit Calibration for Knowledge Distillation ECAI2024

链接: https://arxiv.org/abs/2409.04778
作者: Runming Yang,Taiqiang Wu,Yujiu Yang
关键词-EN: aiming to train, plays an important, important role, model compression, teacher model
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by ECAI 2024

点击查看摘要

Abstract:Knowledge Distillation (KD), aiming to train a better student model by mimicking the teacher model, plays an important role in model compression. One typical way is to align the output logits. However, we find a common issue named mis-instruction, that the student would be misled when the predictions based on teacher logits do not follow the labels. Meanwhile, there is other useful dark knowledge in the logits such as the class discriminability, which is vital for distillation. In this paper, we propose a simple yet effective Logit Calibration (LoCa) method, which calibrates the logits from the teacher model based on the ground-truth labels. The key insight is to correct the prediction (to address the mis-instruction issue) and maintain useful dark knowledge simultaneously. Our proposed LoCa does not require any additional parameters. Empirical results on image classification and text generation tasks demonstrate that LoCa can effectively improve the performance of baselines.
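One plausible calibration rule consistent with the abstract (the exact rule may differ from the paper's) is: when the teacher's argmax disagrees with the label, raise only the true-class logit just above the current maximum, leaving every other logit — the "dark knowledge" about class similarity — untouched. A minimal sketch:

```python
import numpy as np

# Hedged sketch of logit calibration in the spirit of LoCa (rule is ours):
# correct the mis-instruction while preserving the non-target logits.
def calibrate_logits(teacher_logits, labels, margin=0.1):
    z = teacher_logits.copy()
    wrong = np.where(z.argmax(1) != labels)[0]          # mis-instructed rows
    z[wrong, labels[wrong]] = teacher_logits[wrong].max(1) + margin
    return z

logits = np.array([[2.0, 1.0, 0.5],    # argmax 0, label 1 -> mis-instruction
                   [0.1, 0.2, 3.0]])   # argmax 2, label 2 -> already correct
labels = np.array([1, 2])
cal = calibrate_logits(logits, labels)
```

Note the method adds no parameters, matching the abstract's claim: the calibration is a deterministic function of the teacher logits and ground-truth labels.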

[LG-111] Optimization Hyper-parameter Laws for Large Language Models

链接: https://arxiv.org/abs/2409.04777
作者: Xingyu Xie,Kuangyu Ding,Shuicheng Yan,Kim-Chuan Toh,Tianwen Wei
关键词-EN: Large Language Models, Large Language, Language Models, significant AI advancements, Optimization Hyper-parameter Laws
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potential optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduce novel mathematical interpretability and offer a robust theoretical foundation for some popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws’ ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.

[LG-112] Cross-Dataset Gaze Estimation by Evidential Inter-intra Fusion ACM-MM2024

链接: https://arxiv.org/abs/2409.04766
作者: Shijing Wang,Yaping Huang,Jun Xie,YiTian,Feng Chen,Zhepeng Wang
关键词-EN: environments remains challenging, Achieving accurate, diverse environments remains, reliable gaze predictions, remains challenging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This paper was previously submitted to ACM MM 2024

点击查看摘要

Abstract:Achieving accurate and reliable gaze predictions in complex and diverse environments remains challenging. Fortunately, it is straightforward to access diverse gaze datasets in real-world applications. We discover that training these datasets jointly can significantly improve the generalization of gaze estimation, which is overlooked in previous works. However, due to the inherent distribution shift across different datasets, simply mixing multiple datasets decreases the performance in the original domain despite gaining better generalization abilities. To address the problem of "cross-dataset gaze estimation", we propose a novel Evidential Inter-intra Fusion (EIF) framework for training a cross-dataset model that performs well across all source and unseen domains. Specifically, we build independent single-dataset branches for various datasets, where the data space is partitioned into overlapping subspaces within each dataset for local regression, and further create a cross-dataset branch to integrate the generalizable features from single-dataset branches. Furthermore, evidential regressors based on the Normal Inverse-Gamma (NIG) distribution are designed to additionally provide uncertainty estimation apart from predicting gaze. Building upon this foundation, our proposed framework achieves both intra-evidential fusion among multiple local regressors within each dataset and inter-evidential fusion among multiple branches via the Mixture of Normal Inverse-Gamma (MoNIG) distribution. Experiments demonstrate that our method consistently achieves notable improvements in both source domains and unseen domains.

[LG-113] Adaptative Context Normalization: A Boost for Deep Learning in Image Processing

链接: https://arxiv.org/abs/2409.04759
作者: Bilal Faye,Hanane Azzag,Mustapha Lebbah,Djamel Bouchaffra
关键词-EN: Deep Neural network, Neural network learning, Deep Neural, faces major challenges, major challenges related
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2403.16798

点击查看摘要

Abstract:Deep Neural network learning for image processing faces major challenges related to changes in distribution across layers, which disrupt model convergence and performance. Activation normalization methods, such as Batch Normalization (BN), have revolutionized this field, but they rely on the simplified assumption that data distribution can be modelled by a single Gaussian distribution. To overcome these limitations, Mixture Normalization (MN) introduced an approach based on a Gaussian Mixture Model (GMM), assuming multiple components to model the data. However, this method entails substantial computational requirements associated with the use of the Expectation-Maximization algorithm to estimate the parameters of each Gaussian component. To address this issue, we introduce Adaptative Context Normalization (ACN), a novel supervised approach that introduces the concept of “context”, which groups together a set of data with similar characteristics. Data belonging to the same context are normalized using the same parameters, enabling local representation based on contexts. For each context, the normalization parameters are learned as model weights during the backpropagation phase. ACN not only ensures speed, convergence, and superior performance compared to BN and MN but also presents a fresh perspective that underscores its particular efficacy in the field of image processing.
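The "context" idea can be sketched in a few lines: every sample is normalized with the statistics of its context group. This is our own minimal, inference-style illustration — in ACN the per-context scale and shift are learned by backpropagation, which we omit here.

```python
import numpy as np

# Normalize each sample with the mean/variance of its context group.
def context_normalize(x, ctx, eps=1e-5):
    out = np.empty_like(x)
    for c in np.unique(ctx):
        m = ctx == c
        out[m] = (x[m] - x[m].mean(0)) / np.sqrt(x[m].var(0) + eps)
    return out

rng = np.random.default_rng(0)
x = rng.normal(loc=[0.0, 5.0], scale=[1.0, 3.0], size=(200, 2))
ctx = (rng.random(200) < 0.5).astype(int)   # two synthetic contexts
z = context_normalize(x, ctx)
```

After normalization, each context group is standardized independently, which is what enables the "local representation based on contexts" the abstract describes.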

[LG-114] Unsupervised Adaptive Normalization

链接: https://arxiv.org/abs/2409.04757
作者: Bilal Faye,Hanane Azzag,Mustapha Lebbah,Fangchen Fang
关键词-EN: solving intricate problems, intricate problems, proving their mettle, array of applications, Normalization
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2403.16798

点击查看摘要

Abstract:Deep neural networks have become a staple in solving intricate problems, proving their mettle in a wide array of applications. However, their training process is often hampered by shifting activation distributions during backpropagation, resulting in unstable gradients. Batch Normalization (BN) addresses this issue by normalizing activations, which allows for the use of higher learning rates. Despite its benefits, BN is not without drawbacks, including its dependence on mini-batch size and the presumption of a uniform distribution of samples. To overcome this, several alternatives have been proposed, such as Layer Normalization, Group Normalization, and Mixture Normalization. These methods may still struggle to adapt to the dynamic distributions of neuron activations during the learning process. To bridge this gap, we introduce Unsupervised Adaptive Normalization (UAN), an innovative algorithm that seamlessly integrates clustering for normalization with deep neural network learning in a single process. UAN performs clustering using a Gaussian mixture model and determines parameters for each identified cluster to normalize neuron activations. These parameters are concurrently updated as weights in the deep neural network, aligning with the specific requirements of the target task during backpropagation. This unified approach of clustering and normalization, underpinned by neuron activation normalization, fosters an adaptive data representation that is specifically tailored to the target task. This adaptive feature of UAN enhances gradient stability, resulting in faster learning and improved neural network performance. UAN outperforms the classical methods by adapting to the target task and is effective in classification and domain adaptation.

[LG-115] Explicit Mutual Information Maximization for Self-Supervised Learning

链接: https://arxiv.org/abs/2409.04747
作者: Lele Chang,Peilin Liu,Qinghai Guo,Fei Wen
关键词-EN: self-supervised learning, extensively studied, SSL, Recently, MIM
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, self-supervised learning (SSL) has been extensively studied. Theoretically, mutual information maximization (MIM) is an optimal criterion for SSL, with a strong theoretical foundation in information theory. However, it is difficult to directly apply MIM in SSL since the data distribution is not analytically available in applications. In practice, many existing methods can be viewed as approximate implementations of the MIM criterion. This work shows that, based on the invariance property of MI, explicit MI maximization can be applied to SSL under a generic distribution assumption, i.e., a relaxed condition of the data distribution. We further illustrate this by analyzing the generalized Gaussian distribution. Based on this result, we derive a loss function based on the MIM criterion using only second-order statistics. We implement the new loss for SSL and demonstrate its effectiveness via extensive experiments.
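A loss "based on the MIM criterion using only second-order statistics" can be sketched as follows. The exact form below is ours, not the authors': it aligns paired embeddings while maximizing a Gaussian entropy proxy — the log-determinant of the batch covariance — which penalizes representational collapse.

```python
import numpy as np

# Hedged sketch of a second-order-statistics SSL objective: alignment term
# minus a log-det entropy proxy (log|Sigma| is the Gaussian differential
# entropy up to constants).
def second_order_ssl_loss(z1, z2, eps=1e-4):
    n, d = z1.shape
    align = ((z1 - z2) ** 2).sum(axis=1).mean()
    zc = z1 - z1.mean(axis=0)
    cov = zc.T @ zc / (n - 1) + eps * np.eye(d)
    entropy_proxy = np.linalg.slogdet(cov)[1]
    return align - entropy_proxy

rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 8))          # healthy, spread-out embeddings
collapsed = np.ones((64, 8))               # fully collapsed embeddings
loss_spread = second_order_ssl_loss(spread, spread)
loss_collapsed = second_order_ssl_loss(collapsed, collapsed)
```

Collapsed embeddings have near-zero covariance, so the log-det term blows the loss up, while spread embeddings keep it low — the anti-collapse behavior such second-order criteria are designed to provide.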

[LG-116] LMGT: Optimizing Exploration-Exploitation Balance in Reinforcement Learning through Language Model Guided Trade-offs

链接: https://arxiv.org/abs/2409.04744
作者: Yongxin Deng,Xihe Qiu,Xiaoyu Tan,Wei Chu,Yinghui Xu
关键词-EN: environmental transition model, agent expected reward, Reinforcement Learning, necessitates a careful, uncertainty inherent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The uncertainty inherent in the environmental transition model of Reinforcement Learning (RL) necessitates a careful balance between exploration and exploitation to optimize the use of computational resources for accurately estimating an agent’s expected reward. Achieving balance in control systems is particularly challenging in scenarios with sparse rewards. However, given the extensive prior knowledge available for many environments, it is redundant to begin learning from scratch in such settings. To address this, we introduce Language Model Guided Trade-offs (LMGT), a novel, sample-efficient framework that leverages the comprehensive prior knowledge embedded in Large Language Models (LLMs) and their adeptness at processing non-standard data forms, such as wiki tutorials. LMGT proficiently manages the exploration-exploitation trade-off by employing reward shifts guided by LLMs, which direct agents’ exploration endeavors, thereby improving sample efficiency. We have thoroughly tested LMGT across various RL tasks and deployed it in industrial-grade RL recommendation systems, where it consistently outperforms baseline methods. The results indicate that our framework can significantly reduce the time cost required during the training phase in RL.

[LG-117] GRVFL-2V: Graph Random Vector Functional Link Based on Two-View Learning

链接: https://arxiv.org/abs/2409.04743
作者: M. Tanveer,R. K. Sharma,M. Sajid,A. Quadir
关键词-EN: proposed model, randomized neural network, RVFL, random vector functional, vector functional link
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The classification performance of the random vector functional link (RVFL), a randomized neural network, has been widely acknowledged. However, due to its shallow learning nature, RVFL often fails to consider all the relevant information available in a dataset. Additionally, it overlooks the geometrical properties of the dataset. To address these limitations, a novel graph random vector functional link based on two-view learning (GRVFL-2V) model is proposed. The proposed model is trained on multiple views, incorporating the concept of multiview learning (MVL), and it also incorporates the geometrical properties of all the views using the graph embedding (GE) framework. The fusion of RVFL networks, MVL, and the GE framework enables our proposed model to achieve the following: i) efficient learning: by leveraging the topology of RVFL, our proposed model can efficiently capture nonlinear relationships within the multi-view data, facilitating efficient and accurate predictions; ii) comprehensive representation: fusing information from diverse perspectives enhances the proposed model’s ability to capture complex patterns and relationships within the data, thereby improving the model’s overall generalization performance; and iii) structural awareness: by employing the GE framework, our proposed model leverages the original data distribution of the dataset by naturally exploiting both intrinsic and penalty subspace learning criteria. The evaluation of the proposed GRVFL-2V model on various datasets, including 27 UCI and KEEL datasets, 50 datasets from Corel5k, and 45 datasets from AwA, demonstrates its superior performance compared to baseline models. These results highlight the enhanced generalization capabilities of the proposed GRVFL-2V model across a diverse range of datasets.
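The base learner here, the RVFL network, is compact enough to sketch: random hidden features plus direct input-to-output links, with the output weights solved in closed form by ridge regression. This is our own minimal single-view illustration; the multiview and graph-embedding parts of GRVFL-2V are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal RVFL: frozen random hidden layer, direct links, ridge solution.
def rvfl_fit(X, Y, n_hidden=64, lam=1e-2):
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    D = np.hstack([X, np.tanh(X @ W + b)])    # direct links + random features
    beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    return np.hstack([X, np.tanh(X @ W + b)]) @ beta

# Toy two-class problem: two well-separated Gaussian blobs, one-hot targets.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)
Y = np.eye(2)[y]
W, b, beta = rvfl_fit(X, Y)
acc = (rvfl_predict(X, W, b, beta).argmax(1) == y).mean()
```

Because only `beta` is trained, fitting is a single linear solve — the "efficient learning" property the abstract builds on.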

[LG-118] Up-sampling-only and Adaptive Mesh-based GNN for Simulating Physical Systems

链接: https://arxiv.org/abs/2409.04740
作者: Fu Lin,Jiasheng Shi,Shijie Luo,Qinpei Zhao,Weixiong Rao,Lei Chen
关键词-EN: Partial Differential Equations, Finite Element Method, Differential Equations, Element Method, Partial Differential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Traditional simulation of complex mechanical systems relies on numerical solvers of Partial Differential Equations (PDEs), e.g., using the Finite Element Method (FEM). The FEM solvers frequently suffer from intensive computation cost and high running time. Recent graph neural network (GNN)-based simulation models can improve running time while maintaining acceptable accuracy. Unfortunately, it is hard to tailor GNNs for complex mechanical systems, due to such disadvantages as ineffective representation and inefficient message propagation (MP). To tackle these issues, in this paper, with the proposed Up-sampling-only and Adaptive MP techniques, we develop a novel hierarchical Mesh Graph Network, namely UA-MGN, for efficient and effective mechanical simulation. Evaluation on two synthetic and one real datasets demonstrates the superiority of the UA-MGN. For example, on the Beam dataset, compared to the state-of-the-art MS-MGN, UA-MGN leads to 40.99% lower errors while using 43.48% fewer network parameters and 4.49% fewer floating point operations (FLOPs).

[LG-119] A Sample Efficient Alternating Minimization-based Algorithm For Robust Phase Retrieval

链接: https://arxiv.org/abs/2409.04733
作者: Adarsh Barik,Anand Krishna,Vincent Y. F. Tan
关键词-EN: arbitrarily corrupted magnitude-only, potentially arbitrarily corrupted, magnitude-only linear measurements, robust phase retrieval, unknown signal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we study the robust phase retrieval problem where the task is to recover an unknown signal $\theta^* \in \mathbb{R}^d$ in the presence of potentially arbitrarily corrupted magnitude-only linear measurements. We propose an alternating minimization approach that incorporates an oracle solver for a non-convex optimization problem as a subroutine. Our algorithm guarantees convergence to $\theta^*$ and provides an explicit polynomial dependence of the convergence rate on the fraction of corrupted measurements. We then provide an efficient construction of the aforementioned oracle under a sparse arbitrary outliers model and offer valuable insights into the geometric properties of the loss landscape in phase retrieval with corrupted measurements. Our proposed oracle avoids the need for computationally intensive spectral initialization, using a simple gradient descent algorithm with a constant step size and random initialization instead. Additionally, our overall algorithm achieves nearly linear sample complexity, $\mathcal{O}(d\,\mathrm{polylog}(d))$.
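The oracle the abstract describes — plain gradient descent with a constant step size and random initialization on the amplitude least-squares loss — can be sketched on clean measurements. This is our own illustration with a few random restarts for robustness to bad initializations; the paper's handling of corrupted measurements is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)
A = rng.normal(size=(n, d))
y = np.abs(A @ theta)                       # magnitude-only measurements

# Gradient descent on f(x) = mean((|Ax| - y)^2) / 2, constant step,
# random init, best of a few restarts.
def amplitude_gd(A, y, steps=3000, step=0.1, restarts=5):
    best_x, best_loss = None, np.inf
    for _ in range(restarts):
        x = rng.normal(size=A.shape[1])
        for _ in range(steps):
            Ax = A @ x
            grad = (A.T @ ((np.abs(Ax) - y) * np.sign(Ax))) / len(y)
            x -= step * grad
        loss = np.mean((np.abs(A @ x) - y) ** 2)
        if loss < best_loss:
            best_x, best_loss = x, loss
    return best_x

x_hat = amplitude_gd(A, y)
err = min(np.linalg.norm(x_hat - theta), np.linalg.norm(x_hat + theta))
```

Note the sign ambiguity inherent to magnitude-only measurements: recovery is only possible up to a global sign, hence the `min` over ±θ in the error.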

[LG-120] Urban traffic analysis and forecasting through shared Koopman eigenmodes

链接: https://arxiv.org/abs/2409.04728
作者: Chuhan Yang,Fares B. Mehouachi,Monica Menendez,Saif Eddin Jabari
关键词-EN: Predicting traffic flow, constrained Hankelized DMD, Dynamic Mode Decomposition, limited historical data, Predicting traffic
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Predicting traffic flow in data-scarce cities is challenging due to limited historical data. To address this, we leverage transfer learning by identifying periodic patterns common to data-rich cities using a customized variant of Dynamic Mode Decomposition (DMD): constrained Hankelized DMD (TrHDMD). This method uncovers common eigenmodes (urban heartbeats) in traffic patterns and transfers them to data-scarce cities, significantly enhancing prediction performance. TrHDMD reduces the need for extensive training datasets by utilizing prior knowledge from other cities. By applying Koopman operator theory to multi-city loop detector data, we identify stable, interpretable, and time-invariant traffic modes. Injecting "urban heartbeats" into forecasting tasks improves prediction accuracy and has the potential to enhance traffic management strategies for cities with varying data infrastructures. Our work introduces cross-city knowledge transfer via shared Koopman eigenmodes, offering actionable insights and reliable forecasts for data-scarce urban environments.

[LG-121] A Comprehensive Survey on Evidential Deep Learning and Its Applications

链接: https://arxiv.org/abs/2409.04720
作者: Junyu Gao,Mengyuan Chen,Liangyu Xiang,Changsheng Xu
关键词-EN: Reliable uncertainty estimation, deep learning algorithms, uncertainty estimation, medical diagnosis, Reliable uncertainty
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reliable uncertainty estimation has become a crucial requirement for the industrial deployment of deep learning algorithms, particularly in high-risk applications such as autonomous driving and medical diagnosis. However, mainstream uncertainty estimation methods, based on deep ensembling or Bayesian neural networks, generally impose substantial computational overhead. To address this challenge, a novel paradigm called Evidential Deep Learning (EDL) has emerged, providing reliable uncertainty estimation with minimal additional computation in a single forward pass. This survey provides a comprehensive overview of the current research on EDL, designed to offer readers a broad introduction to the field without assuming prior knowledge. Specifically, we first delve into the theoretical foundation of EDL, the subjective logic theory, and discuss its distinctions from other uncertainty estimation frameworks. We further present existing theoretical advancements in EDL from four perspectives: reformulating the evidence collection process, improving uncertainty estimation via OOD samples, delving into various training strategies, and evidential regression networks. Thereafter, we elaborate on its extensive applications across various machine learning paradigms and downstream tasks. In the end, an outlook on future directions for better performances and broader adoption of EDL is provided, highlighting potential research avenues.
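
The subjective-logic core of EDL is simple enough to show in a few lines: per-class evidence defines a Dirichlet distribution whose total strength yields belief masses and a single "vacuity" uncertainty in one forward pass. A minimal sketch using the standard EDL formulas, not any specific implementation from the surveyed works:

```python
import numpy as np

def edl_belief_and_uncertainty(evidence):
    """Map non-negative per-class evidence (e.g. a ReLU'd network output)
    to subjective-logic belief masses and an uncertainty mass.
    The belief masses and the uncertainty sum to one."""
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    alpha = evidence + 1.0        # Dirichlet concentration parameters
    S = alpha.sum()               # Dirichlet strength
    belief = evidence / S
    uncertainty = K / S           # vacuity: high when total evidence is low
    return belief, uncertainty

# Confident in-distribution sample vs. an evidence-poor (OOD-like) sample.
b_conf, u_conf = edl_belief_and_uncertainty([90.0, 1.0, 1.0])
b_ood, u_ood = edl_belief_and_uncertainty([0.2, 0.1, 0.1])
```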

[LG-122] Cross-Organ Domain Adaptive Neural Network for Pancreatic Endoscopic Ultrasound Image Segmentation

链接: https://arxiv.org/abs/2409.04718
作者: ZhiChao Yan,Hui Xue,Yi Zhu,Bin Xiao,Hao Yuan
关键词-EN: effective diagnosis, pancreatic EUS images, EUS images, universal network, crisp EUS images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate segmentation of lesions in pancreatic endoscopic ultrasound (EUS) images is crucial for effective diagnosis and treatment. However, collecting enough crisp EUS images for effective diagnosis is arduous. Recently, domain adaptation (DA) has been employed to address these challenges by leveraging related knowledge from other domains. Most DA methods only focus on multi-view representations of the same organ, which still makes it difficult to clearly depict the tumor lesion area with limited semantic information. Although transferring homogeneous similarity from different organs could alleviate the issue, there is a lack of relevant work due to the enormous domain gap between them. To address these challenges, we propose the Cross-Organ Tumor Segmentation Networks (COTS-Nets), consisting of a universal network and an auxiliary network. The universal network utilizes boundary loss to learn common boundary information of different tumors, enabling accurate delineation of tumors in EUS despite limited and low-quality data. Simultaneously, we incorporate consistency loss in the universal network to align the prediction of pancreatic EUS with tumor boundaries from other organs to mitigate the domain gap. To further reduce the cross-organ domain gap, the auxiliary network integrates multi-scale features from different organs, aiding the universal network in acquiring domain-invariant knowledge. Systematic experiments demonstrate that COTS-Nets significantly improves the accuracy of pancreatic cancer diagnosis. Additionally, we developed the Pancreatic Cancer Endoscopic Ultrasound (PCEUS) dataset, comprising 501 pathologically confirmed pancreatic EUS images, to facilitate model development.

[LG-123] Harnessing physics-informed operators for high-dimensional reliability analysis problems

链接: https://arxiv.org/abs/2409.04708
作者: N Navaneeth,Tushar,Souvik Chakraborty
关键词-EN: formidable task, Reliability analysis, large number, number of stochastic, reliability analysis problems
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliability analysis is a formidable task, particularly in systems with a large number of stochastic parameters. Conventional methods for quantifying reliability often rely on extensive simulations or experimental data, which can be costly and time-consuming, especially when dealing with systems governed by complex physical laws, which necessitate computationally intensive numerical methods such as finite element or finite volume techniques. On the other hand, surrogate-based methods offer an efficient alternative for computing reliability by approximating the underlying model from limited data. Neural operators have recently emerged as effective surrogates for modelling physical systems governed by partial differential equations. These operators can learn solutions to PDEs for varying inputs and parameters. Here, we investigate the efficacy of the recently developed physics-informed wavelet neural operator in solving reliability analysis problems. In particular, we investigate the possibility of using the physics-informed operator for solving high-dimensional reliability analysis problems while bypassing the need for any simulation. Through four numerical examples, we illustrate that the physics-informed operator can seamlessly solve high-dimensional reliability analysis problems with reasonable accuracy, while eliminating the need for running expensive simulations.
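
The basic structure of surrogate-based reliability analysis — estimate a small failure probability by Monte Carlo over a cheap stand-in for the expensive solver — can be sketched as below. Here the "surrogate" is an exact toy limit state g(x) = 3 − x; in the paper's setting it would be a physics-informed wavelet neural operator evaluated in place of a finite element run:

```python
import numpy as np

rng = np.random.default_rng(2)

def g_surrogate(x):
    """Toy limit state: failure when g(x) <= 0, i.e. x >= 3.
    For X ~ N(0, 1), the exact failure probability is 1 - Phi(3) ~ 1.35e-3."""
    return 3.0 - x

n = 2_000_000                    # a cheap surrogate makes many samples affordable
x = rng.standard_normal(n)
p_fail = float(np.mean(g_surrogate(x) <= 0.0))
```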

[LG-124] Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

链接: https://arxiv.org/abs/2409.04707
作者: Yuhan Ma,Dan Sun,Erdi Gao,Ningjing Sang,Iris Li,Guanming Huang
关键词-EN: optimal system performance, pivotal scientific instrument, achieving optimal system, Optimization theory serves, Optimization theory
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimization theory serves as a pivotal scientific instrument for achieving optimal system performance, with its origins in economic applications to identify the best investment strategies for maximizing benefits. Over the centuries, from the geometric inquiries of ancient Greece to the calculus contributions by Newton and Leibniz, optimization theory has significantly advanced. The persistent work of scientists like Lagrange, Cauchy, and von Neumann has fortified its progress. The modern era has seen an unprecedented expansion of optimization theory applications, particularly with the growth of computer science, enabling more sophisticated computational practices and widespread utilization across engineering, decision analysis, and operations research. This paper delves into the profound relationship between optimization theory and deep learning, highlighting the omnipresence of optimization problems in the latter. We explore the gradient descent algorithm and its variants, which are the cornerstone of optimizing neural networks. The chapter introduces an enhancement to the SGD optimizer, drawing inspiration from numerical optimization methods, aiming to enhance interpretability and accuracy. Our experiments on diverse deep learning tasks substantiate the improved algorithm’s efficacy. The paper concludes by emphasizing the continuous development of optimization theory and its expanding role in solving intricate problems, enhancing computational capabilities, and informing better policy decisions.
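
The gradient-descent family discussed above is easy to demonstrate concretely: on a badly conditioned quadratic, adding heavy-ball momentum to plain constant-step gradient descent speeds convergence dramatically. A self-contained sketch (a textbook variant, not the paper's specific SGD enhancement):

```python
import numpy as np

# f(x) = 0.5 * (x1^2 + 100 * x2^2); the gradient is elementwise scaling.
grad = lambda x: np.array([1.0, 100.0]) * x

def run(step, momentum, iters=200):
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(iters):
        v = momentum * v - step * grad(x)   # heavy-ball velocity update
        x = x + v
    return float(np.linalg.norm(x))

plain = run(step=0.015, momentum=0.0)
heavy = run(step=0.015, momentum=0.8)       # same step size, plus momentum
```

After 200 iterations the momentum run is many orders of magnitude closer to the optimum than plain gradient descent at the same step size.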

[LG-125] A Multi-scenario Attention-based Generative Model for Personalized Blood Pressure Time Series Forecasting

链接: https://arxiv.org/abs/2409.04704
作者: Cheng Wan,Chenjie Xie,Longfei Liu,Dan Wu,Ye Li
关键词-EN: Continuous blood pressure, critical care settings, blood pressure, monitoring is essential, essential for timely
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Continuous blood pressure (BP) monitoring is essential for timely diagnosis and intervention in critical care settings. However, BP varies significantly across individuals; this inter-patient variability motivates the development of personalized models tailored to each patient’s physiology. In this work, we propose a personalized BP forecasting model mainly using electrocardiogram (ECG) and photoplethysmogram (PPG) signals. This time-series model incorporates 2D representation learning to capture complex physiological relationships. Experiments are conducted on datasets collected from three diverse scenarios with BP measurements from 60 subjects total. Results demonstrate that the model achieves accurate and robust BP forecasts across scenarios within the Association for the Advancement of Medical Instrumentation (AAMI) standard criteria. This reliable early detection of abnormal fluctuations in BP is crucial for at-risk patients undergoing surgery or intensive care. The proposed model provides a valuable addition for continuous BP tracking to reduce mortality and improve prognosis.

[LG-126] Hierarchical Sparse Representation Clustering for High-Dimensional Data Streams

链接: https://arxiv.org/abs/2409.04698
作者: Jie Chen,Hua Mao,Yuanbiao Gou,Xi Peng
关键词-EN: high-dimensional data streams, unbounded data sequences, potentially unbounded data, data streams, Data
类目: Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Data stream clustering reveals patterns within continuously arriving, potentially unbounded data sequences. Numerous data stream algorithms have been proposed to cluster data streams. The existing data stream clustering algorithms still face significant challenges when addressing high-dimensional data streams. First, it is intractable to measure the similarities among high-dimensional data objects via Euclidean distances when constructing and merging microclusters. Second, these algorithms are highly sensitive to the noise contained in high-dimensional data streams. In this paper, we propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams. HSRC first employs an \ell_1-minimization technique to learn an affinity matrix for data objects in individual landmark windows with fixed sizes, where the number of neighboring data objects is automatically selected. This approach ensures that highly correlated data samples within clusters are grouped together. Then, HSRC applies a spectral clustering technique to the affinity matrix to generate microclusters. These microclusters are subsequently merged into macroclusters based on their sparse similarity degrees (SSDs). Additionally, HSRC introduces sparsity residual values (SRVs) to adaptively select representative data objects from the current landmark window. These representatives serve as dictionary samples for the next landmark window. Finally, HSRC refines each macrocluster through fine-tuning. In particular, HSRC enables the detection of outliers in high-dimensional data streams via the associated SRVs. The experimental results obtained on several benchmark datasets demonstrate the effectiveness and robustness of HSRC.
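
The spectral step that turns a learned affinity matrix into microclusters can be sketched with a tiny hand-built affinity (the sparse-coding stage is omitted): two dense blocks joined by one weak edge, with the graph Laplacian's Fiedler vector separating the clusters by sign:

```python
import numpy as np

# Affinity with two dense blocks and one weak cross-link, standing in for
# a learned sparse-representation affinity matrix.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1          # weak cross-cluster link

L = np.diag(W.sum(axis=1)) - W   # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = eigvecs[:, 1]          # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)     # its sign pattern is a 2-way clustering
```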

[LG-127] QueryBuilder: Human-in-the-Loop Query Development for Information Retrieval

链接: https://arxiv.org/abs/2409.04667
作者: Hemanth Kandula,Damianos Karakos,Haoling Qiu,Benjamin Rozonoyer,Ian Soboroff,Lee Tarlin,Bonan Min
关键词-EN: define finer-grained queries, finer-grained queries covering, cross-lingual information retrieval, Information Retrieval, important aspects
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Frequently, users of an Information Retrieval (IR) system start with an overarching information need (a.k.a., an analytic task) and proceed to define finer-grained queries covering various important aspects (i.e., sub-topics) of that analytic task. We present a novel, interactive system called QueryBuilder, which allows a novice, English-speaking user to create queries with a small amount of effort, through efficient exploration of an English development corpus in order to rapidly develop cross-lingual information retrieval queries corresponding to the user’s information needs. QueryBuilder performs near real-time retrieval of documents based on user-entered search terms; the user looks through the retrieved documents and marks sentences as relevant to the information needed. The marked sentences are used by the system as additional information in query formation and refinement: query terms (and, optionally, event features, which capture event ‘triggers’ (indicator terms) and agent/patient roles) are appropriately weighted, and a neural-based system, which better captures textual meaning, retrieves other relevant content. The process of retrieval and marking is repeated as many times as desired, giving rise to increasingly refined queries in each iteration. The final product is a fine-grained query used in Cross-Lingual Information Retrieval (CLIR). Our experiments using analytic tasks and requests from the IARPA BETTER IR datasets show that with a small amount of effort (at most 10 minutes per sub-topic), novice users can form useful fine-grained queries including in languages they don’t understand. QueryBuilder also provides beneficial capabilities to the traditional corpus exploration and query formation process. A demonstration video is released at this https URL

[LG-128] IIFE: Interaction Information Based Automated Feature Engineering ICDM

链接: https://arxiv.org/abs/2409.04665
作者: Tom Overman,Diego Klabjan,Jean Utke
关键词-EN: Automated feature engineering, Automated feature, improve downstream predictive, downstream predictive performance, feature engineering
类目: Machine Learning (cs.LG)
*备注: Accepted to International Conference on Data Mining (ICDM) 2024 Abu Dhabi

点击查看摘要

Abstract:Automated feature engineering (AutoFE) is the process of automatically building and selecting new features that help improve downstream predictive performance. While traditional feature engineering requires significant domain expertise and time-consuming iterative testing, AutoFE strives to make feature engineering easy and accessible to all data science practitioners. We introduce a new AutoFE algorithm, IIFE, based on determining which feature pairs synergize well through an information-theoretic perspective called interaction information. We demonstrate the superior performance of IIFE over existing algorithms. We also show how interaction information can be used to improve existing AutoFE algorithms. Finally, we highlight several critical experimental setup issues in the existing AutoFE literature and their effects on performance.
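
Interaction information itself is a short computation from joint entropies. The XOR example below is the canonical synergy case — neither feature alone carries information about the target, but the pair does (sign conventions for interaction information vary; here positive means synergy):

```python
import numpy as np
from itertools import product
from collections import Counter

def H(*cols):
    """Empirical joint Shannon entropy (bits) of the given sample columns."""
    counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# All four (X, Y) combinations of two fair bits; target Z = X XOR Y.
X, Y = map(np.array, zip(*product([0, 1], repeat=2)))
Z = X ^ Y

mi_xy = H(X) + H(Y) - H(X, Y)                        # I(X;Y) = 0
cmi_xy_z = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z)     # I(X;Y|Z) = 1 bit
interaction = cmi_xy_z - mi_xy                       # pure synergy: +1 bit
```

A feature-pair score like `interaction` is what lets an AutoFE method rank candidate pairs for combination before building any new feature.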

[LG-129] Generalization vs. Memorization in the Presence of Statistical Biases in Transformers

链接: https://arxiv.org/abs/2409.04654
作者: John Mitros,Damien Teney
关键词-EN: study aims, aims to understand, ability to generalize, generalize to in-distribution, statistical biases affect
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This study aims to understand how statistical biases affect the model’s ability to generalize to in-distribution and out-of-distribution data on algorithmic tasks. Prior research indicates that transformers may inadvertently learn to rely on these spurious correlations, leading to an overestimation of their generalization capabilities. To investigate this, we evaluate transformer models on several synthetic algorithmic tasks, systematically introducing and varying the presence of these biases. We also analyze how different components of the transformer models impact their generalization. Our findings suggest that statistical biases impair the model’s performance on out-of-distribution data, leading to an overestimation of its generalization capabilities. The models rely heavily on these spurious correlations for inference, as indicated by their performance on tasks that include such biases.

[LG-130] Privacy-Preserving Race/Ethnicity Estimation for Algorithmic Bias Measurement in the U.S KR

链接: https://arxiv.org/abs/2409.04652
作者: Saikrishna Badrinarayanan,Osonde Osoba,Miao Cheng,Ryan Rogers,Sakshi Jain,Rahul Tandra,Natesh S. Pillai
关键词-EN: including tests, equal treatment, tests for equal, form of disaggregated, disaggregated evaluations
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Saikrishna Badrinarayanan and Osonde Osoba contributed equally to this work

点击查看摘要

Abstract:AI fairness measurements, including tests for equal treatment, often take the form of disaggregated evaluations of AI systems. Such measurements are an important part of Responsible AI operations. These measurements compare system performance across demographic groups or sub-populations and typically require member-level demographic signals such as gender, race, ethnicity, and location. However, sensitive member-level demographic attributes like race and ethnicity can be challenging to obtain and use due to platform choices, legal constraints, and cultural norms. In this paper, we focus on the task of enabling AI fairness measurements on race/ethnicity for U.S. LinkedIn members in a privacy-preserving manner. We present the Privacy-Preserving Probabilistic Race/Ethnicity Estimation (PPRE) method for performing this task. PPRE combines the Bayesian Improved Surname Geocoding (BISG) model, a sparse LinkedIn survey sample of self-reported demographics, and privacy-enhancing technologies like secure two-party computation and differential privacy to enable meaningful fairness measurements while preserving member privacy. We provide details of the PPRE method and its privacy guarantees. We then illustrate sample measurement operations. We conclude with a review of open research and engineering challenges for expanding our privacy-preserving fairness measurement capabilities.
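
The BISG building block mentioned above is a one-line Bayes update. A sketch with made-up illustrative numbers (the real method uses census surname and geography tables):

```python
import numpy as np

# Hypothetical probabilities for three coarse categories.
p_race_given_surname = np.array([0.70, 0.20, 0.10])   # P(race | surname)
p_tract_given_race = np.array([0.02, 0.10, 0.05])     # P(census tract | race)

# BISG posterior: P(race | surname, tract) is proportional to
# P(race | surname) * P(tract | race).
unnormalized = p_race_given_surname * p_tract_given_race
posterior = unnormalized / unnormalized.sum()
```

With these numbers, the geography evidence flips the estimate: category 1 overtakes category 0 despite the surname prior favoring category 0.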

[LG-131] Stacked Universal Successor Feature Approximators for Safety in Reinforcement Learning

链接: https://arxiv.org/abs/2409.04641
作者: Ian Cannon,Washington Garcia,Thomas Gresavage,Joseph Saurine,Ian Leong,Jared Culbertson
关键词-EN: complex objective structures, involve complex objective, problems often involve, involve complex, structures that resist
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:Real-world problems often involve complex objective structures that resist distillation into reinforcement learning environments with a single objective. Operation costs must be balanced with multi-dimensional task performance and end-states’ effects on future availability, all while ensuring safety for other agents in the environment and the reinforcement learning agent itself. System redundancy through secondary backup controllers has proven to be an effective method to ensure safety in real-world applications where the risk of violating constraints is extremely high. In this work, we investigate the utility of a stacked, continuous-control variation of universal successor feature approximation (USFA) adapted for soft actor-critic (SAC) and coupled with a suite of secondary safety controllers, which we call stacked USFA for safety (SUSFAS). Our method improves performance on secondary objectives compared to SAC baselines using an intervening secondary controller such as a runtime assurance (RTA) controller.

[LG-132] Notes on Sampled Gaussian Mechanism

链接: https://arxiv.org/abs/2409.04636
作者: Nikita P. Kalinin
关键词-EN: Private Stochastic Optimization, Large Batch Sizes, Batch Sizes Work, Differentially Private Stochastic, recent conjecture posed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In these notes, we prove a recent conjecture posed in the paper by Räisä, O. et al. [Subsampling is not Magic: Why Large Batch Sizes Work for Differentially Private Stochastic Optimization (2024)]. Theorem 6.2 of the paper asserts that for the Sampled Gaussian Mechanism (a composition of subsampling and additive Gaussian noise), the effective noise level, \sigma_{\text{eff}} = \frac{\sigma(q)}{q}, decreases as a function of the subsampling rate q. Consequently, larger subsampling rates are preferred for better privacy-utility trade-offs. Our notes provide a rigorous proof of Conjecture 6.3, which was left unresolved in the original paper, thereby completing the proof of Theorem 6.2.
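
The mechanism itself is easy to simulate: include each record with probability q, add N(0, σ²) noise to the sum, and rescale by 1/q. In this simplified illustration σ is held fixed rather than calibrated to a privacy level as in the paper, but it shows why the effective noise on the rescaled estimate involves division by q:

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.ones(1000)                 # each record contributes 1; true sum = 1000
sigma, trials = 4.0, 20000

def sgm_sum_estimates(q):
    """Sampled Gaussian Mechanism on a sum query, rescaled to be unbiased."""
    mask = (rng.random((trials, data.size)) < q).astype(float)
    noisy_sums = mask @ data + sigma * rng.standard_normal(trials)
    return noisy_sums / q

std_small_q = float(sgm_sum_estimates(0.1).std())
std_large_q = float(sgm_sum_estimates(0.5).std())
```

Both the subsampling variance and the rescaled Gaussian noise shrink as q grows, so the larger sampling rate yields the less noisy estimate.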

[LG-133] Detection of False Data Injection Attacks (FDIA) on Power Dynamical Systems With a State Prediction Method

链接: https://arxiv.org/abs/2409.04609
作者: Abhijeet Sahu,Truc Nguyen,Kejun Chen,Xiangyu Zhang,Malik Hassanaly
关键词-EN: false data injection, growing cyber-security concern, data injection attacks, FDIA, FDIA detection method
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:With the deeper penetration of inverter-based resources in power systems, false data injection attacks (FDIA) are a growing cyber-security concern. They have the potential to disrupt the system’s stability like frequency stability, thereby leading to catastrophic failures. Therefore, an FDIA detection method would be valuable to protect power systems. FDIAs typically induce a discrepancy between the desired and the effective behavior of the power system dynamics. A suitable detection method can leverage power dynamics predictions to identify whether such a discrepancy was induced by an FDIA. This work investigates the efficacy of temporal and spatio-temporal state prediction models, such as Long Short-Term Memory (LSTM) and a combination of Graph Neural Networks (GNN) with LSTM, for predicting frequency dynamics in the absence of an FDIA but with noisy measurements, and thereby identify FDIA events. For demonstration purposes, the IEEE 39 New England Kron-reduced model simulated with a swing equation is considered. It is shown that the proposed state prediction models can be used as a building block for developing an effective FDIA detection method that can maintain high detection accuracy across various attack and deployment settings. It is also shown how the FDIA detection should be deployed to limit its exposure to detection inaccuracies and mitigate its computational burden.
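
The detection principle — compare measurements against a state prediction and flag large residuals — reduces to a few lines once a predictor exists. In this sketch the "predictor" is simply the known clean dynamics; the paper trains LSTM / GNN+LSTM models for that role:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(500)
clean = 60.0 + 0.05 * np.sin(2 * np.pi * t / 100)   # toy frequency trace (Hz)
noise_std = 0.002
measured = clean + noise_std * rng.standard_normal(t.size)
measured[300:320] += 0.05                           # injected false data (FDIA)

predicted = clean                                   # stand-in state prediction
residual = np.abs(measured - predicted)
flags = residual > 6 * noise_std                    # 6-sigma residual test
```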

[LG-134] Whittle Index Learning Algorithms for Restless Bandits with Constant Stepsizes

链接: https://arxiv.org/abs/2409.04605
作者: Vishesh Mittal,Rahul Meshram,Surya Prakash
关键词-EN: index learning, learning, index, Q-learning, index learning algorithm
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 14 pages

点击查看摘要

Abstract:We study the Whittle index learning algorithm for restless multi-armed bandits. We consider an index learning algorithm based on Q-learning. We first present the Q-learning algorithm with exploration policies – epsilon-greedy, softmax, and epsilon-softmax – with constant stepsizes. We then extend the study of Q-learning to index learning for the single-armed restless bandit. The index learning algorithm is a two-timescale variant of stochastic approximation: on the slower timescale we update the index learning scheme, and on the faster timescale we update the Q-values assuming a fixed index value. The Q-learning updates are performed asynchronously. We analyze this two-timescale stochastic approximation scheme for index learning with constant stepsizes. Further, we present a study of index learning with deep Q-network (DQN) learning and with linear function approximation using a state-aggregation method. We illustrate the performance of our algorithms using numerical examples, showing that index learning with Q-learning, DQN, and function approximation learns the Whittle index.
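
The faster-timescale ingredient — tabular Q-learning with epsilon-greedy exploration and a constant stepsize, updated asynchronously at the visited state-action pair — can be sketched on a toy two-state problem (the index-learning outer loop is omitted):

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 2, 2
gamma, alpha, eps = 0.9, 0.1, 0.1      # discount, constant stepsize, exploration
Q = np.zeros((n_states, n_actions))

def env_step(s, a):
    reward = float(a)                  # action 1 always pays 1, action 0 pays 0
    return reward, int(rng.integers(n_states))   # uniform random next state

s = 0
for _ in range(20000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    r, s_next = env_step(s, a)
    # Asynchronous update: only the visited (s, a) entry changes.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

With discount 0.9 the optimal values are Q(s, 1) = 10 and Q(s, 0) = 9 in both states; the constant stepsize makes the iterates hover near these values rather than converge exactly.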

[LG-135] Detecting Buggy Contracts via Smart Testing

链接: https://arxiv.org/abs/2409.04597
作者: Sally Junsong Wang,Jianan Yao,Kexin Pei,Hidedaki Takahashi,Junfeng Yang
关键词-EN: critical vulnerabilities, smart contract, susceptible to critical, hybrid smart contract, smart contract dynamic
类目: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Smart contracts are susceptible to critical vulnerabilities. Hybrid dynamic analyses, such as concolic execution assisted fuzzing and foundation model assisted fuzzing, have emerged as highly effective testing techniques for smart contract bug detection recently. This hybrid approach has shown initial promise in real-world benchmarks, but it still suffers from low scalability to find deep bugs buried in complex code patterns. We observe that performance bottlenecks of existing dynamic analyses and model hallucination are two main factors limiting the scalability of this hybrid approach in finding deep bugs. To overcome the challenges, we design an interactive, self-deciding foundation model based system, called SmartSys, to support hybrid smart contract dynamic analyses. The key idea is to teach foundation models about performance bottlenecks of different dynamic analysis techniques, making it possible to forecast the right technique and generates effective fuzz targets that can reach deep, hidden bugs. To prune hallucinated, incorrect fuzz targets, SmartSys feeds foundation models with feedback from dynamic analysis during compilation and at runtime. The interesting results of SmartSys include: i) discovering a smart contract protocol vulnerability that has escaped eleven tools and survived multiple audits for over a year; ii) improving coverage by up to 14.3% on real-world benchmarks compared to the baselines.

[LG-136] CubicML: Automated ML for Distributed ML Systems Co-design with ML Prediction of Performance

链接: https://arxiv.org/abs/2409.04585
作者: Wei Wen,Quanyu Zhu,Weiwei Chu,Wen-Yen Chen,Jiyan Yang
关键词-EN: Scaling up deep, deep learning models, deep learning, machine learning, proven effective
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Scaling up deep learning models has been proven effective to improve intelligence of machine learning (ML) models, especially for industry recommendation models and large language models. The co-design of distributed ML systems and algorithms (to maximize training performance) plays a pivotal role for its success. As it scales, the number of co-design hyper-parameters grows rapidly which brings challenges to feasibly find the optimal setup for system performance maximization. In this paper, we propose CubicML which uses ML to automatically optimize training performance of distributed ML systems. In CubicML, we use a ML model as a proxy to predict the training performance for search efficiency and performance modeling flexibility. We proved that CubicML can effectively optimize training speed of in-house ads recommendation models and large language models at Meta.
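
The proxy-model idea — fit a cheap regressor on measured (configuration, throughput) pairs and rank unexplored configurations by prediction — can be sketched as follows; the configuration space, features, and "measured" throughput function here are all hypothetical stand-ins:

```python
import numpy as np

def measured_qps(cfg):
    """Hypothetical true system response for a (batch, workers) configuration."""
    batch, workers = cfg
    return 100 * np.log1p(batch) + 20 * workers - 0.5 * workers**2

# A handful of expensive real measurements.
observed = np.array([[8, 2], [16, 4], [32, 8], [64, 16], [128, 24]], dtype=float)
y = np.array([measured_qps(c) for c in observed])

def feats(X):
    """Simple feature basis for the proxy regressor."""
    b, w = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(b), np.log1p(b), w, w**2])

coef, *_ = np.linalg.lstsq(feats(observed), y, rcond=None)

# Rank unseen candidate configs by predicted throughput -- no runs needed.
candidates = np.array([[256, 12], [32, 40], [256, 20]], dtype=float)
best = candidates[int((feats(candidates) @ coef).argmax())]
```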

[LG-137] How Does Code Pretraining Affect Language Model Task Performance?

链接: https://arxiv.org/abs/2409.04556
作者: Jackson Petty,Sjoerd van Steenkiste,Tal Linzen
关键词-EN: Large language models, Large language, increasingly trained, language, Large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining corpora may improve performance on other, unrelated tasks, yet to date no work has been able to establish a causal connection by controlling between language and code data. Here we do just this. We pretrain language models on datasets which interleave natural language and code in two different settings: additive, in which the total volume of data seen during pretraining is held constant; and competitive, in which the volume of language data is held constant. We study how the pretraining mixture affects performance on (a) a diverse collection of tasks included in the BigBench benchmark, and (b) compositionality, measured by generalization accuracy on semantic parsing and syntactic transformations. We find that pretraining on higher proportions of code improves performance on compositional tasks involving structured output (like semantic parsing), and mathematics. Conversely, increasing the code mixture can harm performance on other tasks, including tasks that require sensitivity to linguistic structure, such as syntax or morphology, and tasks measuring real-world knowledge.

[LG-138] owards Hybrid Embedded Feature Selection and Classification Approach with Slim-TSF

链接: https://arxiv.org/abs/2409.04542
作者: Anli Ji,Chetraj Pandey,Berkay Aydin
关键词-EN: treating flare predictions, solar flare forecasting, flare forecasting approaches, classification problem, Traditional solar flare
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
*备注: This is a preprint accepted at the 26th International Conference on Big Data Analytics and Knowledge Discovery (DAWAK 2024)

点击查看摘要

Abstract:Traditional solar flare forecasting approaches have mostly relied on physics-based or data-driven models using solar magnetograms, treating flare predictions as a point-in-time classification problem. This approach has limitations, particularly in capturing the evolving nature of solar activity. Recognizing the limitations of traditional flare forecasting approaches, our research aims to uncover hidden relationships and the evolutionary characteristics of solar flares and their source regions. Our previously proposed Sliding Window Multivariate Time Series Forest (Slim-TSF) has shown the feasibility of usage applied on multivariate time series data. A significant aspect of this study is the comparative analysis of our updated Slim-TSF framework against the original model outcomes. Preliminary findings indicate a notable improvement, with an average increase of 5% in both the True Skill Statistic (TSS) and Heidke Skill Score (HSS). This enhancement not only underscores the effectiveness of our refined methodology but also suggests that our systematic evaluation and feature selection approach can significantly advance the predictive accuracy of solar flare forecasting models.
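
Both skill scores quoted above are simple functions of a flare/no-flare confusion matrix; the counts below are made up for illustration:

```python
def tss(tp, fn, fp, tn):
    """True Skill Statistic: hit rate minus false-alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)

def hss(tp, fn, fp, tn):
    """Heidke Skill Score: accuracy improvement over random chance."""
    n = tp + fn + fp + tn
    expected = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / n
    return (tp + tn - expected) / (n - expected)

score_tss = tss(tp=80, fn=20, fp=50, tn=850)
score_hss = hss(tp=80, fn=20, fp=50, tn=850)
```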

[LG-139] Operator Learning with Gaussian Processes

链接: https://arxiv.org/abs/2409.04538
作者: Carlos Mora,Amin Yousefpour,Shirin Hosseinmardi,Houman Owhadi,Ramin Bostanabad
关键词-EN: mathcal, dagger, Omega, Operator learning focuses, Operator learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Operator learning focuses on approximating mappings \mathcal{G}^\dagger:\mathcal{U}\rightarrow\mathcal{V} between infinite-dimensional spaces of functions, such as u: \Omega_u\rightarrow\mathbb{R} and v: \Omega_v\rightarrow\mathbb{R}. This makes it particularly suitable for solving parametric nonlinear partial differential equations (PDEs). While most machine learning methods for operator learning rely on variants of deep neural networks (NNs), recent studies have shown that Gaussian Processes (GPs) are also competitive while offering interpretability and theoretical guarantees. In this paper, we introduce a hybrid GP/NN-based framework for operator learning that leverages the strengths of both methods. Instead of approximating the function-valued operator \mathcal{G}^\dagger, we use a GP to approximate its associated real-valued bilinear form \widetilde{\mathcal{G}}^\dagger: \mathcal{U}\times\mathcal{V}^*\rightarrow\mathbb{R}. This bilinear form is defined by \widetilde{\mathcal{G}}^\dagger(u,\varphi) := [\varphi,\mathcal{G}^\dagger(u)], which allows us to recover the operator \mathcal{G}^\dagger through \mathcal{G}^\dagger(u)(y)=\widetilde{\mathcal{G}}^\dagger(u,\delta_y). The GP mean function can be zero or parameterized by a neural operator, and for each setting we develop a robust training mechanism based on maximum likelihood estimation (MLE) that can optionally leverage the physics involved. Numerical benchmarks show that (1) it improves the performance of a base neural operator by using it as the mean function of a GP, and (2) it enables zero-shot data-driven models for accurate predictions without prior training. Our framework also handles multi-output operators where \mathcal{G}^\dagger:\mathcal{U}\rightarrow\prod_{s=1}^{S}\mathcal{V}^s, and benefits from computational speed-ups via product kernel structures and Kronecker product matrix representations.
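
Because the GP's target is a real-valued map, the regression machinery is ordinary GP regression. A scalar-input sketch of the zero-mean posterior (in the framework, the inputs would be pairs of a function u and a functional φ, and the mean could be a neural operator instead of zero):

```python
import numpy as np

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel on scalar inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train)      # observed values of the target map
x_test = np.array([0.25, 0.75])

K = rbf(x_train, x_train) + 1e-6 * np.eye(x_train.size)   # jitter for stability
mean = rbf(x_test, x_train) @ np.linalg.solve(K, y_train)  # zero-mean GP posterior
```

The posterior mean closely interpolates sin(2πx), giving values near 1 and −1 at the two test points.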

[LG-140] Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages

链接: https://arxiv.org/abs/2409.04512
作者: Tejas Deshpande,Nidhi Kowtal,Raviraj Joshi
关键词-EN: paper introduces Chain, Chain of Translation, introduces Chain, Translation Prompting, paper introduces
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces Chain of Translation Prompting (CoTR), a novel strategy designed to enhance the performance of language models in low-resource languages. CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English. The specified task like generation, classification, or any other NLP function is then performed on the translated text, with the option to translate the output back to the original language if needed. All these steps are specified in a single prompt. We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi. The CoTR strategy is applied to various tasks, including sentiment analysis, hate speech classification, subject classification and text generation, and its efficacy is showcased by comparing it with regular prompting methods. Our results underscore the potential of translation-based prompting strategies to significantly improve multilingual LLM performance in low-resource languages, offering valuable insights for future research and applications. We specifically see the highest accuracy improvements with the hate speech detection task. The technique also has the potential to enhance the quality of synthetic data generation for underrepresented languages using LLMs.
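A minimal sketch of what a single-prompt CoTR template could look like. The wording and function name below are our own illustration, not the authors' exact prompt:

```python
def cotr_prompt(text: str, task: str, src_lang: str = "Marathi",
                pivot: str = "English") -> str:
    """Assemble a single Chain-of-Translation prompt: translate the input,
    perform the task on the translation, then translate the answer back."""
    return (
        f"Step 1: Translate the following {src_lang} text to {pivot}.\n"
        f"Text: {text}\n"
        f"Step 2: Using the {pivot} translation, perform this task: {task}\n"
        f"Step 3: Translate the final answer back to {src_lang}."
    )

# All three steps live in one prompt, as described in the abstract
prompt = cotr_prompt("<marathi sentence>",
                     "Classify the sentiment as positive or negative.")
```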

[LG-141] Learning to Solve Combinatorial Optimization under Positive Linear Constraints via Non-Autoregressive Neural Networks

链接: https://arxiv.org/abs/2409.04495
作者: Runzhong Wang,Yang Li,Junchi Yan,Xiaokang Yang
关键词-EN: Combinatorial optimization, applied mathematics, computer science, intersection of computer, Combinatorial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: English version of the same paper published on Scientia Sinica Informationis

点击查看摘要

Abstract:Combinatorial optimization (CO) is the fundamental problem at the intersection of computer science, applied mathematics, etc. The inherent hardness in CO problems brings up challenges for solving CO exactly, making deep-neural-network-based solvers a research frontier. In this paper, we design a family of non-autoregressive neural networks to solve CO problems under positive linear constraints with the following merits. First, the positive linear constraint covers a wide range of CO problems, indicating that our approach breaks the generality bottleneck of existing non-autoregressive networks. Second, compared to existing autoregressive neural network solvers, our non-autoregressive networks have the advantages of higher efficiency and preserving permutation invariance. Third, our offline unsupervised learning has a lower demand for high-quality labels, removing the need for optimal labels in supervised learning. Fourth, our online differentiable search method significantly improves the generalizability of our neural network solver to unseen problems. We validate the effectiveness of this framework in solving representative CO problems including facility location, max-set covering, and traveling salesman problem. Our non-autoregressive neural solvers are competitive with and can even be superior to state-of-the-art solvers such as SCIP and Gurobi, especially when both efficiency and efficacy are considered. Code is available at this https URL

[LG-142] Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

链接: https://arxiv.org/abs/2409.04478
作者: Maheep Chaudhary,Atticus Geiger
关键词-EN: high-dimensional sparse autoencoders, train high-dimensional sparse, sparse autoencoders, popular new method, method in mechanistic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: this https URL

[LG-143] Learning in Order! A Sequential Strategy to Learn Invariant Features for Multimodal Sentiment Analysis

链接: https://arxiv.org/abs/2409.04473
作者: Xianbing Zhao,Lizhen Qu,Tao Feng,Jianfei Cai,Buzhou Tang
关键词-EN: simple sequential learning, multimodal sentiment analysis, simple sequential, sequential learning strategy, sentiment analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained either in a single source domain or multiple source domains using our learning strategy. This strategy starts with learning domain invariant features from text, followed by learning sparse domain-agnostic features from videos, assisted by the selected features learned in text. Our experimental results demonstrate that our model achieves significantly better performance than the state-of-the-art approaches on average in both single-source and multi-source settings. Our feature selection procedure favors the features that are independent of each other and are strongly correlated with their polarity labels. To facilitate research on this topic, the source code of this work will be publicly available upon acceptance.

[LG-144] State and Action Factorization in Power Grids

链接: https://arxiv.org/abs/2409.04467
作者: Gianvito Losapio,Davide Beretta,Marco Mussi,Alberto Maria Metelli,Marcello Restelli
关键词-EN: renewable energy generation, controlling power grids, operating power grids, power grids, increase of renewable
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increase of renewable energy generation towards the zero-emission target is making the problem of controlling power grids more and more challenging. The recent series of competitions Learning To Run a Power Network (L2RPN) have encouraged the use of Reinforcement Learning (RL) for the assistance of human dispatchers in operating power grids. All the solutions proposed so far severely restrict the action space and are based on a single agent acting on the entire grid or multiple independent agents acting at the substations level. In this work, we propose a domain-agnostic algorithm that estimates correlations between state and action components entirely based on data. Highly correlated state-action pairs are grouped together to create simpler, possibly independent subproblems that can lead to distinct learning processes with less computational and data requirements. The algorithm is validated on a power grid benchmark obtained with the Grid2Op simulator that has been used throughout the aforementioned competitions, showing that our algorithm is in line with domain-expert analysis. Based on these results, we lay a theoretically-grounded foundation for using distributed reinforcement learning in order to improve the existing solutions.
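The correlation-based grouping idea can be sketched as follows. This toy version uses absolute Pearson correlation and connected components on the thresholded correlation graph; the paper's data-driven algorithm is more involved, and all names here are our own:

```python
import numpy as np

def factor_state_action(S, A, thresh=0.5):
    """Group state and action components by absolute Pearson correlation.

    S: (T, ns) state samples, A: (T, na) action samples. Returns
    (state indices, action indices) groups found as connected components
    of the thresholded correlation graph.
    """
    ns, na = S.shape[1], A.shape[1]
    X = np.hstack([S, A])
    C = np.abs(np.corrcoef(X, rowvar=False)) >= thresh
    n = ns + na
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # DFS over the correlation graph
            i = stack.pop()
            if i in comp:
                continue
            comp.add(i)
            stack.extend(j for j in range(n) if C[i, j] and j not in comp)
        seen |= comp
        groups.append((sorted(k for k in comp if k < ns),
                       sorted(k - ns for k in comp if k >= ns)))
    return groups

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
S = np.column_stack([a, b])                              # two independent state vars
A = np.column_stack([a + 0.05 * rng.normal(size=200),    # action 0 tracks state 0
                     b + 0.05 * rng.normal(size=200)])   # action 1 tracks state 1
groups = factor_state_action(S, A)
```

On this constructed example the method splits the problem into two independent state-action subproblems, which is the kind of factorization the paper exploits.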

[LG-145] Leveraging Large Language Models for Solving Rare MIP Challenges

链接: https://arxiv.org/abs/2409.04464
作者: Teng Wang,Wing-Yin Yu,Ruifeng She,Wenhan Yang,Taijie Chen,Jianping Zhang
关键词-EN: Mixed Integer Programming, Mixed Integer, Integer Programming, tight time constraints, areas requiring mathematical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Mixed Integer Programming (MIP) has been extensively applied in areas requiring mathematical solvers to address complex instances within tight time constraints. However, as the problem scale increases, the complexity of model formulation and finding feasible solutions escalates significantly. In contrast, the model-building cost for end-to-end models, such as large language models (LLMs), remains largely unaffected by problem scale due to their pattern recognition capabilities. While LLMs, like GPT-4, without fine-tuning, can handle some traditional medium-scale MIP problems, they struggle with uncommon or highly specialized MIP scenarios. Fine-tuning LLMs can yield some feasible solutions for medium-scale MIP instances, but these models typically fail to explore diverse solutions when constrained by a low and constant temperature, limiting their performance. In this paper, we propose and evaluate a recursively dynamic temperature method integrated with a chain-of-thought approach. Our findings show that starting with a high temperature and gradually lowering it leads to better feasible solutions compared to other dynamic temperature strategies. Additionally, by comparing results generated by the LLM with those from Gurobi, we demonstrate that the LLM can produce solutions that complement traditional solvers by accelerating the pruning process and improving overall efficiency.
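A toy version of the start-high, cool-down temperature idea described above (the paper's exact recursion and constants may differ; the schedule below is our own sketch):

```python
def temperature_schedule(t_high=1.2, t_low=0.2, steps=6, decay=0.7):
    """Geometrically decaying temperatures: start hot so the LLM explores
    diverse candidate solutions, then cool down to refine feasible ones."""
    temps, t = [], t_high
    for _ in range(steps):
        temps.append(round(max(t, t_low), 4))
        t *= decay
    return temps

# e.g. pass temps[i] to the LLM sampling call at refinement round i
temps = temperature_schedule()
```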

[LG-146] Discovering Governing equations from Graph-Structured Data by Sparse Identification of Nonlinear Dynamical Systems

链接: https://arxiv.org/abs/2409.04463
作者: Mohammad Amin Basiri,Sina Khanmohammadi
关键词-EN: revolutionizing computational modeling, enabling direct extraction, machine learning, revolutionizing computational, combination of machine
类目: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The combination of machine learning (ML) and sparsity-promoting techniques is enabling direct extraction of governing equations from data, revolutionizing computational modeling in diverse fields of science and engineering. The discovered dynamical models could be used to address challenges in climate science, neuroscience, ecology, finance, epidemiology, and beyond. However, most existing sparse identification methods for discovering dynamical systems treat the whole system as one without considering the interactions between subsystems. As a result, such models are not able to capture small changes in the emergent system behavior. To address this issue, we developed a new method called Sparse Identification of Nonlinear Dynamical Systems from Graph-structured data (SINDyG), which incorporates the network structure into sparse regression to identify model parameters that explain the underlying network dynamics. SINDyG discovers the governing equations of network dynamics while offering improvements in accuracy and model simplicity.
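The sparse-identification step at the core of SINDy-style methods can be illustrated on a scalar toy system. SINDyG would additionally include network-coupling terms in the candidate library, which are omitted in this sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy SINDy: recover dx/dt = -2x from data by a sparse fit over a
# candidate library [x, x^2, x^3].
t = np.linspace(0, 2, 400)
x = np.exp(-2 * t)               # trajectory of dx/dt = -2x, x(0) = 1
dxdt = -2 * x                    # analytic derivative to keep the toy clean

library = np.column_stack([x, x ** 2, x ** 3])
model = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50000)
model.fit(library, dxdt)
coeffs = model.coef_             # expected: close to [-2, 0, 0]
```

The sparse regression keeps only the linear term, recovering the governing equation from data.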

[LG-147] WET: Overcoming Paraphrasing Vulnerabilities in Embeddings-as-a-Service with Linear Transformation Watermarks

链接: https://arxiv.org/abs/2409.04459
作者: Anudeex Shetty,Qiongkai Xu,Jey Han Lau
关键词-EN: supply embeddings generated, large language model, developers to supply, generated by LLMs, service offered
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Work in Progress

点击查看摘要

Abstract:Embeddings-as-a-Service (EaaS) is a service offered by large language model (LLM) developers to supply embeddings generated by LLMs. Previous research suggests that EaaS is prone to imitation attacks – attacks that clone the underlying EaaS model by training another model on the queried embeddings. As a result, EaaS watermarks are introduced to protect the intellectual property of EaaS providers. In this paper, we first show that existing EaaS watermarks can be removed by paraphrasing when attackers clone the model. Subsequently, we propose a novel watermarking technique that involves linearly transforming the embeddings, and show that it is empirically and theoretically robust against paraphrasing.
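A deliberately simplified sketch of watermarking embeddings with a secret linear map. This toy version only illustrates the mechanism (serve transformed embeddings, verify by recovering the transform); the paper's WET scheme is additionally designed to remain verifiable under paraphrasing, which is not modeled here:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d)) / np.sqrt(d)   # secret watermark transform

def watermark(embeddings):
    """Serve W @ e instead of the raw embedding e."""
    return embeddings @ W.T

def verify(clean, served, tol=1e-6):
    """Provider-side check: recover the linear map from (clean, served)
    pairs by least squares and compare it to the secret W."""
    W_hat, *_ = np.linalg.lstsq(clean, served, rcond=None)
    return np.max(np.abs(W_hat.T - W)) < tol

clean = rng.normal(size=(100, d))
served = watermark(clean)
```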

[LG-148] Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

链接: https://arxiv.org/abs/2409.01382
作者: Musfiqur Rahman,SayedHassan Khatoonabadi,Ahmad Abdellatif,Emad Shihab
关键词-EN: Large Language Models, Large Language, Claude, generating source code, Language Models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to a journal for potential publication

点击查看摘要

Abstract:Using Large Language Models (LLMs) has gained popularity among software developers for generating source code. However, the use of LLM-generated code can introduce risks of adding suboptimal, defective, and vulnerable code. This makes it necessary to devise methods for the accurate detection of LLM-generated code. Toward this goal, we perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on CodeSearchNet dataset. We divide our analyses into two parts: function-level and class-level. We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity. We then analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is. In the following step, we use the unique characteristics of Claude 3-generated code to build Machine Learning (ML) models and identify which features of the code snippets make them more detectable by ML models. Our results indicate that Claude 3 tends to generate longer functions, but shorter classes than humans, and this characteristic can be used to detect Claude 3-generated code with ML models with 82% and 66% accuracies for function-level and class-level snippets, respectively.
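The detection pipeline can be sketched with two of the paper's 22 metrics. The feature distributions below are synthetic stand-ins chosen only to show the pipeline shape (the abstract reports that Claude 3 tends to produce longer functions); the actual study extracts metrics from CodeSearchNet snippets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical features: (function length in lines, cyclomatic complexity)
llm_feats = np.column_stack([rng.normal(40, 5, n), rng.normal(4, 1, n)])
human_feats = np.column_stack([rng.normal(25, 5, n), rng.normal(6, 1, n)])
X = np.vstack([llm_feats, human_feats])
y = np.array([1] * n + [0] * n)          # 1 = LLM-generated, 0 = human-authored

clf = LogisticRegression().fit(X, y)
acc = clf.score(X, y)
```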

[LG-149] An Introduction to Quantum Reinforcement Learning (QRL)

链接: https://arxiv.org/abs/2409.05846
作者: Samuel Yen-Chi Chen
关键词-EN: sparked considerable interest, Recent advancements, Quantum Reinforcement Learning, reinforcement learning, machine learning
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted by The 15th International Conference on ICT Convergence - ICTC 2024

点击查看摘要

Abstract:Recent advancements in quantum computing (QC) and machine learning (ML) have sparked considerable interest in the integration of these two cutting-edge fields. Among the various ML techniques, reinforcement learning (RL) stands out for its ability to address complex sequential decision-making problems. RL has already demonstrated substantial success in the classical ML community. Now, the emerging field of Quantum Reinforcement Learning (QRL) seeks to enhance RL algorithms by incorporating principles from quantum computing. This paper offers an introduction to this exciting area for the broader AI and ML community.

[LG-150] Consensus-based Distributed Quantum Kernel Learning for Speech Recognition

链接: https://arxiv.org/abs/2409.05770
作者: Kuan-Cheng Chen,Wenxuan Ma,Xiaotian Xu
关键词-EN: Quantum Kernel Learning, presents a Consensus-based, Consensus-based Distributed Quantum, quantum computing, CDQKL addresses, Kernel Learning
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a Consensus-based Distributed Quantum Kernel Learning (CDQKL) framework aimed at improving speech recognition through distributed quantum computing. CDQKL addresses the challenges of scalability and data privacy in centralized quantum kernel learning. It does this by distributing computational tasks across quantum terminals, which are connected through classical channels. This approach enables the exchange of model parameters without sharing local training data, thereby maintaining data privacy and enhancing computational efficiency. Experimental evaluations on benchmark speech emotion recognition datasets demonstrate that CDQKL achieves competitive classification accuracy and scalability compared to centralized and local quantum kernel learning models. The distributed nature of CDQKL offers advantages in privacy preservation and computational efficiency, making it suitable for data-sensitive fields such as telecommunications, automotive, and finance. The findings suggest that CDQKL can effectively leverage distributed quantum computing for large-scale machine-learning tasks.
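The parameter-exchange step over classical channels can be illustrated with plain consensus averaging. This is a strong simplification of CDQKL (the quantum kernels are omitted entirely); the mixing matrix and values below are our own toy choices:

```python
import numpy as np

def consensus_round(params, A):
    """One gossip step: each terminal replaces its parameter vector with a
    weighted average of its neighbors' vectors (A is row-stochastic)."""
    return A @ params

# 3 terminals, each holding its own local model parameter; the doubly
# stochastic mixing matrix preserves the global average.
A = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
params = np.array([[1.0], [2.0], [6.0]])   # local model parameters
for _ in range(50):
    params = consensus_round(params, A)
# all terminals converge to the average (3.0) without sharing training data
```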

[LG-151] LLMs Will Always Hallucinate and We Need to Live With This

链接: https://arxiv.org/abs/2409.05746
作者: Sourav Banerjee,Ayushi Agarwal,Saloni Singla
关键词-EN: Large Language Models, inherent limitations critically, Large Language, Language Models, ubiquitous across domains
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Large Language Models become more ubiquitous across domains, it becomes important to examine their inherent limitations critically. This work argues that hallucinations in language models are not just occasional errors but an inevitable feature of these systems. We demonstrate that hallucinations stem from the fundamental mathematical and logical structure of LLMs. It is, therefore, impossible to eliminate them through architectural improvements, dataset enhancements, or fact-checking mechanisms. Our analysis draws on computational theory and Gödel's First Incompleteness Theorem, which references the undecidability of problems like the Halting, Emptiness, and Acceptance Problems. We demonstrate that every stage of the LLM process, from training data compilation to fact retrieval, intent classification, and text generation, will have a non-zero probability of producing hallucinations. This work introduces the concept of Structural Hallucination as an intrinsic nature of these systems. By establishing the mathematical certainty of hallucinations, we challenge the prevailing notion that they can be fully mitigated.

[LG-152] Real-time optimal control of high-dimensional parametrized systems by deep learning-based reduced order models

链接: https://arxiv.org/abs/2409.05709
作者: Matteo Tomasetto,Andrea Manzoni,Francesco Braghin
关键词-EN: reduced order modeling, desired target, short amount, amount of time, time is challenging
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Steering a system towards a desired target in a very short amount of time is challenging from a computational standpoint. Indeed, the intrinsically iterative nature of optimal control problems requires multiple simulations of the physical system to be controlled. Moreover, the control action needs to be updated whenever the underlying scenario undergoes variations. Full-order models based on, e.g., the Finite Element Method, do not meet these requirements due to the computational burden they usually entail. On the other hand, conventional reduced order modeling techniques such as the Reduced Basis method, are intrusive, rely on a linear superimposition of modes, and lack efficiency when addressing nonlinear time-dependent dynamics. In this work, we propose a non-intrusive Deep Learning-based Reduced Order Modeling (DL-ROM) technique for the rapid control of systems described in terms of parametrized PDEs in multiple scenarios. In particular, optimal full-order snapshots are generated and properly reduced by either Proper Orthogonal Decomposition or deep autoencoders (or a combination thereof) while feedforward neural networks are exploited to learn the map from scenario parameters to reduced optimal solutions. Nonlinear dimensionality reduction therefore allows us to consider state variables and control actions that are both low-dimensional and distributed. After (i) data generation, (ii) dimensionality reduction, and (iii) neural networks training in the offline phase, optimal control strategies can be rapidly retrieved in an online phase for any scenario of interest. The computational speedup and the high accuracy obtained with the proposed approach are assessed on different PDE-constrained optimization problems, ranging from the minimization of energy dissipation in incompressible flows modelled through Navier-Stokes equations to the thermal active cooling in heat transfer.

[LG-153] K-Fold Causal BART for CATE Estimation

链接: https://arxiv.org/abs/2409.05665
作者: Hugo Gobato Souto,Francisco Louzada Neto
关键词-EN: Additive Regression Trees, Bayesian Additive Regression, Conditional Average Treatment, Average Treatment Effects, Conditional Average
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This research aims to propose and evaluate a novel model named K-Fold Causal Bayesian Additive Regression Trees (K-Fold Causal BART) for improved estimation of Average Treatment Effects (ATE) and Conditional Average Treatment Effects (CATE). The study employs synthetic and semi-synthetic datasets, including the widely recognized Infant Health and Development Program (IHDP) benchmark dataset, to validate the model's performance. Despite promising results in synthetic scenarios, the IHDP dataset reveals that the proposed model is not state-of-the-art for ATE and CATE estimation. Nonetheless, the research provides several novel insights: 1. The ps-BART model is likely the preferred choice for CATE and ATE estimation due to better generalization compared to the other benchmark models, including the Bayesian Causal Forest (BCF) model, which many consider the current best model for CATE estimation, 2. The BCF model's performance deteriorates significantly with increasing treatment effect heterogeneity, while the ps-BART model remains robust, 3. Models tend to be overconfident in CATE uncertainty quantification when treatment effect heterogeneity is low, 4. A second K-Fold method is unnecessary for avoiding overfitting in CATE estimation, as it adds computational costs without improving performance, 5. Detailed analysis reveals the importance of understanding dataset characteristics and using nuanced evaluation methods, 6. The conclusion of Curth et al. (2021) that indirect strategies for CATE estimation are superior for the IHDP dataset is contradicted by the results of this research. These findings challenge existing assumptions and suggest directions for future research to enhance causal inference methodologies.

[LG-154] Optimal Projections for Classification with Naive Bayes

链接: https://arxiv.org/abs/2409.05635
作者: David P. Hofmeyr,Francois Kamper,Michail M. Melonas
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-155] When resampling/reweighting improves feature learning in imbalanced classification?: A toy-model study

链接: https://arxiv.org/abs/2409.05598
作者: Tomoyuki Obuchi,Toshiyuki Tanaka
关键词-EN:
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 30 pages, 14 figures

点击查看摘要

[LG-156] Approximation Bounds for Recurrent Neural Networks with Application to Regression

链接: https://arxiv.org/abs/2409.05577
作者: Yuling Jiao,Yang Wang,Bokai Yan
关键词-EN: recurrent neural networks, deep ReLU recurrent, ReLU recurrent neural, neural networks, capacity of deep
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the approximation capacity of deep ReLU recurrent neural networks (RNNs) and explore the convergence properties of nonparametric least squares regression using RNNs. We derive upper bounds on the approximation error of RNNs for Hölder smooth functions, in the sense that the output at each time step of an RNN can approximate a Hölder function that depends only on past and current information, termed a past-dependent function. This allows a carefully constructed RNN to simultaneously approximate a sequence of past-dependent Hölder functions. We apply these approximation results to derive non-asymptotic upper bounds for the prediction error of the empirical risk minimizer in the regression problem. Our error bounds achieve the minimax optimal rate under both exponentially \beta-mixing and i.i.d. data assumptions, improving upon existing ones. Our results provide statistical guarantees on the performance of RNNs.

[LG-157] Advancing Machine Learning for Stellar Activity and Exoplanet Period Rotation

链接: https://arxiv.org/abs/2409.05482
作者: Fatemeh Fazel Hesar,Bernard Foing,Ana M. Heras,Mojtaba Raouf,Victoria Foing,Shima Javanmardi,Fons J. Verbeek
关键词-EN: NASA Kepler mission, light curve data, corrected light curve, curve data obtained, NASA Kepler
类目: Solar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures. Submitted for publication in AA

点击查看摘要

Abstract:This study applied machine learning models to estimate stellar rotation periods from corrected light curve data obtained by the NASA Kepler mission. Traditional methods often struggle to estimate rotation periods accurately due to noise and variability in the light curve data. The workflow involved using initial period estimates from the LS-Periodogram and Transit Least Squares techniques, followed by splitting the data into training, validation, and testing sets. We employed several machine learning algorithms, including Decision Tree, Random Forest, K-Nearest Neighbors, and Gradient Boosting, and also utilized a Voting Ensemble approach to improve prediction accuracy and robustness. The analysis included data from multiple Kepler IDs, providing detailed metrics on orbital periods and planet radii. Performance evaluation showed that the Voting Ensemble model yielded the most accurate results, with an RMSE approximately 50% lower than the Decision Tree model and 17% better than the K-Nearest Neighbors model. The Random Forest model performed comparably to the Voting Ensemble, indicating high accuracy. In contrast, the Gradient Boosting model exhibited a worse RMSE compared to the other approaches. Comparisons of the predicted rotation periods to the photometric reference periods showed close alignment, suggesting the machine learning models achieved high prediction accuracy. The results indicate that machine learning, particularly ensemble methods, can effectively solve the problem of accurately estimating stellar rotation periods, with significant implications for advancing the study of exoplanets and stellar astrophysics.
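The ensemble step can be sketched with scikit-learn's VotingRegressor, which averages the predictions of the same four base models named above. The features and target below are synthetic stand-ins, not the Kepler-derived data:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))                   # stand-in light-curve features
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.1, 300)   # stand-in rotation period

ensemble = VotingRegressor([
    ("tree", DecisionTreeRegressor(random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("gb", GradientBoostingRegressor(random_state=0)),
])
ensemble.fit(X[:250], y[:250])
rmse = float(np.sqrt(np.mean((ensemble.predict(X[250:]) - y[250:]) ** 2)))
```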

[LG-158] Reinforcement Learning for Variational Quantum Circuits Design

链接: https://arxiv.org/abs/2409.05475
作者: Simone Foderà,Gloria Turati,Riccardo Nembrini,Maurizio Ferrari Dacrema,Paolo Cremonesi
关键词-EN: Maximum Cut problems, Maximum Cut, Variational Quantum Algorithms, emerged as promising, promising tools
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variational Quantum Algorithms have emerged as promising tools for solving optimization problems on quantum computers. These algorithms leverage a parametric quantum circuit called ansatz, where its parameters are adjusted by a classical optimizer with the goal of optimizing a certain cost function. However, a significant challenge lies in designing effective circuits for addressing specific problems. In this study, we leverage the powerful and flexible Reinforcement Learning paradigm to train an agent capable of autonomously generating quantum circuits that can be used as ansatzes in variational algorithms to solve optimization problems. The agent is trained on diverse problem instances, including Maximum Cut, Maximum Clique and Minimum Vertex Cover, built from different graph topologies and sizes. Our analysis of the circuits generated by the agent and the corresponding solutions shows that the proposed method is able to generate effective ansatzes. While our goal is not to propose any new specific ansatz, we observe how the agent has discovered a novel family of ansatzes effective for Maximum Cut problems, which we call R_{yz}-connected. We study the characteristics of one of these ansatzes by comparing it against state-of-the-art quantum algorithms across instances of varying graph topologies, sizes, and problem types. Our results indicate that the R_{yz}-connected circuit achieves high approximation ratios for Maximum Cut problems, further validating our proposed agent. In conclusion, our study highlights the potential of Reinforcement Learning techniques in assisting researchers to design effective quantum circuits which could have applications in a wide range of tasks.

[LG-159] Recursive Nested Filtering for Efficient Amortized Bayesian Experimental Design

链接: https://arxiv.org/abs/2409.05354
作者: Sahel Iqbal,Hany Abdulsamad,Sara Pérez-Vieites,Simo Särkkä,Adrien Corenflos
关键词-EN: Nested Particle Filter, Inside-Out Nested Particle, Particle Filter, Nested Particle, Inside-Out Nested
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a novel, fully recursive, algorithm for amortized sequential Bayesian experimental design in the non-exchangeable setting. We frame policy optimization as maximum likelihood estimation in a non-Markovian state-space model, achieving (at most) \mathcal{O}(T^2) computational complexity in the number of experiments. We provide theoretical convergence guarantees and introduce a backward sampling algorithm to reduce trajectory degeneracy. IO-NPF offers a practical, extensible, and provably consistent approach to sequential Bayesian experimental design, demonstrating improved efficiency over existing methods.

[LG-160] Robust Non-adaptive Group Testing under Errors in Group Membership Specifications

链接: https://arxiv.org/abs/2409.05345
作者: Shuvayan Banerjee,Radhendushka Srivastava,James Saunderson,Ajit Rajwade
关键词-EN: aims to determine, group membership specification, determine their defect, formed by mixing, samples
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given p samples, each of which may or may not be defective, group testing (GT) aims to determine their defect status by performing tests on n < p "groups", where a group is formed by mixing a subset of the p samples. Assuming that the number of defective samples is very small compared to p, GT algorithms have provided excellent recovery of the status of all p samples with even a small number of groups. Most existing methods, however, assume that the group memberships are accurately specified. This assumption may not always be true in all applications, due to various resource constraints. Such errors could occur, e.g., when a technician, preparing the groups in a laboratory, unknowingly mixes together an incorrect subset of samples as compared to what was specified. We develop a new GT method, the Debiased Robust Lasso Test Method (DRLT), that handles such group membership specification errors. The proposed DRLT method is based on an approach to debias, or reduce the inherent bias in, estimates produced by Lasso, a popular and effective sparse regression technique. We also provide theoretical upper bounds on the reconstruction error produced by our estimator. Our approach is then combined with two carefully designed hypothesis tests respectively for (i) the identification of defective samples in the presence of errors in group membership specifications, and (ii) the identification of groups with erroneous membership specifications. The DRLT approach extends the literature on bias mitigation of statistical estimators such as the LASSO to handle the important case when some of the measurements contain outliers, due to factors such as group membership specification errors. We present numerical results which show that our approach outperforms several baselines and robust regression techniques for identification of defective samples as well as erroneously specified groups.
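The basic Lasso-on-pooled-measurements pipeline that DRLT builds on can be sketched as follows. The debiasing step and the two hypothesis tests are omitted, and all problem sizes below are our own toy choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy quantitative group testing: n pooled measurements of p samples,
# with a sparse defect vector recovered by sparse regression.
rng = np.random.default_rng(0)
p, n, k = 100, 40, 3                          # samples, groups, defectives
A = rng.binomial(1, 0.3, size=(n, p))         # group membership matrix
x = np.zeros(p)
x[rng.choice(p, size=k, replace=False)] = 1.0 # defective samples
y = A @ x + 0.01 * rng.normal(size=n)         # noisy pooled measurements

lasso = Lasso(alpha=0.01, positive=True, max_iter=50000)
lasso.fit(A, y)
recovered = set(np.flatnonzero(lasso.coef_ > 0.5))
```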

[LG-161] Label-free evaluation of lung and heart transplant biopsies using virtual staining

链接: https://arxiv.org/abs/2409.05255
作者: Yuzhu Li,Nir Pillar,Tairan Liu,Guangdong Ma,Yuxuan Qi,Kevin de Haan,Yijie Zhang,Xilin Yang,Adrian J. Correa,Guangqian Xiao,Kuang-Yu Jen,Kenneth A. Iczkowski,Yulun Wu,William Dean Wallace,Aydogan Ozcan
关键词-EN: end-stage organ failures, primary therapeutic strategy, Organ transplantation serves, Organ transplantation, organ failures
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 Pages, 5 Figures

点击查看摘要

Abstract:Organ transplantation serves as the primary therapeutic strategy for end-stage organ failures. However, allograft rejection is a common complication of organ transplantation. Histological assessment is essential for the timely detection and diagnosis of transplant rejection and remains the gold standard. Nevertheless, the traditional histochemical staining process is time-consuming, costly, and labor-intensive. Here, we present a panel of virtual staining neural networks for lung and heart transplant biopsies, which digitally convert autofluorescence microscopic images of label-free tissue sections into their brightfield histologically stained counterparts, bypassing the traditional histochemical staining process. Specifically, we virtually generated Hematoxylin and Eosin (HE), Masson’s Trichrome (MT), and Elastic Verhoeff-Van Gieson (EVG) stains for label-free transplant lung tissue, along with HE and MT stains for label-free transplant heart tissue. Subsequent blind evaluations conducted by three board-certified pathologists have confirmed that the virtual staining networks consistently produce high-quality histology images with high color uniformity, closely resembling their well-stained histochemical counterparts across various tissue features. The use of virtually stained images for the evaluation of transplant biopsies achieved comparable diagnostic outcomes to those obtained via traditional histochemical staining, with a concordance rate of 82.4% for lung samples and 91.7% for heart samples. Moreover, virtual staining models create multiple stains from the same autofluorescence input, eliminating structural mismatches observed between adjacent sections stained in the traditional workflow, while also saving tissue, expert time, and staining costs.

[LG-162] Empowering Bayesian Neural Networks with Functional Priors through Anchored Ensembling for Mechanics Surrogate Modeling Applications

链接: https://arxiv.org/abs/2409.05234
作者: Javad Ghorbanian,Nicholas Casaprima,Audrey Olivier
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 14 figures

点击查看摘要

[LG-163] Bellwether Trades: Characteristics of Trades influential in Predicting Future Price Movements in Markets

链接: https://arxiv.org/abs/2409.05192
作者: Tejas Ramdas,Martin T. Wells
关键词-EN: leverage powerful non-linear, powerful non-linear machine, non-linear machine learning, machine learning methods, optimized neural network
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注: 49 Pages

点击查看摘要

Abstract:In this study, we leverage powerful non-linear machine learning methods to identify the characteristics of trades that contain valuable information. First, we demonstrate the effectiveness of our optimized neural network predictor in accurately predicting future market movements. Then, we utilize the information from this successful neural network predictor to pinpoint the individual trades within each data point (trading window) that had the most impact on the optimized neural network’s prediction of future price movements. This approach helps us uncover important insights about the heterogeneity in information content provided by trades of different sizes, venues, trading contexts, and over time.

[LG-164] Generalization of Geometric Graph Neural Networks

链接: https://arxiv.org/abs/2409.05191
作者: Zhiyang Wang,Juan Cervino,Alejandro Ribeiro
关键词-EN: sampled points, geometric graph constructed, generalization gap, underlying manifold, geometric graph
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures. arXiv admin note: text overlap with arXiv:2406.05225

点击查看摘要

Abstract:In this paper, we study the generalization capabilities of geometric graph neural networks (GNNs). We consider GNNs over a geometric graph constructed from a finite set of randomly sampled points over an embedded manifold with topological information captured. We prove a generalization gap between the optimal empirical risk and the optimal statistical risk of this GNN, which decreases with the number of sampled points from the manifold and increases with the dimension of the underlying manifold. This generalization gap ensures that a GNN trained on a graph over a set of sampled points can be utilized to process other unseen graphs constructed from the same underlying manifold. The most important observation is that the generalization capability can be realized with one large graph instead of being limited to the size of the graph as in previous results. The generalization gap is derived based on the non-asymptotic convergence result of a GNN on the sampled graph to the underlying manifold neural networks (MNNs). We verify this theoretical result with experiments on both the Arxiv and Cora datasets.
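As a toy illustration of the setup, one can sample points from a simple manifold (the unit circle), build a geometric graph with Gaussian edge weights, and run one degree-normalized neighbor-aggregation step; this sketch is only meant to show the construction, not the paper's architecture, and all constants are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
num_points = 50
theta = rng.uniform(0.0, 2.0 * np.pi, num_points)
pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # samples on the circle S^1

# epsilon-neighborhood geometric graph with Gaussian edge weights
eps, bandwidth = 0.5, 0.1
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
W = np.exp(-dist**2 / (2.0 * bandwidth)) * (dist < eps)
np.fill_diagonal(W, 0.0)

# one degree-normalized aggregation step (the basic GNN building block)
deg = W.sum(axis=1, keepdims=True)
feat = pts[:, :1]                        # toy node feature: the x-coordinate
feat_agg = (W @ feat) / np.maximum(deg, 1e-12)
```

Because neighbors on the manifold are geometrically close, the aggregated feature stays close to the original smooth signal, which is the intuition behind the graph-to-manifold convergence the paper formalizes.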

[LG-165] Learning to Classify Quantum Phases of Matter with a Few Measurements

链接: https://arxiv.org/abs/2409.05188
作者: Mehran Khosrojerdi,Jason L. Pereira,Alessandro Cuccoli,Leonardo Banchi
关键词-EN:
类目: Quantum Physics (quant-ph); Other Condensed Matter (cond-mat.other); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-166] Sliding-Window Thompson Sampling for Non-Stationary Settings

链接: https://arxiv.org/abs/2409.05181
作者: Marco Fiandri,Alberto Maria Metelli,Francesco Trovò
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures

点击查看摘要

[LG-167] Learning polycrystal plasticity using mesh-based subgraph geometric deep learning

链接: https://arxiv.org/abs/2409.05169
作者: Hanfeng Zhai
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 30 pages, 17 figures

点击查看摘要

[LG-168] QuantFactor REINFORCE: Mining Steady Formulaic Alpha Factors with Variance-bounded REINFORCE

链接: https://arxiv.org/abs/2409.05144
作者: Junjie Zhao,Chengxi Zhang,Min Qin,Peng Yang
关键词-EN: discover indicative signals, alpha factor mining, Alpha factors, historical financial market, factor mining methods
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:The goal of alpha factor mining is to discover indicative signals of investment opportunities from the historical financial market data of assets. Deep learning based alpha factor mining methods have been shown to be powerful but lack interpretability, making them unacceptable in risk-sensitive real markets. Alpha factors in formulaic forms are more interpretable and therefore favored by market participants, but the search space is complex, and powerful explorative methods are needed. Recently, a promising framework was proposed for generating formulaic alpha factors using deep reinforcement learning, and it quickly gained research focus from both academia and industry. This paper first argues that the originally employed policy training method, i.e., Proximal Policy Optimization (PPO), faces several important issues in the context of alpha factor mining, making it ineffective at exploring the search space of formulas. Herein, a novel reinforcement learning method based on the well-known REINFORCE algorithm is proposed. Given that the underlying state transition function adheres to the Dirac distribution, the Markov Decision Process within this framework exhibits minimal environmental variability, making REINFORCE more appropriate than PPO. A new dedicated baseline is designed to theoretically reduce the commonly suffered high variance of REINFORCE. Moreover, the information ratio is introduced as a reward shaping mechanism to encourage the generation of steady alpha factors that can better adapt to changes in market volatility. Experimental evaluations on various real asset data show that the proposed algorithm can increase the correlation with asset returns by 3.83% and obtain stronger excess returns compared to the latest alpha factor mining methods, which aligns well with the theoretical results.
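The variance-reduction argument for a baseline can be seen in a few lines: subtracting a constant baseline from the reward leaves the policy-gradient estimate unbiased while shrinking its per-sample variance. The toy two-armed-bandit sketch below illustrates only this generic mechanism; the paper's dedicated baseline and information-ratio reward shaping are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# two-armed bandit: action 1 pays more (deterministically, for simplicity)
logits = np.array([0.0, 0.0])          # uniform softmax policy
rewards = np.array([0.2, 1.0])

def grad_log_pi(action, logits):
    """Gradient of log softmax(logits)[action] w.r.t. the logits."""
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.eye(len(logits))[action] - probs

num_samples = 10000
# sampling uniformly matches the uniform policy, so this stays on-policy
actions = rng.integers(0, 2, size=num_samples)
baseline = rewards.mean()              # a simple constant baseline

g_plain = np.array([rewards[a] * grad_log_pi(a, logits) for a in actions])
g_base = np.array([(rewards[a] - baseline) * grad_log_pi(a, logits) for a in actions])

# the baseline leaves the mean gradient estimate (essentially) intact...
mean_gap = abs(g_plain[:, 0].mean() - g_base[:, 0].mean())
# ...while shrinking the per-sample variance
var_plain, var_base = g_plain[:, 0].var(), g_base[:, 0].var()
```

Since E[b · ∇log π] = b · ∇ Σ_a π(a) = 0 for any constant b, the baseline changes no expectation, only the spread of the estimator.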

[LG-169] Revisiting Trace Norm Minimization for Tensor Tucker Completion: A Direct Multilinear Rank Learning Approach

链接: https://arxiv.org/abs/2409.05139
作者: Xueke Tong,Hanchen Zhu,Lei Cheng,Yik-Chung Wu
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-170] Time-Distributed Feature Learning for Internet of Things Network Traffic Classification

链接: https://arxiv.org/abs/2409.05096
作者: Yoga Suhas Kuruba Manjunath,Sihao Zhao,Xiao-Ping Zhang,Lian Zhao
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-171] Machine Learning-Based Prediction of Key Genes Correlated to the Subretinal Lesion Severity in a Mouse Model of Age-Related Macular Degeneration

链接: https://arxiv.org/abs/2409.05047
作者: Kuan Yan,Yue Zeng,Dai Shi,Ting Zhang,Dmytro Matsypura,Mark C. Gillies,Ling Zhu,Junbin Gao
关键词-EN: Age-related macular degeneration, severely affecting vision, Age-related macular, macular degeneration, older adults
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Age-related macular degeneration (AMD) is a major cause of blindness in older adults, severely affecting vision and quality of life. Despite advances in understanding AMD, the molecular factors driving the severity of subretinal scarring (fibrosis) remain elusive, hampering the development of effective therapies. This study introduces a machine learning-based framework to predict key genes that are strongly correlated with lesion severity and to identify potential therapeutic targets to prevent subretinal fibrosis in AMD. Using an original RNA sequencing (RNA-seq) dataset from the diseased retinas of JR5558 mice, we developed a novel and specific feature engineering technique, including pathway-based dimensionality reduction and gene-based feature expansion, to enhance prediction accuracy. Two iterative experiments were conducted by leveraging Ridge and ElasticNet regression models to assess biological relevance and gene impact. The results highlight the biological significance of several key genes and demonstrate the framework’s effectiveness in identifying novel therapeutic targets. The key findings provide valuable insights for advancing drug discovery efforts and improving treatment strategies for AMD, with the potential to enhance patient outcomes by targeting the underlying genetic mechanisms of subretinal lesion development.
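The regression step of such a framework can be illustrated with a closed-form ridge fit on synthetic data, ranking genes by coefficient magnitude. This is only a sketch of the modeling idea: the gene index, sizes, and noise level below are made up, and the paper's pathway-based feature engineering and ElasticNet variant are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 10
X = rng.normal(size=(n_samples, n_genes))                     # toy expression matrix
severity = 2.0 * X[:, 4] + 0.1 * rng.normal(size=n_samples)   # gene 4 drives severity

# closed-form ridge regression: w = (X^T X + alpha * I)^{-1} X^T y
alpha = 1.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(n_genes), X.T @ severity)
top_gene = int(np.argmax(np.abs(w)))                          # candidate key gene
```

Ranking |w| recovers the informative gene here because the design is well-conditioned; with correlated pathways, regularization choice (Ridge vs. ElasticNet) matters much more.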

[LG-172] Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection

链接: https://arxiv.org/abs/2409.05032
作者: Theophile Stourbe,Victor Miara,Theo Lepage,Reda Dehak
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

[LG-173] Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis

链接: https://arxiv.org/abs/2409.05023
作者: Ruinan Jin,Xiaoyu Wang,Baoxiang Wang
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 38 pages

点击查看摘要

[LG-174] Learning nonnegative matrix factorizations from compressed data

链接: https://arxiv.org/abs/2409.04994
作者: Abraar Chaudhry,Elizaveta Rebrova
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

[LG-175] A foundation model enpowered by a multi-modal prompt engine for universal seismic geobody interpretation across surveys

链接: https://arxiv.org/abs/2409.04962
作者: Hang Gao,Xinming Wu,Luming Liang,Hanlin Sheng,Xu Si,Gao Hui,Yaxing Li
关键词-EN:
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-176] Anomaly Detection for Real-World Cyber-Physical Security using Quantum Hybrid Support Vector Machines

链接: https://arxiv.org/abs/2409.04935
作者: Tyler Cultice,Md. Saif Hassan Onim,Annarita Giani,Himanshu Thapliyal
关键词-EN:
类目: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, 2 tables, under ISVLSI 2024 proceedings

点击查看摘要

[LG-177] Single-snapshot machine learning for turbulence super resolution

链接: https://arxiv.org/abs/2409.04923
作者: Kai Fukami,Kunihiko Taira
关键词-EN:
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

[LG-178] Nearest Neighbor CCP-Based Molecular Sequence Analysis

链接: https://arxiv.org/abs/2409.04922
作者: Sarwan Ali,Prakash Chourasia,Bipin Koirala,Murray Patterson
关键词-EN:
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-179] Efficient Training of Transformers for Molecule Property Prediction on Small-scale Datasets

链接: https://arxiv.org/abs/2409.04909
作者: Shivesh Prakash
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-180] SPIRIT: Low Power Seizure Prediction using Unsupervised Online-Learning and Zoom Analog Frontends

链接: https://arxiv.org/abs/2409.04838
作者: Aviral Pandey,Adelson Chua,Ryan Kaveh,Justin Doong,Rikky Muller
关键词-EN: improving patients’ quality, Early prediction, quality of life, vital for improving, improving patients’
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Early prediction of seizures and timely interventions are vital for improving patients’ quality of life. While seizure prediction has been shown in software-based implementations, to enable timely warnings of upcoming seizures, prediction must be done on an edge device to reduce latency. Ideally, such devices must also be low-power and track long-term drifts to minimize maintenance from the user. This work presents SPIRIT: Stochastic-gradient-descent-based Predictor with Integrated Retraining and In situ accuracy Tuning. SPIRIT is a complete system-on-a-chip (SoC) integrating an unsupervised online-learning seizure prediction classifier with eight 14.4 uW, 0.057 mm2, 90.5 dB dynamic range, Zoom Analog Frontends. SPIRIT achieves, on average, 97.5%/96.2% sensitivity/specificity respectively, predicting seizures an average of 8.4 minutes before they occur. Through its online learning algorithm, prediction accuracy improves by up to 15%, and prediction times extend by up to 7x, without any external intervention. Its classifier consumes 17.2 uW and occupies 0.14 mm2, the lowest reported for a prediction classifier by 134x in power and 5x in area. SPIRIT is also at least 5.6x more energy efficient than the state-of-the-art.
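The online-learning component can be sketched in software as a streaming classifier updated one sample at a time with SGD. Below is a hedged logistic-regression stand-in on synthetic two-cluster features; the chip's actual unsupervised classifier and analog frontend are of course not modeled, and all distributions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# streaming logistic regression, updated one sample at a time (SGD)
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    label = int(rng.integers(0, 2))
    # two well-separated synthetic feature clusters stand in for EEG features
    x = rng.normal(loc=2.0 if label else -2.0, scale=0.5, size=2)
    grad = sigmoid(w @ x + b) - label   # dLoss/dlogit for log loss
    w -= lr * grad * x                  # in-place online update, no batch storage
    b -= lr * grad

# accuracy on fresh streaming samples
hits = 0
for _ in range(200):
    label = int(rng.integers(0, 2))
    x = rng.normal(loc=2.0 if label else -2.0, scale=0.5, size=2)
    hits += int((sigmoid(w @ x + b) > 0.5) == bool(label))
accuracy = hits / 200
```

The appeal for edge hardware is that each update touches only the current sample, so the model tracks slow signal drift without retraining from stored data.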

[LG-181] CrysAtom: Distributed Representation of Atoms for Crystal Property Prediction

链接: https://arxiv.org/abs/2409.04737
作者: Shrimon Mukherjee,Madhusudan Ghosh,Partha Basuchowdhuri
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-182] NapTune: Efficient Model Tuning for Mood Classification using Previous Night's Sleep Measures along with Wearable Time-series

链接: https://arxiv.org/abs/2409.04723
作者: Debaditya Shome,Nasim Montazeri Ghahjaverestan,Ali Etemad
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ICMI 2024

点击查看摘要

[LG-183] Enhancing Quantum Security over Federated Learning via Post-Quantum Cryptography

链接: https://arxiv.org/abs/2409.04637
作者: Pingzhi Li,Tianlong Chen,Junyu Liu
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submission for IEEE 2024 IEEE Workshop on Quantum IntelLigence, Learning Security (QUILLS), this https URL

点击查看摘要

[LG-184] Training quantum machine learning model on cloud without uploading the data

链接: https://arxiv.org/abs/2409.04602
作者: Guang Ping He
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 1 figure

点击查看摘要

[LG-185] DeepTTV: Deep Learning Prediction of Hidden Exoplanet From Transit Timing Variations

链接: https://arxiv.org/abs/2409.04557
作者: Chen Chen,Lingkai Kong,Gongjie Li,Molei Tao
关键词-EN:
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures and 5 tables submitted to AAS journals, comments welcome

点击查看摘要

[LG-186] The role of data embedding in quantum autoencoders for improved anomaly detection

链接: https://arxiv.org/abs/2409.04519
作者: Jack Y. Araz,Michael Spannowsky
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 8 pages, 5 figures, 4 tables

点击查看摘要

[LG-187] Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm

链接: https://arxiv.org/abs/2409.04500
作者: R. Teal Witter,Christopher Musco
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-188] Protein sequence classification using natural language processing techniques

链接: https://arxiv.org/abs/2409.04491
作者: Huma Perveen(1),Julie Weeds(2) ((1) School of Mathematical and Physical Sciences, University of Sussex, Brighton, UK, (2) School of Engineering and Informatics, University of Sussex, Brighton, UK)
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-189] Large Language Models in Drug Discovery and Development: From Disease Mechanisms to Clinical Trials

链接: https://arxiv.org/abs/2409.04481
作者: Yizhen Zheng,Huan Yee Koh,Maddie Yang,Li Li,Lauren T. May,Geoffrey I. Webb,Shirui Pan,George Church
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-190] Pattern based learning and optimisation through pricing for bin packing problem

链接: https://arxiv.org/abs/2409.04456
作者: Huayan Zhang,Ruibin Bai,Tie-Yan Liu,Jiawei Li,Bingchen Lin,Jianfeng Ren
关键词-EN:
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

信息检索

[IR-0] Benchmarking Chinese Knowledge Rectification in Large Language Models

链接: https://arxiv.org/abs/2409.05806
作者: Tianhe Lu,Jizhan Fang,Yunzhi Yao,Xin Xu,Ningyu Zhang,Huajun Chen
关键词-EN: Large Language Models, exhibit remarkable generative, remarkable generative capabilities, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Ongoing work; code and dataset are available at this https URL

点击查看摘要

Abstract:While Large Language Models (LLMs) exhibit remarkable generative capabilities, they are not without flaws, particularly in the form of hallucinations. This issue is even more pronounced when LLMs are applied to specific languages and domains. For example, LLMs may generate nonsense information when handling Chinese ancient poetry, proverbs, or idioms, owing to the lack of specific knowledge. To this end, this paper introduces a benchmark for rectifying Chinese knowledge in LLMs via knowledge editing. Specifically, we introduce a new Chinese dataset, CKnowEdit, by collecting seven types of knowledge from various sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, thereby accounting for the unique polyphony, antithesis, and logical constructs inherent in the Chinese language. Through the analysis of this dataset, we uncover the challenges faced by current LLMs in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques on this dataset unveils the substantial scope for advancement in the rectification of Chinese knowledge. Code and dataset are available at this https URL.

[IR-1] Extracting the U.S. building types from OpenStreetMap data

链接: https://arxiv.org/abs/2409.05692
作者: Henrique F. de Arruda,Sandro M. Reia,Shiyang Ruan,Kuldip S. Atwal,Hamdi Kavak,Taylor Anderson,Dieter Pfoser
关键词-EN: emergency response applications, traffic planning, population estimation, response applications, crucial for population
类目: ocial and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building type information is crucial for population estimation, traffic planning, urban planning, and emergency response applications. Although essential, such data is often not readily available. To alleviate this problem, this work creates a comprehensive dataset by providing residential/non-residential building classification covering the entire United States. We propose and utilize an unsupervised machine learning method to classify building types based on building footprints and available OpenStreetMap information. The classification result is validated using authoritative ground truth data for select counties in the U.S. The validation shows a high precision for non-residential building classification and a high recall for residential buildings. We identified various approaches to improving the quality of the classification, such as removing sheds and garages from the dataset. Furthermore, analyzing the misclassifications revealed that they are mainly due to missing and scarce metadata in OSM. A major result of this work is the resulting dataset of classifying 67,705,475 buildings. We hope that this data is of value to the scientific community, including urban and transportation planners.
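The unsupervised classification idea can be sketched with a tiny 1-D k-means over building footprint areas. The real method uses far richer OpenStreetMap features and footprint geometry, so treat this purely as an illustration with made-up numbers.

```python
import numpy as np

# footprint areas (m^2): small residential vs. large non-residential buildings
areas = np.array([80.0, 95.0, 110.0, 70.0, 2400.0, 3100.0, 2800.0])

# plain 1-D k-means with k=2, initialized at the extremes
centers = np.array([areas.min(), areas.max()])
for _ in range(20):
    # assign each building to its nearest cluster center
    labels = np.argmin(np.abs(areas[:, None] - centers[None, :]), axis=1)
    # recompute centers as cluster means
    for c in range(2):
        if np.any(labels == c):
            centers[c] = areas[labels == c].mean()
```

On this toy data the two clusters separate residential-scale from non-residential-scale footprints; the paper's validation against county ground truth plays the role these hand-picked numbers play here.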

[IR-2] RegNLP in Action: Facilitating Compliance Through Automated Information Retrieval and Answer Generation

链接: https://arxiv.org/abs/2409.05677
作者: Tuba Gokhan,Kexin Wang,Iryna Gurevych,Ted Briscoe
关键词-EN: governmental regulatory bodies, Natural Language Processing, compliance.Regulatory Natural Language, Regulatory Information Retrieval, issued by governmental
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Regulatory documents, issued by governmental regulatory bodies, establish rules, guidelines, and standards that organizations must adhere to for legal compliance. These documents, characterized by their length, complexity and frequent updates, are challenging to interpret, requiring significant allocation of time and expertise on the part of organizations to ensure ongoing compliance. Regulatory Natural Language Processing (RegNLP) is a multidisciplinary subfield aimed at simplifying access to and interpretation of regulatory rules and obligations. We define an Automated Question-Passage Generation task for RegNLP, create the ObliQA dataset containing 27,869 questions derived from the Abu Dhabi Global Markets (ADGM) financial regulation document collection, design a baseline Regulatory Information Retrieval and Answer Generation system, and evaluate it with RePASs, a novel evaluation metric that tests whether generated answers accurately capture all relevant obligations and avoid contradictions.

[IR-3] Enhancing Graph Contrastive Learning with Reliable and Informative Augmentation for Recommendation

链接: https://arxiv.org/abs/2409.05633
作者: Bowen Zheng,Junjie Zhang,Hongyu Lu,Yu Chen,Ming Chen,Wayne Xin Zhao,Ji-Rong Wen
关键词-EN: Graph neural network, discrete codes, collaborative information, high-order user-item relationships, contrastive
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Graph neural network (GNN) has been a powerful approach in collaborative filtering (CF) due to its ability to model high-order user-item relationships. Recently, to alleviate the data sparsity and enhance representation learning, many efforts have been conducted to integrate contrastive learning (CL) with GNNs. Despite the promising improvements, the contrastive view generation based on structure and representation perturbations in existing methods potentially disrupts the collaborative information in contrastive views, resulting in limited effectiveness of positive alignment. To overcome this issue, we propose CoGCL, a novel framework that aims to enhance graph contrastive learning by constructing contrastive views with stronger collaborative information via discrete codes. The core idea is to map users and items into discrete codes rich in collaborative information for reliable and informative contrastive view generation. To this end, we initially introduce a multi-level vector quantizer in an end-to-end manner to quantize user and item representations into discrete codes. Based on these discrete codes, we enhance the collaborative information of contrastive views by considering neighborhood structure and semantic relevance respectively. For neighborhood structure, we propose virtual neighbor augmentation by treating discrete codes as virtual neighbors, which expands an observed user-item interaction into multiple edges involving discrete codes. Regarding semantic relevance, we identify similar users/items based on shared discrete codes and interaction targets to generate the semantically relevant view. Through these strategies, we construct contrastive views with stronger collaborative information and develop a triple-view graph contrastive learning approach. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed approach.
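The core quantization step — mapping continuous user/item representations to discrete codes — reduces to nearest-neighbor search against a codebook. A minimal numpy sketch with a made-up codebook is below; the paper trains the multi-level quantizer end-to-end, none of which is shown here.

```python
import numpy as np

# a toy codebook: each row is one discrete code's vector
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [0.0, 1.0]])

def quantize(vecs, codebook):
    """Map each representation to the index of its nearest code vector."""
    d = np.linalg.norm(vecs[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

user_reprs = np.array([[0.1, -0.1],   # near code 0
                       [0.9, 1.2],    # near code 1
                       [0.1, 0.9]])   # near code 2
codes = quantize(user_reprs, codebook)
```

Users or items that share a code can then be treated as "virtual neighbors" when building contrastive views, which is the augmentation idea the abstract describes.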

[IR-4] Rs4rs: Semantically Find Recent Publications from Top Recommendation System-Related Venues

链接: https://arxiv.org/abs/2409.05570
作者: Tri Kurniawan Wijaya,Edoardo D’Amico,Gabor Fodor,Manuel V. Loureiro
关键词-EN: web application designed, perform semantic search, Recommender Systems, top Recommender Systems, Google Scholar
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Rs4rs is a web application designed to perform semantic search on recent papers from top conferences and journals related to Recommender Systems. Current scholarly search engine tools like Google Scholar, Semantic Scholar, and ResearchGate often yield broad results that fail to target the most relevant high-quality publications. Moreover, manually visiting individual conference and journal websites is a time-consuming process that primarily supports only syntactic searches. Rs4rs addresses these issues by providing a user-friendly platform where researchers can input their topic of interest and receive a list of recent, relevant papers from top Recommender Systems venues. Utilizing semantic search techniques, Rs4rs ensures that the search results are not only precise and relevant but also comprehensive, capturing papers regardless of variations in wording. This tool significantly enhances research efficiency and accuracy, thereby benefitting the research community and public by facilitating access to high-quality, pertinent academic resources in the field of Recommender Systems. Rs4rs is available at this https URL.
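The retrieval core of such a tool is embedding papers and queries into a shared vector space and ranking by cosine similarity. A minimal sketch using bag-of-words counts as stand-in embeddings follows; a real semantic-search system like Rs4rs would use a learned sentence-embedding model instead.

```python
import numpy as np

docs = [
    "contrastive learning for graph recommendation",
    "sequential recommendation with transformers",
    "image segmentation with convolutional networks",
]
query = "graph contrastive recommendation"

# toy "embedding": bag-of-words counts over a shared vocabulary
vocab = sorted({w for text in docs + [query] for w in text.split()})

def embed(text):
    vec = np.zeros(len(vocab))
    for w in text.split():
        vec[vocab.index(w)] += 1.0
    return vec

D = np.stack([embed(d) for d in docs])
q = embed(query)
# cosine similarity between the query and every document
scores = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
best = int(scores.argmax())
```

Dense embeddings improve on this sketch precisely where the abstract claims: they match papers "regardless of variations in wording", which token counts cannot.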

[IR-5] End-to-End Learnable Item Tokenization for Generative Recommendation

链接: https://arxiv.org/abs/2409.05546
作者: Enze Liu,Bowen Zheng,Cheng Ling,Lantao Hu,Han Li,Wayne Xin Zhao
关键词-EN: directly generates item, generates item identifiers, generative, generative recommendation, Recently
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recently, generative recommendation has emerged as a promising new paradigm that directly generates item identifiers for recommendation. However, a key challenge lies in how to effectively construct item identifiers that are suitable for recommender systems. Existing methods typically decouple item tokenization from subsequent generative recommendation training, likely resulting in suboptimal performance. To address this limitation, we propose ETEGRec, a novel End-To-End Generative Recommender by seamlessly integrating item tokenization and generative recommendation. Our framework is developed based on the dual encoder-decoder architecture, which consists of an item tokenizer and a generative recommender. In order to achieve mutual enhancement between the two components, we propose a recommendation-oriented alignment approach by devising two specific optimization objectives: sequence-item alignment and preference-semantic alignment. These two alignment objectives can effectively couple the learning of item tokenizer and generative recommender, thereby fostering the mutual enhancement between the two components. Finally, we further devise an alternating optimization method, to facilitate stable and effective end-to-end learning of the entire framework. Extensive experiments demonstrate the effectiveness of our proposed framework compared to a series of traditional sequential recommendation models and generative recommendation baselines.

[IR-6] RBoard: A Unified Platform for Reproducible and Reusable Recommender System Benchmarks

链接: https://arxiv.org/abs/2409.05526
作者: Tri Kurniawan Wijaya,Edoardo D’Amico,Gabor Fodor,Manuel V. Loureiro
关键词-EN: research lacks standardized, lacks standardized benchmarks, systems research lacks, research lacks, Recommender systems research
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems research lacks standardized benchmarks for reproducibility and algorithm comparisons. We introduce RBoard, a novel framework addressing these challenges by providing a comprehensive platform for benchmarking diverse recommendation tasks, including CTR prediction, Top-N recommendation, and others. RBoard’s primary objective is to enable fully reproducible and reusable experiments across these scenarios. The framework evaluates algorithms across multiple datasets within each task, aggregating results for a holistic performance assessment. It implements standardized evaluation protocols, ensuring consistency and comparability. To facilitate reproducibility, all user-provided code can be easily downloaded and executed, allowing researchers to reliably replicate studies and build upon previous work. By offering a unified platform for rigorous, reproducible evaluation across various recommendation scenarios, RBoard aims to accelerate progress in the field and establish a new standard for recommender systems benchmarking in both academia and industry. The platform is available at this https URL and the demo video can be found at this https URL.

[IR-7] DatAasee – A Metadata-Lake as Metadata Catalog for a Virtual Data-Lake

链接: https://arxiv.org/abs/2409.05512
作者: Christian Himpe
关键词-EN: distributed data sources, ever-growing problem, management for distributed, long-standing but ever-growing, Metadata management
类目: Databases (cs.DB); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Metadata management for distributed data sources is a long-standing but ever-growing problem. To counter this challenge in a research-data and library-oriented setting, this work constructs a data architecture, derived from the data-lake: the metadata-lake. A proof-of-concept implementation of this proposed metadata system is presented and evaluated as well.

[IR-8] Federated Transfer Learning Based Cooperative Wideband Spectrum Sensing with Model Pruning

链接: https://arxiv.org/abs/2409.05462
作者: Jibin Jia,Peihao Dong,Fuhui Zhou,Qihui Wu
关键词-EN: wideband spectrum sensing, wireless communication systems, empowers secondary users, high-rate wireless communication, wideband spectrum
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:For ultra-wideband and high-rate wireless communication systems, wideband spectrum sensing (WSS) is critical, since it empowers secondary users (SUs) to capture spectrum holes for opportunistic transmission. However, WSS encounters challenges such as excessive hardware and computation costs due to the high sampling rate, as well as robustness issues arising from scenario mismatch. In this paper, a WSS neural network (WSSNet) is proposed that exploits multicoset preprocessing to enable sub-Nyquist sampling, with the two-dimensional convolution design specifically tailored to work with the preprocessed samples. A federated transfer learning (FTL) based framework mobilizing multiple SUs is further developed to achieve a robust model adaptable to various scenarios, enabled by selective weight pruning for fast model adaptation and inference. Simulation results demonstrate that the proposed FTL-WSSNet achieves fairly good performance in different target scenarios even without local adaptation samples.
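
摘要未给出剪枝规则的具体形式;下面是一个常见的按权重幅值剪枝的示意(函数名 `prune_weights` 与 50% 的剪枝比例均为示例性假设,并非作者的实际做法):

```python
# Magnitude-based weight pruning: zero out the fraction `ratio` of weights
# with the smallest absolute values, keeping the rest for fast adaptation.
def prune_weights(weights, ratio):
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    k = int(len(weights) * ratio)          # number of weights to drop
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])               # indices of the smallest magnitudes
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = prune_weights(w, 0.5)             # keep the 3 largest-magnitude weights
```

实际系统中剪枝通常按层、按张量进行并配合微调,这里仅示意"保留大幅值、置零小幅值"的基本思路。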

[IR-9] Recommender Systems Algorithm Selection for Ranking Prediction on Implicit Feedback Datasets

链接: https://arxiv.org/abs/2409.05461
作者: Lukas Wegmeth,Tobias Vente,Joeran Beel
关键词-EN: recommender systems, recommender systems algorithm, implicit feedback datasets, systems algorithm selection, algorithm selection problem
类目: Information Retrieval (cs.IR)
备注: Accepted for presentation at the 18th ACM Conference on Recommender Systems in the Late-Breaking Results Track

点击查看摘要

Abstract:The recommender systems algorithm selection problem for ranking prediction on implicit feedback datasets is under-explored. Traditional approaches in recommender systems algorithm selection focus predominantly on rating prediction on explicit feedback datasets, leaving a research gap for ranking prediction on implicit feedback datasets. Algorithm selection is a critical challenge for nearly every practitioner in recommender systems. In this work, we take the first steps toward addressing this research gap. We evaluate the NDCG@10 of 24 recommender systems algorithms, each with two hyperparameter configurations, on 72 recommender systems datasets. We train four optimized machine-learning meta-models and one automated machine-learning meta-model with three different settings on the resulting meta-dataset. Our results show that the predictions of all tested meta-models exhibit a median Spearman correlation ranging from 0.857 to 0.918 with the ground truth. We show that the median Spearman correlation between meta-model predictions and the ground truth increases by an average of 0.124 when the meta-model is optimized to predict the ranking of algorithms instead of their performance. Furthermore, in terms of predicting the best algorithm for an unknown dataset, we demonstrate that the best optimized traditional meta-model, e.g., XGBoost, achieves a recall of 48.6%, outperforming the best tested automated machine learning meta-model, e.g., AutoGluon, which achieves a recall of 47.2%.
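
文中报告的元模型预测排名与真实排名之间的 Spearman 相关系数,可按如下方式计算(纯 Python 的通用示意,采用平均秩处理并列,并非论文的评测代码):

```python
def _ranks(values):
    # Average ranks (1-based); tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation computed on the ranks of x and y.
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

完全一致的两个排名给出 1.0,完全相反给出 -1.0;实践中可直接使用 `scipy.stats.spearmanr`。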

[IR-10] Replicability Measures for Longitudinal Information Retrieval Evaluation

链接: https://arxiv.org/abs/2409.05417
作者: Jüri Keller,Timo Breuer,Philipp Schaer
关键词-EN: exposed to constant, Information Retrieval, Information, test collection, effectiveness
类目: Information Retrieval (cs.IR)
备注: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9-12, 2024, Proceedings. arXiv admin note: text overlap with arXiv:2308.10549

点击查看摘要

Abstract:Information Retrieval (IR) systems are exposed to constant changes in most components. Documents are created, updated, or deleted, the information needs are changing, and even relevance might not be static. While it is generally expected that the IR systems retain a consistent utility for the users, test collection evaluations rely on a fixed experimental setup. Based on the LongEval shared task and test collection, this work explores how the effectiveness measured in evolving experiments can be assessed. Specifically, the persistency of effectiveness is investigated as a replicability task. It is observed how the effectiveness progressively deteriorates over time compared to the initial measurement. Employing adapted replicability measures provides further insight into the persistence of effectiveness. The ranking of systems varies across retrieval measures and time. In conclusion, it was found that the most effective systems are not necessarily the ones with the most persistent performance.

[IR-11] A Survey of Multimodal Composite Editing and Retrieval

链接: https://arxiv.org/abs/2409.05405
作者: Suyan Li,Fuxiang Huang,Lei Zhang
关键词-EN: improve retrieval systems, Multimodal composite retrieval, composite retrieval, real world, focus of research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: 22 pages, 3 figures, and 11 tables

点击查看摘要

Abstract:In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, image, and audio to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. In this survey, we systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in the large-model era, and surveys on multimodal learning and on vision-language models with transformers have already been published in the PAMI journal. To the best of our knowledge, this survey is the first comprehensive review of the literature on multimodal composite retrieval, a timely complement on multimodal fusion to the existing reviews. To help readers quickly track this field, we build a project page for this survey, which can be found at this https URL.

[IR-12] NLLB-E5: A Scalable Multilingual Retrieval Model

链接: https://arxiv.org/abs/2409.05401
作者: Arkadeep Acharya,Rudra Murthy,Vishwajeet Kumar,Jaydeep Sen
关键词-EN: effectively supporting multiple, remains a critical, significant progress, capable of effectively, effectively supporting
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite significant progress in multilingual information retrieval, the lack of models capable of effectively supporting multiple languages, particularly low-resource ones such as Indic languages, remains a critical challenge. This paper presents NLLB-E5: A Scalable Multilingual Retrieval Model. NLLB-E5 leverages the in-built multilingual capabilities in the NLLB encoder for translation tasks. It proposes a distillation approach from the multilingual retriever E5 to provide a zero-shot retrieval approach handling multiple languages, including all major Indic languages, without requiring multilingual training data. We evaluate the model on a comprehensive suite of existing benchmarks, including Hindi-BEIR, highlighting its robust performance across diverse languages and tasks. Our findings uncover task and domain-specific challenges, providing valuable insights into the retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and language-agnostic text retrieval model, advancing the field of multilingual information access and promoting digital inclusivity for millions of users globally.

[IR-13] OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs

链接: https://arxiv.org/abs/2409.05152
作者: Jintian Zhang,Cheng Peng,Mengshu Sun,Xiang Chen,Lei Liang,Zhiqiang Zhang,Jun Zhou,Huajun Chen,Ningyu Zhang
关键词-EN: Large Language Models, Language Models, Large Language, advancements in Large, directly handling retrieval
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Work in progress; code is available at this https URL

点击查看摘要

Abstract:Despite the recent advancements in Large Language Models (LLMs), which have significantly enhanced the generative capabilities for various NLP tasks, LLMs still face limitations in directly handling retrieval tasks. However, many practical applications demand the seamless integration of both retrieval and generation. This paper introduces a novel and efficient One-pass Generation and retrieval framework (OneGen), designed to improve LLMs’ performance on tasks that require both generation and retrieval. The proposed framework bridges the traditionally separate training approaches for generation and retrieval by incorporating retrieval tokens generated autoregressively. This enables a single LLM to handle both tasks simultaneously in a unified forward pass. We conduct experiments on two distinct types of composite tasks, RAG and Entity Linking, to validate the pluggability, effectiveness, and efficiency of OneGen in training and inference. Furthermore, our results show that integrating generation and retrieval within the same context preserves the generative capabilities of LLMs while improving retrieval performance. To the best of our knowledge, OneGen is the first to enable LLMs to conduct vector retrieval during generation.

[IR-14] A Survey on Diffusion Models for Recommender Systems

链接: https://arxiv.org/abs/2409.05033
作者: Jianghao Lin,Jiaqi Liu,Jiachen Zhu,Yunjia Xi,Chengkai Liu,Yangtian Zhang,Yong Yu,Weinan Zhang
关键词-EN: inadequate collaborative signals, made significant strides, limited generalization performance, generalization performance caused, diffusion models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:While traditional recommendation techniques have made significant strides in the past decades, they still suffer from limited generalization performance caused by factors like inadequate collaborative signals, weak latent representations, and noisy data. In response, diffusion models (DMs) have emerged as promising solutions for recommender systems due to their robust generative capabilities, solid theoretical foundations, and improved training stability. To this end, in this paper, we present the first comprehensive survey on diffusion models for recommendation, and draw a bird’s-eye view from the perspective of the whole pipeline in real-world recommender systems. We systematically categorize existing research works into three primary domains: (1) diffusion for data engineering and encoding, focusing on data augmentation and representation enhancement; (2) diffusion as recommender models, employing diffusion models to directly estimate user preferences and rank items; and (3) diffusion for content presentation, utilizing diffusion models to generate personalized content such as fashion and advertisement creatives. Our taxonomy highlights the unique strengths of diffusion models in capturing complex data distributions and generating high-quality, diverse samples that closely align with user preferences. We also summarize the core characteristics of adapting diffusion models for recommendation, and further identify key areas for future exploration, which helps establish a roadmap for researchers and practitioners seeking to advance recommender systems through the innovative application of diffusion models. To further facilitate the research community of recommender systems based on diffusion models, we actively maintain a GitHub repository for papers and other related resources in this rising direction at this https URL.

[IR-15] Sequential Recommendation via Adaptive Robust Attention with Multi-dimensional Embeddings

链接: https://arxiv.org/abs/2409.05022
作者: Linsey Pang,Amir Hossein Raffiee,Wei Liu,Keld Lundgaard
关键词-EN: Abstract, Sequential recommendation, Sequential, achieved, significant accuracy boost
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sequential recommendation models have achieved state-of-the-art performance using the self-attention mechanism. It has since been found that moving beyond only using item ID and positional embeddings leads to a significant accuracy boost when predicting the next item. In recent literature, it was reported that a multi-dimensional kernel embedding with temporal contextual kernels to capture users’ diverse behavioral patterns results in a substantial performance improvement. In this study, we further improve the sequential recommender model’s robustness and generalization by introducing a mix-attention mechanism with layer-wise noise injection (LNI) regularization. We refer to our proposed model as the adaptive robust sequential recommendation framework (ADRRec), and demonstrate through extensive experiments that our model outperforms existing self-attention architectures.

[IR-16] Incorporate LLMs with Influential Recommender System

链接: https://arxiv.org/abs/2409.04827
作者: Mingze Wang,Shuxian Bi,Wenjie Wang,Chongming Gao,Yangyang Li,Fuli Feng
关键词-EN: achieved increasing accuracy, Recommender systems, proactive recommender systems, achieved increasing, increasing accuracy
类目: Information Retrieval (cs.IR)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Recommender systems have achieved increasing accuracy over the years. However, this precision often leads users to narrow their interests, resulting in issues such as limited diversity and the creation of echo chambers. Current research addresses these challenges through proactive recommender systems by recommending a sequence of items (called an influence path) to guide user interest toward the target item. However, existing methods struggle to construct a coherent influence path that builds up with items the user is likely to enjoy. In this paper, we leverage the exceptional path-planning and instruction-following ability of Large Language Models (LLMs), introducing a novel approach named LLM-based Influence Path Planning (LLM-IPP). Our approach maintains coherence between consecutive recommendations and enhances user acceptability of the recommended items. To evaluate LLM-IPP, we implement various user simulators and metrics to measure user acceptability and path coherence. Experimental results demonstrate that LLM-IPP significantly outperforms traditional proactive recommender systems. This study pioneers the integration of LLMs into proactive recommender systems, offering a reliable and user-engaging methodology for future recommendation technologies.

[IR-17] Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation

链接: https://arxiv.org/abs/2409.04810
作者: Chengbing Wang,Wentao Shi,Jizhi Zhang,Wenjie Wang,Hang Pan,Fuli Feng
关键词-EN: Recent work, unbiased Recall evaluation, improved recommendation models, recommendation models remarkably, randomly-exposed datasets
类目: Information Retrieval (cs.IR)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Recent work has improved recommendation models remarkably by equipping them with debiasing methods. Due to the unavailability of fully-exposed datasets, most existing approaches resort to randomly-exposed datasets as a proxy for evaluating debiased models, employing traditional evaluation scheme to represent the recommendation performance. However, in this study, we reveal that traditional evaluation scheme is not suitable for randomly-exposed datasets, leading to inconsistency between the Recall performance obtained using randomly-exposed datasets and that obtained using fully-exposed datasets. Such inconsistency indicates the potential unreliability of experiment conclusions on previous debiasing techniques and calls for unbiased Recall evaluation using randomly-exposed datasets. To bridge the gap, we propose the Unbiased Recall Evaluation (URE) scheme, which adjusts the utilization of randomly-exposed datasets to unbiasedly estimate the true Recall performance on fully-exposed datasets. We provide theoretical evidence to demonstrate the rationality of URE and perform extensive experiments on real-world datasets to validate its soundness.
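
作为参照,论文指出在随机曝光数据集上直接套用会产生偏差的传统 Recall@K 计算方式可示意如下(URE 修正方案本身此处不做复现;`recall_at_k` 为示例性命名):

```python
def recall_at_k(ranked_items, relevant_items, k):
    # Fraction of the user's relevant items that appear in the top-k list.
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

# A model ranking items 1..5 for a user whose true relevant set is {2, 4, 9}:
score = recall_at_k([1, 2, 3, 4, 5], {2, 4, 9}, k=3)  # 1 hit out of 3 relevant
```

论文的核心论点正是:当 `relevant_items` 来自随机曝光而非全量曝光的数据时,这一朴素估计与全量曝光下的真实 Recall 并不一致。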

[IR-18] Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

链接: https://arxiv.org/abs/2409.04701
作者: Michael Günther,Isabelle Mohr,Bo Wang,Han Xiao
关键词-EN: cases require retrieving, require retrieving smaller, retrieving smaller portions, shorter text segments, dense vector-based retrieval
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 4 pages, early draft

点击查看摘要

Abstract:Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be “over-compressed” in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in suboptimal representations. In this paper, we introduce a novel method called “late chunking,” which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks without the need for additional training. Moreover, our method is generic enough to be applied to any long-context embedding model.
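
Late chunking 的核心机制——先整篇编码、再按块对 token 向量做均值池化——可示意如下;下面的 token 向量是占位数据,实际场景中应来自长上下文 transformer 的输出:

```python
def late_chunk(token_embeddings, boundaries):
    # Mean-pool contiguous token spans [start, end) into one vector per chunk.
    # Each token embedding has already "seen" the whole document during encoding,
    # so the pooled chunk vectors retain cross-chunk context.
    chunks = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        dim = len(span[0])
        chunks.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
    return chunks

# Four 2-d token embeddings split into two chunks of two tokens each.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunk_vecs = late_chunk(tokens, [(0, 2), (2, 4)])
```

与之相对,传统做法是先切块、再分别编码,池化发生在各块独立的上下文里。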

[IR-19] QueryBuilder: Human-in-the-Loop Query Development for Information Retrieval

链接: https://arxiv.org/abs/2409.04667
作者: Hemanth Kandula,Damianos Karakos,Haoling Qiu,Benjamin Rozonoyer,Ian Soboroff,Lee Tarlin,Bonan Min
关键词-EN: define finer-grained queries, finer-grained queries covering, cross-lingual information retrieval, Information Retrieval, important aspects
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Frequently, users of an Information Retrieval (IR) system start with an overarching information need (a.k.a., an analytic task) and proceed to define finer-grained queries covering various important aspects (i.e., sub-topics) of that analytic task. We present a novel, interactive system called QueryBuilder, which allows a novice, English-speaking user to create queries with a small amount of effort, through efficient exploration of an English development corpus in order to rapidly develop cross-lingual information retrieval queries corresponding to the user’s information needs. QueryBuilder performs near real-time retrieval of documents based on user-entered search terms; the user looks through the retrieved documents and marks sentences as relevant to the information needed. The marked sentences are used by the system as additional information in query formation and refinement: query terms (and, optionally, event features, which capture event ‘triggers’ (indicator terms) and agent/patient roles) are appropriately weighted, and a neural-based system, which better captures textual meaning, retrieves other relevant content. The process of retrieval and marking is repeated as many times as desired, giving rise to increasingly refined queries in each iteration. The final product is a fine-grained query used in Cross-Lingual Information Retrieval (CLIR). Our experiments using analytic tasks and requests from the IARPA BETTER IR datasets show that with a small amount of effort (at most 10 minutes per sub-topic), novice users can form useful fine-grained queries including in languages they don’t understand. QueryBuilder also provides beneficial capabilities to the traditional corpus exploration and query formation process. A demonstration video is released at this https URL
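
摘要未给出具体的加权方案;经典的 Rocchio 式相关反馈——用用户标记的相关句子中的词项来提升查询向量的权重——可以传达这一思路(其中 `alpha`、`beta` 以及按空白切词均为示例性假设):

```python
from collections import Counter

def refine_query(query_terms, relevant_sentences, alpha=1.0, beta=0.75):
    # Rocchio-style update: original query terms keep weight alpha; terms from
    # user-marked relevant sentences are added with weight beta * average tf.
    weights = Counter({t: alpha for t in query_terms})
    if relevant_sentences:
        feedback = Counter()
        for sent in relevant_sentences:
            feedback.update(sent.lower().split())
        for term, tf in feedback.items():
            weights[term] += beta * tf / len(relevant_sentences)
    return dict(weights)

q = refine_query(["flood", "damage"],
                 ["flood damage in coastal towns", "insurance claims after flood"])
```

每轮"检索—标注"迭代即可在上一轮权重的基础上再次调用该更新,得到逐步细化的查询。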

[IR-20] Preserving Individuality while Following the Crowd: Understanding the Role of User Taste and Crowd Wisdom in Online Product Rating Prediction

链接: https://arxiv.org/abs/2409.04649
作者: Liang Wang,Shubham Jain,Yingtong Dou,Junpeng Wang,Chin-Chia Michael Yeh,Yujie Fan,Prince Aboagye,Yan Zheng,Xin Dai,Zhongfang Zhuang,Uday Singh Saini,Wei Zhang
关键词-EN: remains largely unexplored, score remains largely, Numerous algorithms, final prediction score, prediction score remains
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
备注: Preprint

点击查看摘要

Abstract:Numerous algorithms have been developed for online product rating prediction, but the specific influence of user and product information in determining the final prediction score remains largely unexplored. Existing research often relies on narrowly defined data settings, which overlooks real-world challenges such as the cold-start problem, cross-category information utilization, and scalability and deployment issues. To delve deeper into these aspects, and particularly to uncover the roles of individual user taste and collective wisdom, we propose a unique and practical approach that emphasizes historical ratings at both the user and product levels, encapsulated using a continuously updated dynamic tree representation. This representation effectively captures the temporal dynamics of users and products, leverages user information across product categories, and provides a natural solution to the cold-start problem. Furthermore, we have developed an efficient data processing strategy that makes this approach highly scalable and easily deployable. Comprehensive experiments in real industry settings demonstrate the effectiveness of our approach. Notably, our findings reveal that individual taste dominates over collective wisdom in online product rating prediction, a perspective that contrasts with the commonly observed wisdom of the crowd phenomenon in other domains. This dominance of individual user taste is consistent across various model types, including the boosting tree model, recurrent neural network (RNN), and transformer-based architectures. This observation holds true across the overall population, within individual product categories, and in cold-start scenarios. Our findings underscore the significance of individual user tastes in the context of online product rating prediction and the robustness of our approach across different model architectures.
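
摘要未展开动态树表示的细节,但其背后的基本信号——将用户自身的历史均分("个人口味")与商品的历史均分("群体智慧")加权融合——可示意如下;融合权重 `w` 为示例性假设,按论文结论向用户侧倾斜:

```python
def predict_rating(user_history, product_history, w=0.7, default=3.0):
    # Blend individual taste (user's mean rating) with crowd wisdom
    # (product's mean rating); fall back to a default for cold-start cases.
    user_mean = sum(user_history) / len(user_history) if user_history else default
    prod_mean = sum(product_history) / len(product_history) if product_history else default
    return w * user_mean + (1 - w) * prod_mean

# A generous user (mean ~4.67) rating a product others rate low (mean 2.5):
pred = predict_rating([5, 4, 5], [2, 3, 2, 3])
```

当某一侧历史为空时回退到默认分,即是冷启动问题的一种最朴素处理。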

[IR-21] A Unified Framework for Cross-Domain Recommendation

链接: https://arxiv.org/abs/2409.04540
作者: Jiangxia Cao,Shen Wang,Gaode Chen,Rui Huang,Shuang Yang,Zhaojie Liu,Guorui Zhou
关键词-EN: domain-expert recommender systems, Cross-Domain Recommendation, recommender systems, promising methodology, addressing the persistent
类目: Information Retrieval (cs.IR)
备注: Work in progress

点击查看摘要

Abstract:In addressing the persistent challenges of data-sparsity and cold-start issues in domain-expert recommender systems, Cross-Domain Recommendation (CDR) emerges as a promising methodology. CDR aims at enhancing prediction performance in the target domain by leveraging interaction knowledge from related source domains, particularly through users or items that span across multiple domains (e.g., Short-Video and Living-Room). For academic research purposes, there are a number of distinct aspects that guide CDR method designing, including the auxiliary domain number, domain-overlapped element, user-item interaction types, and downstream tasks. With so many different CDR combination scenario settings, the proposed scenario-expert approaches are tailored to address a specific vertical CDR scenario, and often lack the capacity to adapt to multiple horizontal scenarios. In an effort to coherently adapt to various scenarios, and drawing inspiration from the concept of domain-invariant transfer learning, we extend the former SOTA model UniCDR in five different aspects, named UniCDR+. Our work was successfully deployed on the Kuaishou Living-Room RecSys.

附件下载

点击下载今日全部论文列表