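This post lists the latest papers fetched from arXiv.org on 2024-09-09. The list updates automatically and is grouped into five areas: NLP, CV, ML, AI, and IR. To receive it by scheduled email, leave your email address in the comments.

Note: the daily paper data comes from arXiv.org and is refreshed automatically at about 10:30 each morning.

Tip: if you would like the daily paper data by email, leave your address in the comments; the email also goes out automatically at about 10:30 each day.

Contents

Overview (2024-09-09)

299 papers were updated today, including:

  • Natural language processing: 34 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 65 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 81 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 85 papers (Machine Learning (cs.LG))

Natural Language Processing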

[NLP-0] RLPF: Reinforcement Learning from Prediction Feedback for User Summarization with LLMs
[NLP-0] RLPF:基于预测反馈的强化学习,用于LLM用户摘要生成

链接: https://arxiv.org/abs/2409.04421
作者: Jiaxing Wu,Lin Ning,Luyang Liu,Harrison Lee,Neo Wu,Chao Wang,Sushant Prakash,Shawn O’Banion,Bradley Green,Jun Xie
关键词-EN: Large Language Models, employ Large Language, Language Models, Large Language, systems employ Large
关键词-ZH: 大型语言模型,采用大型语言,语言模型,大型语言,系统采用大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:LLM-powered personalization agent systems employ Large Language Models (LLMs) to predict users’ behavior from their past activities. However, their effectiveness often hinges on the ability to effectively leverage extensive, long user historical data due to its inherent noise and length of such data. Existing pretrained LLMs may generate summaries that are concise but lack the necessary context for downstream tasks, hindering their utility in personalization systems. To address these challenges, we introduce Reinforcement Learning from Prediction Feedback (RLPF). RLPF fine-tunes LLMs to generate concise, human-readable user summaries that are optimized for downstream task performance. By maximizing the usefulness of the generated summaries, RLPF effectively distills extensive user history data while preserving essential information for downstream tasks. Our empirical evaluation demonstrates significant improvements in both extrinsic downstream task utility and intrinsic summary quality, surpassing baseline methods by up to 22% on downstream task performance and achieving an up to 84.59% win rate on Factuality, Abstractiveness, and Readability. RLPF also achieves a remarkable 74% reduction in context length while improving performance on 16 out of 19 unseen tasks and/or datasets, showcasing its generalizability. This approach offers a promising solution for enhancing LLM personalization by effectively transforming long, noisy user histories into informative and human-readable representations.
摘要:基于LLM的个性化代理系统使用大型语言模型(LLM)根据用户过去的活动预测其行为。然而,由于用户历史数据本身噪声大、篇幅长,这类系统的有效性往往取决于能否有效利用大量、冗长的用户历史数据。现有的预训练LLM可能会生成简明的摘要,但缺乏下游任务的必要上下文,从而阻碍了它们在个性化系统中的应用。为了应对这些挑战,我们引入了基于预测反馈的强化学习(RLPF)。RLPF对LLM进行微调,以生成针对下游任务性能进行优化的简明、人类可读的用户摘要。通过最大化生成的摘要的有用性,RLPF有效地提炼了大量的用户历史数据,同时保留了下游任务的基本信息。我们的实证评估表明,在外在的下游任务效用和内在的摘要质量方面都有显著改善,下游任务性能比基线方法最多高出22%,在事实性、抽象性和可读性方面的胜率最高达84.59%。RLPF还显著减少了74%的上下文长度,同时提高了19个未见任务和/或数据集中的16个的性能,显示了其通用性。这种方法通过有效地将冗长、嘈杂的用户历史转换为信息丰富且人类可读的表示形式,为增强LLM个性化提供了一种很有前途的解决方案。

[NLP-1] Empirical Bayesian image restoration by Langevin sampling with a denoising diffusion implicit prior
[NLP-1] 通过带去噪扩散隐先验的Langevin采样进行经验Bayesian图像恢复

链接: https://arxiv.org/abs/2409.04384
作者: Charlesquin Kemajou Mbakam,Jean-Francois Giovannelli,Marcelo Pereyra
关键词-EN: Score-based diffusion methods, Score-based diffusion, pre-trained foundational prior, diffusion methods provide, provide a powerful
关键词-ZH: 基于分数的扩散方法,基于分数的扩散,预先训练的基础先验,扩散方法提供,提供了强大的
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
备注: 24 pages

Abstract:Score-based diffusion methods provide a powerful strategy to solve image restoration tasks by flexibly combining a pre-trained foundational prior model with a likelihood function specified during test time. Such methods are predominantly derived from two stochastic processes: reversing Ornstein-Uhlenbeck, which underpins the celebrated denoising diffusion probabilistic models (DDPM) and denoising diffusion implicit models (DDIM), and the Langevin diffusion process. The solutions delivered by DDPM and DDIM are often remarkably realistic, but they are not always consistent with measurements because of likelihood intractability issues and the associated required approximations. Alternatively, using a Langevin process circumvents the intractable likelihood issue, but usually leads to restoration results of inferior quality and longer computing times. This paper presents a novel and highly computationally efficient image restoration method that carefully embeds a foundational DDPM denoiser within an empirical Bayesian Langevin algorithm, which jointly calibrates key model hyper-parameters as it estimates the model’s posterior mean. Extensive experimental results on three canonical tasks (image deblurring, super-resolution, and inpainting) demonstrate that the proposed approach improves on state-of-the-art strategies both in image estimation accuracy and computing time.
摘要:基于分数的扩散方法通过灵活地将预先训练的基础先验模型与测试时指定的似然函数相结合,为解决图像恢复任务提供了一种强有力的策略。这些方法主要来源于两个随机过程:逆Ornstein-Uhlenbeck过程和朗之万扩散过程,前者是著名的去噪扩散概率模型(DDPM)和去噪扩散隐式模型(DDIM)的基础。DDPM和DDIM给出的结果通常非常逼真,但它们并不总是与测量结果一致,因为存在似然难以处理的问题以及由此所需的近似。另一种做法是使用朗之万过程,它避免了似然难以处理的问题,但通常会导致质量较差的恢复结果和较长的计算时间。本文提出了一种新颖且计算效率很高的图像恢复方法,该方法在经验贝叶斯朗之万算法中精心嵌入了一个基础DDPM去噪器,该算法在估计模型的后验均值时联合校准关键的模型超参数。在三种典型任务(图像去模糊、超分辨率和修复)上的大量实验结果表明,该方法在图像估计精度和计算时间方面都优于现有的最先进方法。
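
For orientation, a generic unadjusted Langevin iteration for sampling a posterior p(x | y) ∝ p(y | x) p(x) can be written as below. This is a textbook sketch, not the paper's exact scheme: the step size δ, the noise level σ, and the denoiser D_σ are notation introduced here, and the abstract does not say how the DDPM denoiser or the hyper-parameter calibration enter the update.

```latex
% Unadjusted Langevin step targeting the (smoothed) posterior
x_{k+1} = x_k
        + \delta\,\nabla_x \log p(y \mid x_k)
        + \delta\,\nabla_x \log p_\sigma(x_k)
        + \sqrt{2\delta}\,\xi_{k+1},
\qquad \xi_{k+1} \sim \mathcal{N}(0, I)

% Tweedie's identity: the score of the Gaussian-smoothed prior p_\sigma
% expressed through an MMSE denoiser D_\sigma
\nabla_x \log p_\sigma(x) = \frac{D_\sigma(x) - x}{\sigma^2}
```

Each step only needs the gradient of the log-likelihood at the current iterate, which is available in closed form for linear Gaussian observation models.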

[NLP-2] AGR: Age Group fairness Reward for Bias Mitigation in LLMs
[NLP-2] AGR:LLM中缓解偏见的年龄组公平奖励

链接: https://arxiv.org/abs/2409.04340
作者: Shuirong Cao,Ruoxi Cheng,Zhiqiang Wang
关键词-EN: exhibit age biases, resulting in unequal, unequal treatment, treatment of individuals, age
关键词-ZH: 表现出年龄偏见,导致不平等、不平等的待遇、个人待遇、年龄
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: The first two authors contributed equally to this work. Corresponding to Zhiqiang Wang. ACKNOWLEDGMENT: we would like to thank the computing resources support from the State Key Laboratory of New Computer Software Technologies at Nanjing University

Abstract:LLMs can exhibit age biases, resulting in unequal treatment of individuals across age groups. While much research has addressed racial and gender biases, age bias remains little explored. The scarcity of instruction-tuning and preference datasets for age bias hampers its detection and measurement, and existing fine-tuning methods seldom address age-related fairness. In this paper, we construct age bias preference datasets and instruction-tuning datasets for RLHF. We introduce AGR, an age fairness reward to reduce differences in the response quality of LLMs across different age groups. Extensive experiments demonstrate that this reward significantly improves response accuracy and reduces performance disparities across age groups. Our source code and datasets are available at the anonymous link https://anonymous.4open.science/r/FairRLHF-D445/readme.md.
摘要:LLM可能会表现出年龄偏见,导致不同年龄段的个人受到不平等的待遇。虽然许多研究都探讨了种族和性别偏见,但年龄偏见仍然很少被探讨。针对年龄偏见的指令微调数据集和偏好数据集的稀缺阻碍了其检测和测量,而现有的微调方法很少解决与年龄相关的公平性。本文中,我们为RLHF构建了年龄偏见偏好数据集和指令微调数据集。我们引入AGR,这是一种年龄公平奖励,以减少LLM在不同年龄组之间响应质量的差异。大量实验表明,这种奖励显著提高了响应准确性,并减少了不同年龄组之间的性能差异。我们的源代码和数据集可在匿名链接 https://anonymous.4open.science/r/FairRLHF-D445/readme.md 上获取。

[NLP-3] Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs
[NLP-3] 学习与检索:上下文示例在LLM回归中的作用

链接: https://arxiv.org/abs/2409.04318
作者: Aliakbar Nafar,Kristen Brent Venable,Parisa Kordjamshidi
关键词-EN: Generative Large Language, Large Language Models, Generative Large, Large Language, Language Models
关键词-ZH: 生成式大型语言、大型语言模型、生成式大型、大型语言、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Generative Large Language Models (LLMs) are capable of being in-context learners. However, the underlying mechanism of in-context learning (ICL) is still a major research question, and experimental research results about how models exploit ICL are not always consistent. In this work, we propose a framework for evaluating in-context learning mechanisms, which we claim are a combination of retrieving internal knowledge and learning from in-context examples by focusing on regression tasks. First, we show that LLMs can perform regression on real-world datasets and then design experiments to measure the extent to which the LLM retrieves its internal knowledge versus learning from in-context examples. We argue that this process lies on a spectrum between these two extremes. We provide an in-depth analysis of the degrees to which these mechanisms are triggered depending on various factors, such as prior knowledge about the tasks and the type and richness of the information provided by the in-context examples. We employ three LLMs and utilize multiple datasets to corroborate the robustness of our findings. Our results shed light on how to engineer prompts to leverage meta-learning from in-context examples and foster knowledge retrieval depending on the problem being addressed.
摘要:生成性大语言模型能够作为语境中的学习者。然而,情境学习(ICL)的潜在机制仍然是一个主要的研究问题,关于模型如何利用ICL的实验研究结果并不总是一致的。在这项工作中,我们提出了一个评估情境中学习机制的框架,我们认为这是检索内部知识和通过关注回归任务从情境中的例子中学习的组合。首先,我们证明了LLM可以对真实世界的数据集执行回归,然后设计实验来衡量LLM检索其内部知识的程度,而不是从上下文中的示例中学习。我们认为,这一过程介于这两个极端之间。我们深入分析了这些机制被触发的程度取决于各种因素,例如关于任务的先验知识以及由上下文中的例子提供的信息的类型和丰富程度。我们使用了三个LLM并利用多个数据集来证实我们的发现的健壮性。我们的结果揭示了如何设计提示,以利用来自上下文中的示例的元学习,并根据所解决的问题促进知识检索。

[NLP-4] Using Large Language Models to Generate Authentic Multi-agent Knowledge Work Datasets
[NLP-4] 使用大型语言模型生成真实的多智能体知识工作数据集

链接: https://arxiv.org/abs/2409.04286
作者: Desiree Heim,Christian Jilek,Adrian Ulges,Andreas Dengel
关键词-EN: collections lack diversity, Current publicly, data collections lack, knowledge work, work data collections
关键词-ZH: 收藏缺乏多样性,当前公开,缺乏数据收藏,知识工作,工作数据收藏
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted and in press (INFORMATIK Festival, Wiesbaden, 2024)

Abstract:Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents. These issues hinder objective and comparable data-driven evaluations and optimizations of knowledge work assistance systems. Due to the considerable resources needed to collect such data in real-life settings and the necessity of data censorship, collecting such a dataset appears nearly impossible. For this reason, we propose a configurable, multi-agent knowledge work dataset generator. This system simulates collaborative knowledge work among agents producing Large Language Model-generated documents and accompanying data traces. Additionally, the generator captures all background information, given in its configuration or created during the simulation process, in a knowledge graph. Finally, the resulting dataset can be utilized and shared without privacy or confidentiality concerns. This paper introduces our approach’s design and vision and focuses on generating authentic knowledge work documents using Large Language Models. Our study involving human raters who assessed 53% of the generated and 74% of the real documents as realistic demonstrates the potential of our approach. Furthermore, we analyze the authenticity criteria mentioned in the participants’ comments and elaborate on potential improvements for identified common issues.
摘要:当前公开可用的知识工作数据集缺乏多样性、广泛的注释和关于用户及其文档的上下文信息。这些问题妨碍了对知识工作辅助系统进行客观、可比较的数据驱动评估和优化。由于在现实环境中收集这些数据需要相当多的资源,而且必须对数据进行审查,收集这样的数据集似乎几乎是不可能的。为此,我们提出了一个可配置的多智能体知识工作数据集生成器。该系统模拟智能体之间的协作知识工作,产出由大型语言模型生成的文档和伴随的数据痕迹。此外,生成器在知识图谱中捕获在其配置中给出的或在模拟过程中创建的所有背景信息。最后,生成的数据集可以在没有隐私或机密性问题的情况下使用和共享。本文介绍了我们的方法的设计和愿景,并专注于使用大型语言模型生成真实可信的知识工作文档。在我们的研究中,人类评分者将53%的生成文档和74%的真实文档评估为真实可信,这证明了我们方法的潜力。此外,我们分析了参与者评论中提到的真实性标准,并阐述了针对已发现的常见问题的潜在改进。

[NLP-5] Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak
[NLP-5] 开放语言数据计划:推进Karakalpak的低资源机器翻译

链接: https://arxiv.org/abs/2409.04269
作者: Mukhammadsaid Mamasaidov,Abror Shopulatov
关键词-EN: devtest dataset translated, open-sourced fine-tuned neural, fine-tuned neural models, devtest dataset, parallel corpora
关键词-ZH: DevTest数据集翻译、开源微调神经、微调神经模型、DevTest数据集、并行文集
类目: Computation and Language (cs.CL)
备注: Submitted to WMT 2024

Abstract:This study presents several contributions for the Karakalpak language: a FLORES+ devtest dataset translated to Karakalpak, parallel corpora for Uzbek-Karakalpak, Russian-Karakalpak and English-Karakalpak of 100,000 pairs each and open-sourced fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches, demonstrating improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation capabilities for Karakalpak and contribute to expanding linguistic diversity in NLP technologies.
摘要:这项研究提出了对Karakalpak语言的几项贡献:一个FLORES+ DevTest数据集被翻译为Karakalpak,一个乌兹别克-Karakalpak、俄语-Karakalpak和英语-Karakalpak的平行库,每个100,000对,以及用于跨这些语言翻译的开源微调神经模型。我们的实验比较了不同的模型变体和训练方法,展示了对现有基线的改进。这项工作是开放语言数据计划(OLDI)共享任务的一部分,旨在提高Karakalpak的机器翻译能力,并为扩大NLP技术的语言多样性做出贡献。

[NLP-6] An overview of domain-specific foundation model: key technologies, applications and challenges
[NLP-6] 特定领域基础模型概述:关键技术应用与挑战

链接: https://arxiv.org/abs/2409.04267
作者: Haolong Chen,Hanzhi Chen,Zijian Zhao,Kaifeng Han,Guangxu Zhu,Yichen Zhao,Ying Du,Wei Xu,Qingjiang Shi
关键词-EN: human language understanding, domain-specific foundation models, products in human, application scenarios, foundation models
关键词-ZH: 人类语言理解、特定领域的基础模型、人类产品、应用场景、基础模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The impressive performance of ChatGPT and other foundation-model-based products in human language understanding has prompted both academia and industry to explore how these models can be tailored for specific industries and application scenarios. This process, known as the customization of domain-specific foundation models, addresses the limitations of general-purpose models, which may not fully capture the unique patterns and requirements of domain-specific data. Despite its importance, there is a notable lack of comprehensive overview papers on building domain-specific foundation models, while numerous resources exist for general-purpose models. To bridge this gap, this article provides a timely and thorough overview of the methodology for customizing domain-specific foundation models. It introduces basic concepts, outlines the general architecture, and surveys key methods for constructing domain-specific models. Furthermore, the article discusses various domains that can benefit from these specialized models and highlights the challenges ahead. Through this overview, we aim to offer valuable guidance and reference for researchers and practitioners from diverse fields to develop their own customized foundation models.
摘要:ChatGPT和其他基于基础模型的产品在人类语言理解方面的令人印象深刻的表现促使学术界和工业界探索如何为特定的行业和应用场景定制这些模型。这个过程称为特定于域的基础模型的定制,它解决了通用模型的局限性,因为通用模型可能不能完全捕获特定于域的数据的独特模式和需求。尽管它很重要,但在构建特定于领域的基础模型方面明显缺乏全面的概述论文,而通用模型则存在大量资源。为了弥补这一差距,本文及时而全面地概述了定制特定于领域的基础模型的方法。它介绍了基本概念,概述了一般体系结构,并概述了构建特定于领域的模型的关键方法。此外,本文还讨论了可以从这些专门模型中受益的各个领域,并强调了未来的挑战。通过这一概述,我们旨在为来自不同领域的研究人员和实践者开发自己的定制基础模型提供有价值的指导和参考。

[NLP-7] Fast Forwarding Low-Rank Training
[NLP-7] Fast Forward:快进式低秩训练

链接: https://arxiv.org/abs/2409.04206
作者: Adir Rahamim,Naomi Saphra,Sara Kangaslahti,Yonatan Belinkov
关键词-EN: pretrained Language Models, Fast Forward, finetuning pretrained Language, Parameter efficient finetuning, pretrained Language
关键词-ZH: 预训练语言模型、快进、微调预训练语言、参数高效微调、预训练语言
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:Parameter efficient finetuning methods like low-rank adaptation (LoRA) aim to reduce the computational costs of finetuning pretrained Language Models (LMs). Enabled by these low-rank settings, we propose an even more efficient optimization strategy: Fast Forward, a simple and effective approach to accelerate large segments of training. In a Fast Forward stage, we repeat the most recent optimizer step until the loss stops improving on a tiny validation set. By alternating between regular optimization steps and Fast Forward stages, Fast Forward provides up to an 87% reduction in FLOPs and up to an 81% reduction in train time over standard SGD with Adam. We validate Fast Forward by finetuning various models on different tasks and demonstrate that it speeds up training without compromising model performance. Additionally, we analyze when and how to apply Fast Forward.
摘要:低秩自适应(LoRA)等参数高效微调方法旨在降低微调预训练语言模型(LM)的计算成本。借助这些低秩设置,我们提出了一种更高效的优化策略:Fast Forward(快进),这是一种加速大段训练过程的简单有效方法。在快进阶段,我们重复最近一次优化器步骤,直到在一个很小的验证集上损失不再改善。通过在常规优化步骤和快进阶段之间交替进行,与使用Adam的标准SGD相比,Fast Forward可将FLOPs减少高达87%,并将训练时间减少高达81%。我们通过在不同任务上微调各种模型来验证Fast Forward,并证明它可以在不影响模型性能的情况下加快训练速度。此外,我们还分析了何时以及如何应用Fast Forward。
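
The Fast Forward stage described in the abstract (repeat the most recent optimizer step until the loss on a tiny validation set stops improving) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fast_forward`, the `max_repeats` cap, the toy quadratic loss, and the plain gradient step standing in for an Adam update on LoRA parameters are all introduced here.

```python
import numpy as np

def fast_forward(params, last_step, val_loss, max_repeats=100):
    """Re-apply the most recent update while the validation loss keeps improving.

    `last_step` is the parameter delta produced by the latest regular optimizer
    step; `max_repeats` is a hypothetical safety cap, not taken from the paper.
    """
    best = val_loss(params)
    repeats = 0
    while repeats < max_repeats:
        candidate = params + last_step
        loss = val_loss(candidate)
        if loss >= best:          # validation loss stopped improving: end the stage
            break
        params, best = candidate, loss
        repeats += 1
    return params, repeats

# Toy problem: a quadratic "validation loss" with its minimum at `target`.
target = np.array([3.0, -2.0])
val_loss = lambda w: float(np.sum((w - target) ** 2))

w_prev = np.zeros(2)
lr = 0.05
grad = 2 * (w_prev - target)
step = -lr * grad                 # one regular optimization step (plain gradient step)
w = w_prev + step

w_ff, n_repeats = fast_forward(w, step, val_loss)
```

In a full training loop this stage would alternate with blocks of regular optimizer steps. Each repeat costs only a forward pass on the small validation set, with no backward pass, which is consistent with the FLOPs savings the abstract reports.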

[NLP-8] Residual Stream Analysis with Multi-Layer SAEs
[NLP-8] 基于多层SAE的残差流分析

链接: https://arxiv.org/abs/2409.04185
作者: Tim Lawson,Lucy Farnik,Conor Houghton,Laurence Aitchison
关键词-EN: Sparse autoencoders, approach to interpreting, interpreting the internal, internal representations, single SAE
关键词-ZH: 稀疏自编码器、解释方法、解释内部、内部表示、单个SAE
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 16 pages, 12 figures

Abstract:Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, standard SAEs are trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer simultaneously. The residual stream is usually understood as preserving information across layers, so we expected to, and did, find individual SAE features that are active at multiple layers. Interestingly, while a single SAE feature is active at different layers for different prompts, for a single prompt, we find that a single feature is far more likely to be active at a single layer. For larger underlying models, we find that the cosine similarities between adjacent layers in the residual stream are higher, so we expect more features to be active at multiple layers. These results show that MLSAEs are a promising method to study information flow in transformers. We release our code to train and analyze MLSAEs at this https URL.
摘要:稀疏自编码器(SAE)是解释Transformer语言模型内部表示的一种很有前途的方法。然而,标准的SAE是在每个Transformer层上单独训练的,这使得使用它们来研究信息如何跨层流动变得困难。为了解决这个问题,我们引入了多层SAE(MLSAE):即同时在来自每个Transformer层的残差流激活向量上训练的单个SAE。残差流通常被理解为跨层保留信息,因此我们期望并确实找到了在多个层上激活的单个SAE特征。有趣的是,尽管对于不同的提示,单个SAE特征会在不同层激活,但是对于单个提示,我们发现单个特征在单个层激活的可能性要大得多。对于较大的底层模型,我们发现残差流中相邻层之间的余弦相似度较高,因此我们预计会有更多特征在多个层上激活。这些结果表明,MLSAE是研究Transformer中信息流的一种很有前途的方法。我们在此https URL上发布了用于训练和分析MLSAE的代码。
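
The core idea in the abstract, a single SAE that sees residual stream activations from every layer, can be illustrated with a forward pass over pooled activations. This is a rough sketch under stated assumptions: the random activations, all dimensions, and the ReLU encoder are placeholders introduced here, and the abstract does not specify the SAE variant or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model, n_features = 4, 8, 16, 64   # illustrative sizes

# Hypothetical residual stream activations: one (n_tokens, d_model) slab per layer.
resid = rng.normal(size=(n_layers, n_tokens, d_model))

# Multi-layer setup: pool the activation vectors from all layers into one dataset,
# so a single SAE is applied to (and would be trained on) every layer at once.
X = resid.reshape(n_layers * n_tokens, d_model)

W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode to non-negative feature activations, then reconstruct the input."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU encoder (an assumption)
    x_hat = f @ W_dec + b_dec
    return f, x_hat

features, recon = sae_forward(X)

# For each feature, count the layers at which it is active for at least one token.
active = features.reshape(n_layers, n_tokens, n_features) > 0
layers_per_feature = active.any(axis=1).sum(axis=0)
```

Because every layer shares one feature dictionary, `layers_per_feature` directly measures whether a feature is active at one layer or at several, which is the quantity the abstract discusses.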

[NLP-9] GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
[NLP-9] GALLa:图形对齐的大型语言模型,用于改进源代码理解

链接: https://arxiv.org/abs/2409.04183
作者: Ziyin Zhang,Hang Yu,Shijie Li,Peng Di,Jianguo Li,Rui Wang
关键词-EN: Programming languages possess, possess rich semantic, languages possess rich, rich semantic information, Programming languages
关键词-ZH: 编程语言拥有,拥有丰富的语义,语言拥有丰富的语义信息,编程语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Programming languages possess rich semantic information such as data flow that is represented by graphs and not available from the surface form of source code. Recent code language models have scaled to billions of parameters, but model source code solely as text tokens while ignoring any other structural information. Conversely, models that do encode structural information of code make modifications to the Transformer architecture, limiting their scale and compatibility with pretrained LLMs. In this work, we take the best of both worlds with GALLa - Graph Aligned Large Language Model. GALLa utilizes graph neural networks and cross-modal alignment technologies to inject the structural information of code into LLMs as an auxiliary task during finetuning. This framework is both model-agnostic and task-agnostic, as it can be applied to any code LLM for any code downstream task, and requires the structural graph data only at training time from a corpus unrelated to the finetuning data, while incurring no cost at inference time over the baseline LLM. Experiments on five code tasks with four different baseline LLMs ranging in size from 350M to 8B validate the effectiveness of GALLa, demonstrating consistent improvement over the baseline, even for powerful models such as LLaMA3.
摘要:编程语言拥有丰富的语义信息,如用图表示的数据流,这些信息无法从源代码的表层形式中获得。最近的代码语言模型已经扩展到数十亿个参数,但只将源代码建模为文本标记,而忽略了任何其他结构信息。相反,对代码的结构信息进行编码的模型会对Transformer体系结构进行修改,从而限制了它们的规模和与预训练LLM的兼容性。在这项工作中,我们提出GALLa(Graph Aligned Large Language Model,图对齐的大型语言模型),兼取两者之长。GALLa利用图神经网络和跨模态对齐技术将代码的结构信息注入到LLM中,作为微调期间的辅助任务。该框架既与模型无关,也与任务无关,因为它可以应用于任何代码下游任务的任何代码LLM,并且仅在训练时需要来自与微调数据无关的语料库的结构图数据,而在推理时不会产生超过基线LLM的成本。在四个大小从350M到8B的不同基线LLM上的五个代码任务上的实验验证了GALLa的有效性,表明即使是对于LLaMA3这样功能强大的模型,GALLa也比基线有一致的改善。

[NLP-10] Combining LLMs and Knowledge Graphs to Reduce Hallucinations in Question Answering
[NLP-10] 结合LLM与知识图谱以减少问答中的幻觉

链接: https://arxiv.org/abs/2409.04181
作者: Larissa Pusch,Tim O. F. Conrad
关键词-EN: digital information systems, natural language processing, processing have revolutionized, interact with digital, Large Language Models
关键词-ZH: 数字信息系统、自然语言处理、处理发生了革命性的变化,与数字、大型语言模型交互
类目: Computation and Language (cs.CL)
备注:

Abstract:Advancements in natural language processing have revolutionized the way we can interact with digital information systems, such as databases, making them more accessible. However, challenges persist, especially when accuracy is critical, as in the biomedical domain. A key issue is the hallucination problem, where models generate information unsupported by the underlying data, potentially leading to dangerous misinformation. This paper presents a novel approach designed to bridge this gap by combining Large Language Models (LLM) and Knowledge Graphs (KG) to improve the accuracy and reliability of question-answering systems, on the example of a biomedical KG. Built on the LangChain framework, our method incorporates a query checker that ensures the syntactical and semantic validity of LLM-generated queries, which are then used to extract information from a Knowledge Graph, substantially reducing errors like hallucinations. We evaluated the overall performance using a new benchmark dataset of 50 biomedical questions, testing several LLMs, including GPT-4 Turbo and llama3:70b. Our results indicate that while GPT-4 Turbo outperforms other models in generating accurate queries, open-source models like llama3:70b show promise with appropriate prompt engineering. To make this approach accessible, a user-friendly web-based interface has been developed, allowing users to input natural language queries, view generated and corrected Cypher queries, and verify the resulting paths for accuracy. Overall, this hybrid approach effectively addresses common issues such as data gaps and hallucinations, offering a reliable and intuitive solution for question answering systems. The source code for generating the results of this paper and for the user-interface can be found in our Git repository: this https URL
摘要:自然语言处理的进步彻底改变了我们与数据库等数字信息系统交互的方式,使它们更容易访问。然而,挑战依然存在,特别是在准确性至关重要的情况下,如在生物医学领域。一个关键问题是幻觉问题,即模型生成的信息不受底层数据的支持,有可能导致危险的错误信息。本文提出了一种旨在弥合这一差距的新方法,以生物医学知识图谱为例,将大语言模型(LLM)和知识图谱(KG)相结合,以提高问答系统的准确性和可靠性。在LangChain框架的基础上,我们的方法结合了一个查询检查器,该检查器确保LLM生成的查询在语法和语义上的有效性,然后这些查询被用于从知识图谱中提取信息,从而大大减少了幻觉等错误。我们使用一个包含50个生物医学问题的新基准数据集来评估整体性能,测试了几个LLM,包括GPT-4 Turbo和llama3:70b。我们的结果表明,虽然GPT-4 Turbo在生成准确查询方面优于其他模型,但像llama3:70b这样的开源模型通过适当的提示工程表现出了良好的前景。为了使这种方法易于使用,我们开发了一个用户友好的基于网络的界面,允许用户输入自然语言查询,查看生成和更正的Cypher查询,并验证结果路径的准确性。总体而言,这种混合方法有效地解决了数据缺口和幻觉等常见问题,为问答系统提供了可靠和直观的解决方案。生成本文结果和用户界面的源代码可以在我们的Git存储库中找到:此https URL

[NLP-11] From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
[NLP-11] 从计算到裁决:考察数学推理任务中的LLM评判者

链接: https://arxiv.org/abs/2409.04168
作者: Andreas Stephan,Dawei Zhu,Matthias Aßenmacher,Xiaoyu Shen,Benjamin Roth
关键词-EN: large language models, LLM judges, large language, judges, study LLM judges
关键词-ZH: 大型语言模型,LLM评委,大型语言,评委,研究LLM评委
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring the correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the used judges are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance. We observe that judges tend to choose the model of higher quality even if its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we either swap or mask the candidate answers and observe that judges often keep the original judgment, providing evidence that judges incorporate writing style in their judgments. In summary, we find that regularities in the judgments are quantifiable using statistical measures and provide various angles on exploiting them.
摘要:为了减少对人工标注的需要,已有研究提出用大语言模型(LLM)作为评判者来评判其他候选模型的质量。LLM评判者通常通过测量其与人类在生成任务(如摘要或机器翻译)上的判断的相关性来进行评估。与此不同,我们研究的是数学推理任务中的LLM评判者。这些任务需要多步骤推理,其解答的正确性是可验证的,从而能够进行更客观的评估。我们进行了详细的性能分析,发现所使用的评判者大多无法提高任务性能,但能够选出更好的模型。我们的分析发现,评判性能与候选模型的任务性能之间存在很强的相关性。我们观察到,即使质量更高的模型给出的答案不正确,评判者也倾向于选择该模型。此外,我们还表明,可以使用统计量(例如各个模型的任务性能)来预测评判性能。在消融实验中,我们交换或遮蔽候选答案,并观察到评判者经常保持原始判断,这为评判者在判断中考虑了写作风格提供了证据。总而言之,我们发现判断中的规律性可以用统计度量来量化,并提供了利用这些规律性的多种角度。
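
The swap ablation mentioned in the abstract can be sketched as a small harness that measures how often a judge keeps its original choice once the candidates' final answers are exchanged. This is a hypothetical sketch: the candidate format (`reasoning` and `answer` fields), the function names, and the toy `length_judge` are assumptions made for illustration, and the paper's actual protocol may differ.

```python
def swap_answers(cand_a, cand_b):
    """Exchange the final answers while keeping each candidate's reasoning text."""
    return (
        {"reasoning": cand_a["reasoning"], "answer": cand_b["answer"]},
        {"reasoning": cand_b["reasoning"], "answer": cand_a["answer"]},
    )

def kept_judgment_rate(judge, pairs):
    """Fraction of pairs where the judge picks the same candidate slot after the swap."""
    kept = 0
    for a, b in pairs:
        original = judge(a, b)
        swapped = judge(*swap_answers(a, b))
        kept += int(original == swapped)
    return kept / len(pairs)

def length_judge(a, b):
    """Toy stand-in for an LLM judge: prefers the longer reasoning, ignores the answer."""
    return 0 if len(a["reasoning"]) >= len(b["reasoning"]) else 1

pairs = [
    ({"reasoning": "2 + 2 = 4, then 4 * 3 = 12.", "answer": "12"},
     {"reasoning": "It is 10.", "answer": "10"}),
    ({"reasoning": "Half of 9.", "answer": "4.5"},
     {"reasoning": "9 / 2 = 4.5, rounded down gives 4.", "answer": "4"}),
]

rate = kept_judgment_rate(length_judge, pairs)
```

A rate near 1.0 means the judge's choice barely depends on the final answer, which is the kind of evidence the abstract cites for judges relying on writing style.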

[NLP-12] Can OpenSource beat ChatGPT? – A Comparative Study of Large Language Models for Text-to-Code Generation
[NLP-12] 开源能击败ChatGPT吗?–用于文本到代码生成的大型语言模型的比较研究

链接: https://arxiv.org/abs/2409.04164
作者: Luis Mayer,Christian Heumann,Matthias Aßenmacher
关键词-EN: including software engineering, large language models, recent years, including software, software engineering
关键词-ZH: 包括软件工程、大型语言模型、近年来,包括软件、软件工程
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Conference Paper accepted at the 9th SwissText Conference (2024)

Abstract:In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode to the models with the task of creating solutions in Python. Subsequently, the quality of the generated outputs is assessed using the testing functionalities of LeetCode. The results indicate large differences in performance between the investigated models. ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compared them to the other code submissions on Leetcode. A detailed error analysis, encompassing a comparison of the differences concerning correct indentation and form of the generated code as well as an assignment of the incorrectly solved tasks to certain error categories allows us to obtain a more nuanced picture of the results and potential for improvement. The results also show a clear pattern of increasingly incorrect produced code when the models are facing a lot of context in the form of longer prompts.
摘要:近年来,大型语言模型已经成为一种强大的工具,在包括软件工程在内的各个领域都具有潜在的应用前景。在这项研究的范围内,我们评估了五种不同的最先进的LLM-Bard、BingChat、ChatGPT、Llama2和Code Llama-关于它们的文本到代码生成的能力。在一项实证研究中,我们将来自编程网站LeetCode的编码问题的文本描述反馈给模型,任务是用Python创建解决方案。随后,使用LeetCode的测试功能评估生成的输出的质量。结果表明,所调查的模型之间在性能上存在很大差异。到目前为止,ChatGPT可以最有效地处理这些典型的编程挑战,甚至超过了Code Llama等代码专门化模型。为了获得更深入的了解,我们测量了生成的输出的运行时和内存使用情况,并将它们与Leetcode上的其他代码提交进行了比较。详细的错误分析,包括比较与生成代码的正确缩进和形式有关的差异,以及将错误解决的任务分配到某些错误类别,使我们能够获得结果和改进潜力的更细微的图景。结果还显示,当模型以更长提示的形式面对大量上下文时,产生的代码越来越不正确的明显模式。

[NLP-13] A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction
[NLP-13] 硬币有两面:一种新型的中文拼写纠正检测器-纠正器框架

链接: https://arxiv.org/abs/2409.04150
作者: Xiangke Zeng,Zuchao Li,Lefei Zhang,Ping Wang,Hongqiu Wu,Hai Zhao
关键词-EN: Natural Language Processing, foundational Natural Language, Chinese Spelling Correction, Language Processing, Chinese Spelling
关键词-ZH: 自然语言处理、基础自然语言、中文拼写纠正、语言处理、中文拼写
类目: Computation and Language (cs.CL)
备注: ECAI-2024

Abstract:Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task, which primarily focuses on the correction of erroneous characters in Chinese texts. Certain existing methodologies opt to disentangle the error correction process, employing an additional error detector to pinpoint error positions. However, owing to the inherent performance limitations of error detector, precision and recall are like two sides of the coin which can not be both facing up simultaneously. Furthermore, it is also worth investigating how the error position information can be judiciously applied to assist the error correction. In this paper, we introduce a novel approach based on error detector-corrector framework. Our detector is designed to yield two error detection results, each characterized by high precision and recall. Given that the occurrence of errors is context-dependent and detection outcomes may be less precise, we incorporate the error detection results into the CSC task using an innovative feature fusion strategy and a selective masking strategy. Empirical experiments conducted on mainstream CSC datasets substantiate the efficacy of our proposed method.
摘要:汉语拼写纠正是一项基础性的自然语言处理任务,其主要任务是对中文文本中的错误字进行纠正。某些现有的方法选择解开纠错过程,使用额外的错误检测器来精确定位错误位置。然而,由于检错器固有的性能局限性,查准率和查全率就像硬币的两面,不可能同时朝上。此外,如何明智地应用错误位置信息来辅助纠错也是值得研究的。本文提出了一种基于检错器-纠错器框架的新方法。我们的检测器设计为产生两个错误检测结果,每个结果都具有高精度和高召回率的特点。考虑到错误的发生与上下文相关,检测结果可能不那么精确,我们使用创新的特征融合策略和选择性掩蔽策略将错误检测结果整合到CSC任务中。在主流CSC数据集上进行的实验证明了该方法的有效性。

[NLP-14] Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering
[NLP-14] 基于提示的人格画像:用于相关性过滤的强化学习

链接: https://arxiv.org/abs/2409.04122
作者: Jan Hofmann,Cornelia Sindermann,Roman Klinger
关键词-EN: inferring characteristics, characteristics about individuals, individuals by analyzing, large language models, Author profiling
关键词-ZH: 推断特征,个人特征,通过分析个人,大型语言模型,作者分析
类目: Computation and Language (cs.CL)
备注: preprint, under review, supplementary material will be made available upon acceptance of the paper

Abstract:Author profiling is the task of inferring characteristics about individuals by analyzing content they share. Supervised machine learning still dominates automatic systems that perform this task, despite the popularity of prompting large language models to address natural language understanding tasks. One reason is that the classification instances consist of large amounts of posts, potentially a whole user profile, which may exceed the input length of Transformers. Even if a model can use a large context window, the entirety of posts makes the application of API-accessed black box systems costly and slow, next to issues which come with such “needle-in-the-haystack” tasks. To mitigate this limitation, we propose a new method for author profiling which aims at distinguishing relevant from irrelevant content first, followed by the actual user profiling only with relevant data. To circumvent the need for relevance-annotated data, we optimize this relevance filter via reinforcement learning with a reward function that utilizes the zero-shot capabilities of large language models. We evaluate our method for Big Five personality trait prediction on two Twitter corpora. On publicly available real-world data with a skewed label distribution, our method shows similar efficacy to using all posts in a user profile, but with a substantially shorter context. An evaluation on a version of these data balanced with artificial posts shows that the filtering to relevant posts leads to a significantly improved accuracy of the predictions.
摘要:作者侧写是通过分析个人分享的内容来推断个人特征的任务。有监督的机器学习仍然主导着执行这一任务的自动系统,尽管促使大型语言模型处理自然语言理解任务的做法很受欢迎。一个原因是分类实例由大量帖子组成,可能包含整个用户配置文件,这可能会超过Transformers的输入长度。即使模型可以使用大的上下文窗口,整个帖子也会使API访问的黑匣子系统的应用程序成本高昂且速度慢,其次是伴随着这种大海捞针任务的问题。为了缓解这一局限性,我们提出了一种新的作者侧写方法,其目的是首先区分相关和无关的内容,然后只使用相关数据进行实际的用户侧写。为了避免对相关性注释数据的需要,我们通过带有奖励函数的强化学习来优化这个相关性过滤器,该函数利用了大型语言模型的零命中能力。我们在两个推特语料库上对我们的五大人格特质预测方法进行了评估。在标签分布不对称的公开可用的真实世界数据上,我们的方法显示出与使用用户配置文件中的所有帖子类似的效果,但上下文要短得多。对与人工帖子相平衡的这些数据的一个版本的评估表明,对相关帖子的过滤大大提高了预测的准确性。

[NLP-15] Confidence-Aware Document OCR Error Detection
[NLP-15] 置信度感知的文档OCR错误检测

链接: https://arxiv.org/abs/2409.04117
作者: Arthur Hemmer,Mickaël Coustaty,Nicola Bartolo,Jean-Marc Ogier
关键词-EN: Optical Character Recognition, Optical Character, Character Recognition, impact subsequent applications, OCR confidence scores
关键词-ZH: 光学字符识别、光学字符、字符识别、影响后续应用、OCR置信度分数
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.
摘要:光学字符识别(OCR)继续面临影响后续应用的准确性挑战。为了解决这些错误,我们探索了OCR置信度分数在增强OCR后错误检测方面的实用性。我们的研究涉及分析不同OCR系统的置信分数和错误率之间的相关性。我们开发了ConfBERT,这是一个基于BERT的模型,它将OCR置信度分数融入到令牌嵌入中,并提供可选的预训练阶段用于噪音调整。我们的实验结果表明,集成OCR置信度分数可以增强错误检测能力。这项工作强调了OCR置信度分数在提高检测准确性方面的重要性,并揭示了商业和开源OCR技术之间的性能差异。

[NLP-16] Multi-Programming Language Ensemble for Code Generation in Large Language Model
[NLP-16] 大型语言模型中用于代码生成的多编程语言集合

链接: https://arxiv.org/abs/2409.04114
作者: Tengfei Xue,Xuefeng Li,Tahir Azim,Roman Smirnov,Jianhui Yu,Arash Sadrieh,Babak Pahlavan
关键词-EN: Large language models, Large language, significantly improved code, significantly improved, code generation
关键词-ZH: 大型语言模型、大型语言、显着改进的代码生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code available at this https URL

Abstract:Large language models (LLMs) have significantly improved code generation, particularly in one-pass code generation. However, most existing approaches focus solely on generating code in a single programming language, overlooking the potential of leveraging the multi-language capabilities of LLMs. LLMs have varying patterns of errors across different languages, suggesting that a more robust approach could be developed by leveraging these multi-language outputs. In this study, we propose Multi-Programming Language Ensemble (MPLE), a novel ensemble-based method that utilizes code generation across multiple programming languages to enhance overall performance. By treating each language-specific code generation process as an individual “weak expert” and effectively integrating their outputs, our method mitigates language-specific errors and biases. This multi-language ensemble strategy leverages the complementary strengths of different programming languages, enabling the model to produce more accurate and robust code. Our approach can be seamlessly integrated with commonly used techniques such as the reflection algorithm and Monte Carlo tree search to improve code generation quality further. Experimental results show that our framework consistently enhances baseline performance by up to 17.92% on existing benchmarks (HumanEval and HumanEval-plus), with a standout result of 96.25% accuracy on the HumanEval benchmark, achieving new state-of-the-art results across various LLM models. The code will be released at this https URL
摘要:大型语言模型(LLM)显著改进了代码生成,特别是在一次通过代码生成方面。然而,大多数现有的方法只关注用一种编程语言生成代码,而忽略了利用LLMS的多语言功能的潜力。LLM在不同的语言中有不同的错误模式,这表明可以通过利用这些多语言输出来开发更健壮的方法。在这项研究中,我们提出了多编程语言集成(MPLE),这是一种基于集成的新方法,它利用跨多种编程语言的代码生成来提高整体性能。通过将每个特定语言的代码生成过程视为一个单独的“弱专家”,并有效地集成它们的输出,我们的方法减少了特定语言的错误和偏差。这种多语言集成策略利用了不同编程语言的互补优势,使模型能够生成更准确和更健壮的代码。我们的方法可以与反射算法和蒙特卡罗树搜索等常用技术无缝集成,进一步提高代码生成质量。实验结果表明,我们的框架在现有基准(HumanEval和HumanEval-plus)上一致地将基线性能提高了17.92%,在HumanEval基准上获得了96.25%的突出结果,在各种LLM模型上实现了新的最先进的结果。代码将在此HTTPS URL发布

[NLP-17] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100 NLP Researchers
[NLP-17] LLM能否产生新颖的研究想法?一项有100名NLP研究人员参与的大规模人类研究

链接: https://arxiv.org/abs/2409.04109
作者: Chenglei Si,Diyi Yang,Tatsunori Hashimoto
关键词-EN: large language models, accelerate scientific discovery, Recent advancements, works proposing research, language models
关键词-ZH: 大型语言模型、加速科学发现、最新进展、提出研究的作品、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: main paper is 20 pages

Abstract:Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.
摘要:大型语言模型(LLM)的最新进展引发了人们对其加速科学发现潜力的乐观情绪,越来越多的工作提出了自主生成和验证新想法的研究智能体。尽管如此,没有任何评估表明LLM系统可以迈出产生新颖的专家级想法的第一步,更不用说执行整个研究过程了。我们通过建立一个实验设计来解决这个问题,该设计在控制混杂因素的同时评估研究想法的产生,并在专家NLP研究人员和LLM构思智能体之间进行首次正面比较。通过招募100多名NLP研究人员撰写新想法并对LLM和人类的想法进行盲审,我们获得了关于当前LLM研究构思能力的第一个具有统计学意义的结论:我们发现LLM产生的想法被评为比人类专家的想法更新颖(p < 0.05),而可行性略弱。仔细研究我们的智能体基线,我们发现了构建和评估研究智能体时存在的开放性问题,包括LLM自我评估的失败和它们在生成中缺乏多样性。最后,我们承认人类对新颖性的判断可能很困难,即使是专家也是如此,并提出了一种端到端的研究设计,招募研究人员将这些想法落实到完整的项目中,使我们能够研究这些新颖性和可行性判断是否会导致研究结果的有意义的差异。

[NLP-18] Structure and dynamics of growing networks of Reddit threads
[NLP-18] 不断增长的Reddit线程网络的结构和动态

链接: https://arxiv.org/abs/2409.04085
作者: Diletta Goglia,Davide Vega
关键词-EN: sense of belonging, validation and self-recognition, reinforce their sense, Millions, social
关键词-ZH: 归属感、认可和自我认可,强化他们的感觉,数百万人,社交
类目: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 29 pages, 9 figures, 5 tables

Abstract:Millions of people use online social networks to reinforce their sense of belonging, for example by giving and asking for feedback as a form of social validation and self-recognition. It is common to observe disagreement among people beliefs and points of view when expressing this feedback. Modeling and analyzing such interactions is crucial to understand social phenomena that happen when people face different opinions while expressing and discussing their values. In this work, we study a Reddit community in which people participate to judge or be judged with respect to some behavior, as it represents a valuable source to study how users express judgments online. We model threads of this community as complex networks of user interactions growing in time, and we analyze the evolution of their structural properties. We show that the evolution of Reddit networks differ from other real social networks, despite falling in the same category. This happens because their global clustering coefficient is extremely small and the average shortest path length increases over time. Such properties reveal how users discuss in threads, i.e. with mostly one other user and often by a single message. We strengthen such result by analyzing the role that disagreement and reciprocity play in such conversations. We also show that Reddit thread’s evolution over time is governed by two subgraphs growing at different speeds. We discover that, in the studied community, the difference of such speed is higher than in other communities because of the user guidelines enforcing specific user interactions. Finally, we interpret the obtained results on user behavior drawing back to Social Judgment Theory.
摘要:数以百万计的人使用在线社交网络来加强他们的归属感,例如,通过给予和要求反馈作为一种社会认可和自我认可的形式。在表达这种反馈时,观察到人们、信仰和观点之间的分歧是很常见的。对这种相互作用进行建模和分析,对于理解人们在表达和讨论自己的价值观时面对不同观点时发生的社会现象至关重要。在这项工作中,我们研究了一个Reddit社区,在这个社区中,人们参与到对某些行为的判断或被判断中,因为它是研究用户如何在线表达判断的宝贵来源。我们将这个社区的线程建模为随时间增长的用户交互的复杂网络,并分析其结构属性的演变。我们发现,Reddit网络的演变不同于其他真正的社交网络,尽管它们属于同一类别。这是因为它们的全局聚类系数非常小,并且平均最短路径长度随着时间的推移而增加。这些属性揭示了用户如何在线程中讨论,即主要与一个其他用户讨论,并且通常是通过一条消息进行讨论。我们通过分析不一致和互惠在这样的对话中所起的作用来加强这一结果。我们还表明,Reddit线程随时间的演化由两个以不同速度增长的子图控制。我们发现,在被研究的社区中,这种速度的差异比在其他社区中更高,因为用户指南强制执行特定的用户交互。最后,我们回顾了社会判断理论对用户行为的研究结果进行了解释。
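
The two structural measures named in the abstract, the global clustering coefficient and the average shortest path length, can be illustrated on a toy reply graph. The following is a minimal sketch, not the authors' pipeline: it assumes an undirected user graph built from reply pairs, and the helper names (`build_graph`, `global_clustering`, `average_shortest_path`) and the sample threads are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (user, user) reply pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-replies
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def global_clustering(adj):
    """Transitivity: closed triples divided by all connected triples."""
    closed = 0
    triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2  # triples centred on this node
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1  # the two neighbours are also linked
    return closed / triples if triples else 0.0


def average_shortest_path(adj):
    """Mean BFS distance over all ordered pairs of distinct reachable nodes."""
    total = 0
    pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical thread: three users each reply only to the original poster.
star = build_graph([("op", "a"), ("op", "b"), ("op", "c")])
# Hypothetical thread: three users all reply to one another.
triangle = build_graph([("a", "b"), ("b", "c"), ("a", "c")])

print(global_clustering(star), average_shortest_path(star))
print(global_clustering(triangle), average_shortest_path(triangle))
```

A star-shaped thread has no closed triples, so its clustering coefficient is zero, which matches the abstract's description of users who mostly talk to one other user.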

[NLP-19] UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
[NLP-19] UI-JEPA:通过屏幕用户活动主动感知用户意图

链接: https://arxiv.org/abs/2409.04081
作者: Yicheng Fu,Raviteja Anantha,Prabal Vashisht,Jianpeng Cheng,Etai Littwin
关键词-EN: Generating user intent, Generating user, intent, user intent, Generating
关键词-ZH: 生成用户意图,生成用户,意图,用户意图,生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

Abstract:Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters, computing power, and high latency makes them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, “Intent in the Wild” (IIW) and “Intent in the Tame” (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across two datasets. Notably, UI-JEPA accomplishes the performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency in the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding.
摘要:从一系列用户界面(UI)动作中生成用户意图是全面理解UI的核心挑战。多模式大型语言模型(MLLM)的最新进展导致了这一领域的实质性进展,但它们对广泛的模型参数、计算能力和高延迟的需求使其不适用于需要低延迟或高隐私的轻量级设备上解决方案的场景。此外,缺乏高质量的数据集阻碍了这种轻量级模型的开发。为了应对这些挑战,我们提出了一种新颖的框架UI-JEPA,该框架使用掩蔽策略通过自监督学习从未标记的数据中学习抽象的UI嵌入,并结合针对用户意图预测进行微调的LLM解码器。我们还引入了两个新的基于用户界面的多模式数据集,“Intent in the Wild”(IIW)和“Intent in the Tame”(IIT),专为少镜头和零镜头的UI理解任务而设计。IIW包含219个意向类别的1.7K视频,而IIT包含10个类别的914个视频。我们为这些数据集建立了第一个基线,表明使用JEPA风格的目标学习的表示与LLM解码器相结合,可以实现与最先进的大型MLLM性能相匹配的用户意图预测,但显著减少了注释和部署资源。以意图相似度得分衡量,UI-JEPA在两个数据集上的平均表现分别比GPT-4 Turbo和Claude 3.5十四行诗高10.0%和7.2%。值得注意的是,在IIW数据集中,UI-JEPA的计算成本降低了50.5倍,延迟提高了6.6倍。这些结果强调了UI-JEPA的有效性,突出了它在轻量级、高性能UI理解方面的潜力。

[NLP-20] AnyMatch – Efficient Zero-Shot Entity Matching with a Small Language Model
[NLP-20] AnyMatch:使用小语言模型的高效零样本实体匹配

链接: https://arxiv.org/abs/2409.04073
作者: Zeyu Zhang,Paul Groth,Iacer Calixto,Sebastian Schelter
关键词-EN: records refer, product catalogs, catalogs or address, Entity matching, zero-shot entity matching
关键词-ZH: 记录引用、产品目录、目录或地址、实体匹配、零镜头实体匹配
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 12 pages excluding references, 3 figures, and 5 tables

Abstract:Entity matching (EM) is the problem of determining whether two records refer to same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, in a comparison to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall, and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude less parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens). 
摘要:实体匹配(EM)是确定两条记录是否指向同一真实世界实体的问题,这在产品目录或地址数据库等数据集成中至关重要。许多EM方法的一个主要缺点是依赖于有标签的样本。因此,我们将重点放在具有挑战性的零样本实体匹配设置上,其中对于未见过的目标数据集,没有标记的示例可用。近年来,大型语言模型(LLM)在零样本EM方面取得了很好的效果,但其低吞吐量和高部署成本限制了它们的适用性和可扩展性。我们用AnyMatch重新审视了零样本EM问题,AnyMatch是一个在迁移学习设置中进行了微调的小型语言模型。我们提出了几种新的数据选择技术来为我们的模型生成微调数据,例如,通过AutoML过滤器选择难以匹配的记录对,通过生成额外的属性级示例,以及通过控制数据中的标签失衡。我们对我们的模型的预测质量和部署成本进行了广泛的评估,并与9个基准数据集上的13个基线进行了比较。我们发现,尽管AnyMatch的参数规模很小,但它提供了具有竞争力的预测质量:它总体上达到了第二高的F1得分,并优于其他几种使用具有数千亿个参数的模型的方法。此外,我们的方法显示出显著的成本优势:AnyMatch的平均预测质量与采用专有万亿参数模型GPT-4的最先进方法MatchGPT相差在4.4%以内,但AnyMatch需要的参数少四个数量级,推理成本低3899倍(以美元/1000词元为单位)。

[NLP-21] Self-Harmonized Chain of Thought
[NLP-21] 自我协调的思维链

链接: https://arxiv.org/abs/2409.04057
作者: Ziqi Jin,Wei Lu
关键词-EN: large language models, performing complex reasoning, reveals that large, large language, capable of performing
关键词-ZH: 大型语言模型,执行复杂推理,揭示大型,大型语言,能够执行
类目: Computation and Language (cs.CL)
备注:

Abstract:Chain-of-Thought (CoT) prompting reveals that large language models are capable of performing complex reasoning via intermediate steps. CoT prompting is primarily categorized into three approaches. The first approach utilizes straightforward prompts like “Let’s think step by step” to generate a sequential thought process before yielding an answer. The second approach makes use of human-crafted, step-by-step demonstrations to guide the model’s reasoning process. The third automates the generation of reasoned demonstrations with the ‘Let’s think step by step’. This approach sometimes leads to reasoning errors, highlighting the need to diversify demonstrations to mitigate its misleading effects. However, diverse demonstrations pose challenges for effective representations. In this work, we propose ECHO, a self-harmonized chain-of-thought prompting method. It consolidates diverse solution paths into a uniform and effective solution pattern. ECHO demonstrates the best overall performance across three reasoning domains.
摘要:思维链(CoT)提示揭示了大型语言模型能够通过中间步骤执行复杂的推理。CoT提示主要分为三种方法。第一种方法利用直接的提示,比如“让我们一步一步地思考”,在得到答案之前产生一个连续的思考过程。第二种方法利用人工编写的循序渐进的示例来指导模型的推理过程。第三种方法是通过“让我们一步一步地思考”自动生成推理示例。这种方法有时会导致推理错误,凸显了需要使示例多样化以减少其误导性影响。然而,多样化的示例对有效的表示提出了挑战。在这项工作中,我们提出了一种自我协调的思维链提示方法ECHO。它将不同的解题路径整合成一个统一有效的解题模式。ECHO在三个推理领域表现出最好的整体性能。

[NLP-22] Refining Wikidata Taxonomy using Large Language Models
[NLP-22] 使用大型语言模型精炼维基数据分类

链接: https://arxiv.org/abs/2409.04056
作者: Yiwen Peng(IP Paris),Thomas Bonald(IP Paris),Mehwish Alam(IP Paris)
关键词-EN: Large Language Models, collaborative nature, taxonomic paths, presence of cycles, recurrent issues
关键词-ZH: 大型语言模型、协作性质、分类路径、周期的存在、反复出现的问题
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: ACM International Conference on Information and Knowledge Management, Oct 2024, Boise, Idaho, United States

Abstract:Due to its collaborative nature, Wikidata is known to have a complex taxonomy, with recurrent issues like the ambiguity between instances and classes, the inaccuracy of some taxonomic paths, the presence of cycles, and the high level of redundancy across classes. Manual efforts to clean up this taxonomy are time-consuming and prone to errors or subjective decisions. We present WiKC, a new version of Wikidata taxonomy cleaned automatically using a combination of Large Language Models (LLMs) and graph mining techniques. Operations on the taxonomy, such as cutting links or merging classes, are performed with the help of zero-shot prompting on an open-source LLM. The quality of the refined taxonomy is evaluated from both intrinsic and extrinsic perspectives, on a task of entity typing for the latter, showing the practical interest of WiKC.
摘要:由于其协作性质,维基数据被认为具有复杂的分类学,经常出现的问题,例如实例和类之间的模糊性、一些分类路径的不准确性、循环的存在以及类之间的高度冗余。手动清理此分类法非常耗时,并且容易出现错误或主观决策。我们介绍了WiKC,这是维基数据分类法的新版本,使用大型语言模型(LLM)和图挖掘技术的组合自动清理。分类学上的操作,例如剪切链接或合并类,是在开源LLM上的零触发提示的帮助下执行的。从内在和外在的角度评估细化分类法的质量,并执行后者的实体类型任务,展示了WiKC的实际兴趣。
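
One of the recurrent issues the abstract lists is the presence of cycles in the taxonomy. The following is a minimal sketch of how such a cycle could be detected with a depth-first search before any link is cut. It is an illustration only, not WiKC's actual procedure: the `find_cycle` helper and the toy `subclass_of` mappings are hypothetical, and the LLM-based decision of which link to cut is not modelled.

```python
def find_cycle(subclass_of):
    """Return one cycle as a list of classes, or None if the taxonomy is acyclic.

    subclass_of maps each class to the list of its direct superclasses.
    Uses a recursive DFS, so it is only suitable for small graphs.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for parent in subclass_of.get(node, []):
            state = color.get(parent, WHITE)
            if state == GREY:
                # parent is already on the current path: a cycle is closed
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(subclass_of):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# Hypothetical taxonomy fragments used only for illustration.
cyclic = {
    "dog": ["mammal"],
    "mammal": ["animal"],
    "animal": ["organism"],
    "organism": ["animal"],  # erroneous link that closes a cycle
}
acyclic = {
    "dog": ["mammal"],
    "mammal": ["animal"],
    "animal": ["organism"],
}

print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

Once a cycle is reported, a refinement step would still have to choose which of its links to remove; the abstract says that choice is made with zero-shot prompting of an open-source LLM.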

[NLP-23] Towards Safer Online Spaces: Simulating and Assessing Intervention Strategies for Eating Disorder Discussions
[NLP-23] 迈向更安全的在线空间:模拟和评估饮食失调讨论的干预策略

链接: https://arxiv.org/abs/2409.04043
作者: Louis Penafiel,Hsien-Te Kao,Isabel Erickson,David Chu,Robert McCormack,Kristina Lerman,Svitlana Volkova
关键词-EN: mental health conditions, Eating disorders, complex mental health, mental health, health conditions
关键词-ZH: 心理健康状况、饮食失调、复杂心理健康、心理健康、健康状况
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures

Abstract:Eating disorders are complex mental health conditions that affect millions of people around the world. Effective interventions on social media platforms are crucial, yet testing strategies in situ can be risky. We present a novel LLM-driven experimental testbed for simulating and assessing intervention strategies in ED-related discussions. Our framework generates synthetic conversations across multiple platforms, models, and ED-related topics, allowing for controlled experimentation with diverse intervention approaches. We analyze the impact of various intervention strategies on conversation dynamics across four dimensions: intervention type, generative model, social media platform, and ED-related community/topic. We employ cognitive domain analysis metrics, including sentiment, emotions, etc., to evaluate the effectiveness of interventions. Our findings reveal that civility-focused interventions consistently improve positive sentiment and emotional tone across all dimensions, while insight-resetting approaches tend to increase negative emotions. We also uncover significant biases in LLM-generated conversations, with cognitive metrics varying notably between models (Claude-3 Haiku, Mistral, GPT-3.5-turbo, LLaMA3) and even between versions of the same model. These variations highlight the importance of model selection in simulating realistic discussions related to ED. Our work provides valuable information on the complex dynamics of ED-related discussions and the effectiveness of various intervention strategies.
摘要:进食障碍是一种复杂的心理健康问题,影响着全球数百万人。在社交媒体平台上进行有效的干预至关重要,但在现场测试策略可能会有风险。我们提出了一个新颖的LLM驱动的实验测试台,用于模拟和评估与ED相关的讨论中的干预策略。我们的框架生成跨多个平台、模型和ED相关主题的合成对话,允许使用不同的干预方法进行受控实验。我们从干预类型、生成模型、社交媒体平台、ED相关社区/话题四个维度分析了不同干预策略对会话动态的影响。我们使用认知域分析指标,包括情感、情绪等,来评估干预的有效性。我们的发现表明,以文明礼貌为重点的干预在所有维度上都会持续改善积极情感和情绪基调,而洞察力重置的方法往往会增加负面情绪。我们还发现,在LLM生成的对话中存在显著的偏差,认知指标在不同模型(Claude-3 Haiku、Mistral、GPT-3.5-turbo、LLaMA3)之间甚至在同一模型的不同版本之间都存在显著差异。这些差异突出了模型选择在模拟与ED相关的现实讨论中的重要性。我们的工作提供了关于ED相关讨论的复杂动态以及各种干预策略的有效性的有价值的信息。

[NLP-24] Large Margin Prototypical Network for Few-shot Relation Classification with Fine-grained Features
[NLP-24] 具有细粒度特征的小样本关系分类大间隔原型网络

链接: https://arxiv.org/abs/2409.04009
作者: Miao Fan,Yeqi Bai,Mingming Sun,Ping Li
关键词-EN: knowledge graph completion, natural language understanding, plays a pivotal, graph completion, pivotal role
关键词-ZH: 知识图完成,自然语言理解,发挥着关键,图完成,关键的作用
类目: Computation and Language (cs.CL)
备注: Accepted by CIKM’19

Abstract:Relation classification (RC) plays a pivotal role in both natural language understanding and knowledge graph completion. It is generally formulated as a task to recognize the relationship between two entities of interest appearing in a free-text sentence. Conventional approaches on RC, regardless of feature engineering or deep learning based, can obtain promising performance on categorizing common types of relation leaving a large proportion of unrecognizable long-tail relations due to insufficient labeled instances for training. In this paper, we consider few-shot learning is of great practical significance to RC and thus improve a modern framework of metric learning for few-shot RC. Specifically, we adopt the large-margin ProtoNet with fine-grained features, expecting they can generalize well on long-tail relations. Extensive experiments were conducted by FewRel, a large-scale supervised few-shot RC dataset, to evaluate our framework: LM-ProtoNet (FGF). The results demonstrate that it can achieve substantial improvements over many baseline approaches.
摘要:关系分类在自然语言理解和知识图补全中起着举足轻重的作用。它通常被制定为一项任务,即识别出现在自由文本句子中的两个感兴趣实体之间的关系。传统的关系分类方法,无论是基于特征工程的关系分类还是基于深度学习的关系分类,都能取得很好的分类效果,留下了很大一部分由于训练样本不足而无法识别的长尾关系。在本文中,我们认为少镜头学习对RC具有重要的现实意义,从而完善了一个用于少镜头RC的现代度量学习框架。具体地说,我们采用了具有细粒度特征的大间隔Protonet,期望它们能够很好地推广到长尾关系上。使用大规模监督少镜头RC数据集FewRel进行了大量的实验,以评估我们的框架:LM-Protonet(FGF)。结果表明,与许多基线方法相比,该方法可以取得实质性的改进。
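
The metric-learning idea behind a prototypical network can be shown in a few lines: each relation is represented by the mean of its support embeddings, and a query is assigned to the nearest prototype. The sketch below is a hedged illustration, not the LM-ProtoNet (FGF) implementation. The two-dimensional embeddings and relation names are made up, and the hinge-style `margin_loss` is an assumed stand-in for a large-margin objective, since the abstract does not give the exact loss.

```python
def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypes(support):
    """One prototype per relation: the mean of its support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query, protos):
    """Assign the query to the relation with the nearest prototype."""
    return min(protos, key=lambda label: sq_dist(query, protos[label]))


def margin_loss(query, true_label, protos, margin=1.0):
    """Assumed hinge-style margin: the true prototype should be closer
    than the nearest wrong prototype by at least `margin`."""
    d_true = sq_dist(query, protos[true_label])
    d_other = min(
        sq_dist(query, proto)
        for label, proto in protos.items()
        if label != true_label
    )
    return max(0.0, margin + d_true - d_other)


# Hypothetical 2-way 2-shot episode with made-up sentence embeddings.
support = {
    "founder_of": [[1.0, 0.0], [0.8, 0.2]],
    "born_in": [[0.0, 1.0], [0.2, 0.8]],
}
protos = prototypes(support)
query = [0.7, 0.3]

print(classify(query, protos))
print(margin_loss(query, "founder_of", protos))
```

In this toy episode the query is classified correctly yet still incurs a positive margin loss, which is the point of a large-margin term: it keeps pushing classes apart even after the nearest-prototype decision is already right.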

[NLP-25] On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation
[NLP-25] 论提示构建在提升基于LLM的表格数据生成的效果和效率方面的作用

链接: https://arxiv.org/abs/2409.03946
作者: Banooqa Banday,Kowshik Thopalli,Tanzima Z. Islam,Jayaraman J. Thiagarajan
关键词-EN: sufficient semantic context, real-world tabular data, LLM-based data generation, describe columns, real-world tabular
关键词-ZH: 足够的语义上下文、真实世界的表格数据、基于LLM的数据生成、描述列、真实世界的表格
类目: Computation and Language (cs.CL)
备注:

Abstract:LLM-based data generation for real-world tabular data can be challenged by the lack of sufficient semantic context in feature names used to describe columns. We hypothesize that enriching prompts with domain-specific insights can improve both the quality and efficiency of data generation. To test this hypothesis, we explore three prompt construction protocols: Expert-guided, LLM-guided, and Novel-Mapping. Through empirical studies with the recently proposed GReaT framework, we find that context-enriched prompts lead to significantly improved data generation quality and training efficiency.
摘要:用于描述列的特征名称缺乏足够的语义上下文,针对现实世界表格数据的基于LLM的数据生成可能会受到挑战。我们假设,用特定领域的见解丰富提示可以提高数据生成的质量和效率。为了检验这一假设,我们探索了三种提示构建协议:专家指导(Expert-guided)、LLM指导(LLM-guided)和Novel-Mapping。通过对最近提出的GReaT框架的实证研究,我们发现上下文丰富的提示可以显著提高数据生成质量和训练效率。

[NLP-26] Experimentation in Content Moderation using RWKV
[NLP-26] 使用RWKV进行内容审核实验

链接: https://arxiv.org/abs/2409.03939
作者: Umut Yildirim,Rohan Dutta,Burak Yildirim,Atharva Vaidya
关键词-EN: content moderation, RWKV model efficacy, paper investigates, moderation, content
关键词-ZH: 内容审核、RWKV模型功效、论文调查、审核、内容
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper investigates the RWKV model’s efficacy in content moderation through targeted experimentation. We introduce a novel dataset specifically designed for distillation into smaller models, enhancing content moderation practices. This comprehensive dataset encompasses images, videos, sounds, and text data that present societal challenges. Leveraging advanced Large Language Models (LLMs), we generated an extensive set of responses – 558,958 for text and 83,625 for images – to train and refine content moderation systems. Our core experimentation involved fine-tuning the RWKV model, capitalizing on its CPU-efficient architecture to address large-scale content moderation tasks. By highlighting the dataset’s potential for knowledge distillation, this study not only demonstrates RWKV’s capability in improving the accuracy and efficiency of content moderation systems but also paves the way for developing more compact, resource-efficient models in this domain. Datasets and models can be found in HuggingFace: this https URL
摘要:通过有针对性的实验,考察了RWKV模型在内容调节方面的有效性。我们引入了一个新的数据集,专门为蒸馏成更小的模型而设计,增强了内容调节实践。这一全面的数据集涵盖了代表社会挑战的图像、视频、声音和文本数据。利用先进的大型语言模型(LLM),我们生成了一组广泛的响应–558,958个文本和83,625个图像–以培训和改进内容审核系统。我们的核心实验包括微调RWKV模型,利用其高效的CPU架构来处理大规模的内容审核任务。通过突出数据集在知识提炼方面的潜力,这项研究不仅展示了RWKV在提高内容审核系统的准确性和效率方面的能力,而且也为在该领域开发更紧凑、资源高效的模型铺平了道路。数据集和模型可在HuggingFace中找到:此HTTPS URL

[NLP-27] CACER: Clinical Concept Annotations for Cancer Events and Relations
[NLP-27] CACER:癌症事件和关系的临床概念注释

链接: https://arxiv.org/abs/2409.03905
作者: Yujuan Fu,Giridhar Kaushik Ramachandran,Ahmad Halwani,Bridget T. McInnes,Fei Xia,Kevin Lybarger,Meliha Yetisgen,Özlem Uzuner
关键词-EN: Clinical Concept Annotations, medical problems, present Clinical Concept, patient histories, BERT
关键词-ZH: 临床概念注释、医疗问题、当前临床概念、患者病史、BERT
类目: Computation and Language (cs.CL)
备注: This is a pre-copy-editing, author-produced PDF of an article accepted for publication in JAMIA following peer review. The definitive publisher-authenticated version is available online at this https URL

Abstract:Clinical notes contain unstructured representations of patient histories, including the relationships between medical problems and prescription drugs. To investigate the relationship between cancer drugs and their associated symptom burden, we extract structured, semantic representations of medical problem and drug information from the clinical narratives of oncology notes. We present Clinical Concept Annotations for Cancer Events and Relations (CACER), a novel corpus with fine-grained annotations for over 48,000 medical problems and drug events and 10,000 drug-problem and problem-problem relations. Leveraging CACER, we develop and evaluate transformer-based information extraction (IE) models such as BERT, Flan-T5, Llama3, and GPT-4 using fine-tuning and in-context learning (ICL). In event extraction, the fine-tuned BERT and Llama3 models achieved the highest performance at 88.2-88.0 F1, which is comparable to the inter-annotator agreement (IAA) of 88.4 F1. In relation extraction, the fine-tuned BERT, Flan-T5, and Llama3 achieved the highest performance at 61.8-65.3 F1. GPT-4 with ICL achieved the worst performance across both tasks. The fine-tuned models significantly outperformed GPT-4 in ICL, highlighting the importance of annotated training data and model optimization. Furthermore, the BERT models performed similarly to Llama3. For our task, LLMs offer no performance advantage over the smaller BERT models. The results emphasize the need for annotated training data to optimize models. Multiple fine-tuned transformer models achieved performance comparable to IAA for several extraction tasks.
摘要:临床记录包含对病历的非结构化表示,包括医疗问题和处方药之间的关系。为了研究癌症药物与其相关症状负担之间的关系,我们从肿瘤学笔记的临床叙述中提取了医疗问题和药物信息的结构化、语义表示。我们提出了癌症事件和关系的临床概念标注(CACER),这是一个新的语料库,拥有超过48,000个医疗问题和药物事件以及10,000个药物问题和问题-问题关系的细粒度标注。利用CACER,我们使用微调和上下文学习(ICL)开发和评估基于变压器的信息提取(IE)模型,如BERT、FLAN-T5、Llama3和GPT-4。在事件提取方面,微调的BERT和Llama3模型在88.2-88.0F1上取得了最高的表现,这与88.4F1的注释员间协议(IAA)相当。在关系提取方面,微调的BERT、Flan-T5和Llama3在61.8-65.3F1表现最好。使用ICL的GPT-4在两个任务中的性能都是最差的。微调的模型在ICL中的表现明显优于GPT-4,突显了带注释的训练数据和模型优化的重要性。此外,BERT模型的性能与Llama3相似。对于我们的任务,LLMS没有提供比较小的BERT型号更高的性能优势。结果强调了需要带注释的训练数据来优化模型。多个微调变压器模型在几个提取任务中实现了与IAA相当的性能。

[NLP-28] Sirius: Contextual Sparsity with Correction for Efficient LLMs
[NLP-28] Sirius:面向高效LLM的带纠正的上下文稀疏性

链接: https://arxiv.org/abs/2409.03856
作者: Yang Zhou,Zhuoming Chen,Zhaozhuo Xu,Victoria Lin,Beidi Chen
关键词-EN: large language models, increasingly important, blossom of large, large language, Sirius
关键词-ZH: 大型语言模型,越来越重要,大型的蓬勃发展,大型语言,Sirius
类目: Computation and Language (cs.CL)
备注:

Abstract:With the blossom of large language models (LLMs), inference efficiency becomes increasingly important. Various approximation methods are proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without quality degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, CS significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observed that sparse models often share general problem-solving logic and require only a few token corrections to recover the original model performance. This paper introduces Sirius, an efficient correction mechanism, which significantly recovers CS models quality on reasoning tasks while maintaining its efficiency gain. Sirius is evaluated on 6 models with 8 difficult generation tasks in reasoning, math, and coding and shows consistent effectiveness and efficiency. Also, we carefully develop a system implementation for Sirius and show that Sirius achieves roughly 20% reduction in latency for 8B model on-chip and 35% reduction for 70B model offloading. We open-source our implementation of Sirius at this https URL.
摘要:随着大型语言模型的兴起,推理效率变得越来越重要。为了减少推理时的代价,人们提出了各种近似方法。上下文稀疏性(CS)因其无需训练的性质以及在似乎不降低质量的情况下达到更高的压缩比的能力而受到人们的欢迎。然而,在对各种复杂生成任务的上下文稀疏方法进行综合评估后,我们发现,尽管CS在即时理解任务中取得了成功,但CS在推理、推理和基于知识的任务中显著降低了模型的性能。尽管在端到端的准确性方面存在差距,我们观察到稀疏模型通常共享一般的问题解决逻辑,并且只需要几个象征性的修正就可以恢复原始模型的性能。本文介绍了一种有效的校正机制Sirius,它在保持效率收益的同时,显著地恢复了CS模型在推理任务中的质量。天狼星在6个模型上进行了评估,在推理、数学和编码方面有8个困难的生成任务,表现出一致的有效性和效率。此外,我们仔细开发了一个针对天狼星的系统实现,结果表明,对于8B模型片上,Sirius实现了大约20%的延迟减少,而对于70B模型卸载,则减少了35%。我们在这个HTTPS URL上开放了我们的Sirius实现。

[NLP-29] Persona Setting Pitfall: Persistent Outgroup Biases in Large Language Models Arising from Social Identity Adoption
[NLP-29] 人设设定陷阱:因采纳社会身份而在大型语言模型中产生的持续性外群体偏见

链接: https://arxiv.org/abs/2409.03843
作者: Wenchao Dong,Assem Zhunis,Dongyoung Jeong,Hyojin Chin,Jiyoung Han,Meeyoung Cha
关键词-EN: internalize identities imposed, Drawing parallels, Social Identity Theory, artificial intelligence, internalize identities
关键词-ZH: 内化强加的身份,绘制相似之处,社会身份理论,人工智能,内化身份
类目: Computation and Language (cs.CL)
备注: 23 pages, 5 figures

点击查看摘要

Abstract:Drawing parallels between human cognition and artificial intelligence, we explored how large language models (LLMs) internalize identities imposed by targeted prompts. Informed by Social Identity Theory, these identity assignments lead LLMs to distinguish between “we” (the ingroup) and “they” (the outgroup). This self-categorization generates both ingroup favoritism and outgroup bias. Nonetheless, existing literature has predominantly focused on ingroup favoritism, often overlooking outgroup bias, which is a fundamental source of intergroup prejudice and discrimination. Our experiment addresses this gap by demonstrating that outgroup bias manifests as strongly as ingroup favoritism. Furthermore, we successfully mitigated the inherent pro-liberal, anti-conservative bias in LLMs by guiding them to adopt the perspectives of the initially disfavored group. These results were replicated in the context of gender bias. Our findings highlight the potential to develop more equitable and balanced language models.
摘要:通过类比人类认知与人工智能,我们探索了大语言模型(LLM)如何内化由定向提示所赋予的身份。依据社会认同理论,这些身份分配会使LLM区分“我们”(内群体)和“他们”(外群体)。这种自我归类既产生内群体偏爱,也产生外群体偏见。然而,现有文献主要关注内群体偏爱,往往忽视了外群体偏见,而后者正是群体间偏见和歧视的根本来源。我们的实验弥补了这一空白,证明外群体偏见的表现与内群体偏爱同样强烈。此外,我们通过引导LLM采纳最初不受青睐群体的视角,成功缓解了LLM固有的亲自由派、反保守派偏见。这些结果在性别偏见的情境中也得到了复现。我们的发现凸显了开发更公平、更均衡的语言模型的潜力。

[NLP-30] How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
[NLP-30] 您的代码LLM性能如何?利用高质量数据支持代码指令调优

链接: https://arxiv.org/abs/2409.03810
作者: Yejie Wang,Keqing He,Dayuan Fu,Zhuoma Gongque,Heyang Xu,Yanxu Chen,Zhexu Wang,Yujia Fu,Guanting Dong,Muxi Diao,Jingang Wang,Mengdi Zhang,Xunliang Cai,Weiran Xu
关键词-EN: growing interest, interest in studying, Recently, data, code
关键词-ZH: 兴趣越来越大,学习兴趣,最近,数据,代码
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe Code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which dataset genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show XCoder achieves new state-of-the-art performance using fewer training data, which verify the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis on the data composition and find existing code datasets have different characteristics according to their construction methods, which provide new insights for future code LLMs. Our models and dataset are released in this https URL
摘要:近年来,人们对如何构建更好的代码指令调优数据越来越感兴趣。然而,我们观察到,使用这些数据集训练的代码模型在HumanEval上表现出较高的性能,但在LiveCodeBench等其他基准测试上表现较差。进一步调查发现,许多数据集都存在严重的数据泄露问题。在清理了大部分泄露的数据后,一些知名的高质量数据集表现不佳。这一发现揭示了一个新的挑战:识别哪些数据集才真正称得上高质量的代码指令数据。为了解决这个问题,我们提出了一种高效的代码数据剪枝策略来选择好的样本。我们的方法基于三个维度:指令复杂性、响应质量和指令多样性。基于我们选择的数据,我们提出了XCoder,一个由LLaMA3微调而来的模型家族。我们的实验表明,XCoder使用更少的训练数据取得了新的最先进性能,这验证了我们数据策略的有效性。此外,我们对数据组成进行了全面分析,发现现有的代码数据集因其构建方法不同而具有不同的特点,这为未来的代码LLM提供了新的见解。我们的模型和数据集发布在此https URL

[NLP-31] NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
[NLP-31] NESTFUL:评估API调用嵌套序列LLM的基准

链接: https://arxiv.org/abs/2409.03797
作者: Kinjal Basu,Ibrahim Abdelaziz,Kelsey Bradford,Maxwell Crouse,Kiran Kate,Sadhana Kumaravel,Saurabh Goyal,Asim Munawar,Yara Rizk,Xin Wang,Luis Lastras,Pavan Kapanipathi
关键词-EN: Autonomous agent applications, complex real-world tasks, Application Programming Interfaces, agent applications powered, addressing complex real-world
关键词-ZH: 自主代理应用程序、复杂的现实世界任务、应用编程接口、支持代理应用程序、解决复杂的现实世界
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autonomous agent applications powered by large language models (LLMs) have recently risen to prominence as effective tools for addressing complex real-world tasks. At their core, agentic workflows rely on LLMs to plan and execute the use of tools and external Application Programming Interfaces (APIs) in sequence to arrive at the answer to a user’s request. Various benchmarks and leaderboards have emerged to evaluate an LLM’s capabilities for tool and API use; however, most of these evaluations only track single or multiple isolated API calling capabilities. In this paper, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL has a total of 300 human annotated samples divided into two types - executable and non-executable. The executable samples are curated manually by crawling Rapid-APIs whereas the non-executable samples are hand picked by human annotators from data synthetically generated using an LLM. We evaluate state-of-the-art LLMs with function calling abilities on NESTFUL. Our results show that most models do not perform well on nested APIs in NESTFUL as compared to their performance on the simpler problem settings available in existing benchmarks.
摘要:由大型语言模型(LLM)驱动的自主智能体应用近来日益突出,成为处理复杂现实任务的有效工具。其核心在于,智能体工作流依赖LLM来规划并依次执行对工具和外部应用程序编程接口(API)的调用,从而得到用户请求的答案。目前已出现各种基准测试和排行榜来评估LLM使用工具和API的能力;然而,这些评估大多只考察单个或多个相互独立的API调用能力。在本文中,我们提出了NESTFUL,一个用于评估LLM在嵌套API调用序列上表现的基准,即一个API调用的输出作为输入传递给后续调用的序列。NESTFUL共有300个人工标注的样本,分为可执行和不可执行两种类型。可执行样本是通过爬取Rapid-APIs人工整理而成的,而不可执行样本则由人工标注员从使用LLM合成生成的数据中手工挑选。我们在NESTFUL上评估了具有函数调用能力的最先进LLM。结果表明,与在现有基准中较简单的问题设置上的表现相比,大多数模型在NESTFUL的嵌套API上表现不佳。
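To make "nested sequences of API calls" concrete, here is a minimal Python sketch of a plan in which the output of one call feeds the next. The tool functions, the `$label.field$` reference syntax, and the plan layout are all hypothetical and chosen only for illustration; they are not taken from the NESTFUL dataset format.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tools: illustrative stand-ins, not real APIs from the benchmark.
def get_city_population(city: str) -> Dict[str, Any]:
    data = {"Paris": 2_100_000, "Rome": 2_800_000}  # toy values
    return {"population": data[city]}

def to_millions(value: int) -> Dict[str, Any]:
    return {"millions": value / 1_000_000}

TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_city_population": get_city_population,
    "to_millions": to_millions,
}

def resolve(arg: Any, memory: Dict[str, Dict[str, Any]]) -> Any:
    """Replace a reference such as '$step1.population$' with a stored output."""
    if isinstance(arg, str) and arg.startswith("$") and arg.endswith("$"):
        label, field = arg[1:-1].split(".", 1)
        return memory[label][field]
    return arg  # literal argument, passed through unchanged

def run_nested(calls: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Execute calls in order, storing each output under its label."""
    memory: Dict[str, Dict[str, Any]] = {}
    for call in calls:
        kwargs = {k: resolve(v, memory) for k, v in call["arguments"].items()}
        memory[call["label"]] = TOOLS[call["name"]](**kwargs)
    return memory

# The second call consumes the first call's output: a nested sequence.
PLAN = [
    {"name": "get_city_population", "label": "step1",
     "arguments": {"city": "Rome"}},
    {"name": "to_millions", "label": "step2",
     "arguments": {"value": "$step1.population$"}},
]

if __name__ == "__main__":
    print(run_nested(PLAN)["step2"])
```

A model that can only produce isolated calls would have to guess the argument of `to_millions`; in a nested setting it must instead emit a reference to the earlier result.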

[NLP-32] HSF: Defending against Jailbreak Attacks with Hidden State Filtering
[NLP-32] HSF:利用隐藏状态过滤防御越狱攻击

链接: https://arxiv.org/abs/2409.03788
作者: Cheng Qian,Hainan Zhang,Lei Sha,Zhiming Zheng
关键词-EN: ensure outputs align, avoid harmful content, LLM hidden state, content generation, jailbreak attacks
关键词-ZH: 确保输出一致、避免有害内容、LLM隐藏状态、内容生成、越狱攻击
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages

点击查看摘要

Abstract:With the growing deployment of LLMs in daily applications like chatbots and content generation, efforts to ensure outputs align with human values and avoid harmful content have intensified. However, increasingly sophisticated jailbreak attacks threaten this alignment, aiming to induce unsafe outputs. Current defense efforts either focus on prompt rewriting or detection, which are limited in effectiveness due to the various design of jailbreak prompts, or on output control and detection, which are computationally expensive as they require LLM inference. Therefore, designing a pre-inference defense method that resists diverse jailbreak prompts is crucial for preventing LLM jailbreak attacks. We observe that jailbreak attacks, safe queries, and harmful queries exhibit different clustering patterns within the LLM’s hidden state representation space. This suggests that by leveraging the LLM’s hidden state representational capabilities, we can analyze the LLM’s forthcoming behavior and proactively intervene for defense. In this paper, we propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF), a lossless architectural defense mechanism that enables the model to preemptively identify and reject adversarial inputs before the inference process begins. We activate its defensive potential through an additional plugin module, effectively framing the defense task as a classification problem. Experimental results on two benchmark datasets, utilizing three different LLMs, show that HSF significantly enhances resilience against six cutting-edge jailbreak attacks. It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries, with negligible inference overhead, and outperforming defense baselines. Our code and data are available at https://anonymous.4open.science/r/Hidden-State-Filtering-8652/
摘要:随着LLM在聊天机器人和内容生成等日常应用中的部署不断增加,确保输出符合人类价值观并避免有害内容的努力也随之加强。然而,日益复杂的越狱攻击威胁着这种对齐,旨在诱导不安全的输出。当前的防御工作要么侧重于提示重写或检测,由于越狱提示的设计多种多样而效果有限;要么侧重于输出控制和检测,由于需要LLM推理而计算代价高昂。因此,设计一种能够抵御多样化越狱提示的推理前防御方法,对于防止LLM越狱攻击至关重要。我们观察到,越狱攻击、安全查询和有害查询在LLM的隐藏状态表示空间中呈现出不同的聚类模式。这表明,通过利用LLM隐藏状态的表征能力,我们可以分析LLM即将产生的行为,并主动进行防御干预。在本文中,我们提出了一种基于隐藏状态过滤器(HSF)的越狱攻击防御策略,HSF是一种无损的架构级防御机制,使模型能够在推理过程开始之前预先识别并拒绝对抗性输入。我们通过一个额外的插件模块来激活其防御潜力,从而有效地将防御任务转化为一个分类问题。在两个基准数据集上使用三个不同LLM进行的实验结果表明,HSF显著增强了对六种前沿越狱攻击的抵御能力。它显著降低了越狱攻击的成功率,同时对良性用户查询的响应影响极小,推理开销可以忽略不计,并且优于各防御基线。我们的代码和数据可在 https://anonymous.4open.science/r/Hidden-State-Filtering-8652/ 获取

[NLP-33] Detection and Positive Reconstruction of Cognitive Distortion sentences: Mandarin Dataset and Evaluation
[NLP-33] 认知扭曲句子的检测与积极重建:普通话数据集与评估

链接: https://arxiv.org/abs/2405.15334
作者: Shuya Lin,Yuxiong Wang,Jonathan Dong,Shiguang Ni
关键词-EN: Reconstruction Framework based, Positive Reconstruction Framework, Positive Reconstruction, Reconstruction Framework, positive psychology theory
关键词-ZH: 基于重建框架,积极重建框架,积极重建,重建框架,积极心理学理论
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This research introduces a Positive Reconstruction Framework based on positive psychology theory. Overcoming negative thoughts can be challenging, our objective is to address and reframe them through a positive reinterpretation. To tackle this challenge, a two-fold approach is necessary: identifying cognitive distortions and suggesting a positively reframed alternative while preserving the original thought’s meaning. Recent studies have investigated the application of Natural Language Processing (NLP) models in English for each stage of this process. In this study, we emphasize the theoretical foundation for the Positive Reconstruction Framework, grounded in broaden-and-build theory. We provide a shared corpus containing 4001 instances for detecting cognitive distortions and 1900 instances for positive reconstruction in Mandarin. Leveraging recent NLP techniques, including transfer learning, fine-tuning pretrained networks, and prompt engineering, we demonstrate the effectiveness of automated tools for both tasks. In summary, our study contributes to multilingual positive reconstruction, highlighting the effectiveness of NLP in cognitive distortion detection and positive reconstruction.
摘要:本研究介绍了一个基于积极心理学理论的积极重建框架。克服消极想法可能颇具挑战,我们的目标是通过积极的重新诠释来应对并重构这些想法。为了应对这一挑战,需要采取双管齐下的方法:识别认知扭曲,并在保留原始想法含义的同时,提出一种积极重构的替代表述。近期研究考察了自然语言处理(NLP)模型在英语中针对这一过程各个阶段的应用。在本研究中,我们强调了积极重建框架的理论基础,该框架以拓展与建构(broaden-and-build)理论为基础。我们提供了一个普通话共享语料库,其中包含4001个用于认知扭曲检测的实例和1900个用于积极重建的实例。借助包括迁移学习、微调预训练网络和提示工程在内的最新NLP技术,我们证明了自动化工具在这两项任务上的有效性。综上所述,我们的研究为多语言积极重建做出了贡献,凸显了NLP在认知扭曲检测和积极重建方面的有效性。

人工智能

[AI-0] Accelerating Training with Neuron Interaction and Nowcasting Networks

链接: https://arxiv.org/abs/2409.04434
作者: Boris Knyazev,Abhinav Moudgil,Guillaume Lajoie,Eugene Belilovsky,Simon Lacoste-Julien
关键词-EN: classic adaptive optimizers, learnable update rule, adaptive optimizers, learnable update, lieu of classic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: code this https URL

点击查看摘要

Abstract:Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. A simpler recently proposed approach to accelerate training is to use Adam for most of the optimization steps and periodically, only every few steps, nowcast (predict future) parameters. We improve this approach by Neuron interaction and Nowcasting (NiNo) networks. NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters by learning in a supervised way from a set of training trajectories over multiple tasks. We show that in some networks, such as Transformers, neuron connectivity is non-trivial. By accurately modeling neuron connectivity, we allow NiNo to accelerate Adam training by up to 50% in vision and language tasks.
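The schedule described above (Adam for most steps, with a parameter nowcast every few steps) can be sketched in a few lines of Python. This is only an illustration of the schedule: the nowcaster here is a plain linear extrapolation of the last update, swapped in for NiNo's learned graph neural network, and the function name, `every`, and `horizon` are assumptions made for this sketch.

```python
import math
from typing import Callable

def adam_with_nowcast(grad_fn: Callable[[float], float], x0: float,
                      steps: int = 100, lr: float = 0.05,
                      every: int = 10, horizon: int = 5,
                      beta1: float = 0.9, beta2: float = 0.999,
                      eps: float = 1e-8) -> float:
    """Scalar Adam with a periodic parameter nowcast.

    Every `every` steps the parameter is pushed `horizon` extra steps along
    the most recent update direction (a linear-extrapolation stand-in for a
    learned nowcasting model). Set every=0 to run plain Adam.
    """
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # Standard Adam moment estimates with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        prev = x
        x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
        # Periodic nowcast: predict where the parameter is heading.
        if every and t % every == 0:
            x = x + horizon * (x - prev)
    return x

if __name__ == "__main__":
    grad = lambda x: x  # gradient of the toy loss 0.5 * x**2
    print("plain Adam  :", adam_with_nowcast(grad, 5.0, every=0))
    print("with nowcast:", adam_with_nowcast(grad, 5.0))
```

On this toy quadratic the extrapolation simply skips ahead along a smooth trajectory; the abstract's claim is that a learned predictor that models neuron connectivity does this more accurately on real networks.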

[AI-1] A Survey on Knowledge Organization Systems of Research Fields: Resources and Challenges

链接: https://arxiv.org/abs/2409.04432
作者: Angelo Salatino,Tanay Aggarwal,Andrea Mannocci,Francesco Osborne,Enrico Motta
关键词-EN: Knowledge Organization Systems, Organization Systems, Knowledge Organization, play a fundamental, role in categorising
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Knowledge Organization Systems (KOSs), such as term lists, thesauri, taxonomies, and ontologies, play a fundamental role in categorising, managing, and retrieving information. In the academic domain, KOSs are often adopted for representing research areas and their relationships, primarily aiming to classify research articles, academic courses, patents, books, scientific venues, domain experts, grants, software, experiment materials, and several other relevant products and agents. These structured representations of research areas, widely embraced by many academic fields, have proven effective in empowering AI-based systems to i) enhance retrievability of relevant documents, ii) enable advanced analytic solutions to quantify the impact of academic research, and iii) analyse and forecast research dynamics. This paper aims to present a comprehensive survey of the current KOS for academic disciplines. We analysed and compared 45 KOSs according to five main dimensions: scope, structure, curation, usage, and links to other KOSs. Our results reveal a very heterogeneous scenario in terms of scope, scale, quality, and usage, highlighting the need for more integrated solutions for representing research knowledge across academic fields. We conclude by discussing the main challenges and the most promising future directions.

[AI-2] Hybrid Spiking Neural Networks for Low-Power Intra-Cortical Brain-Machine Interfaces

链接: https://arxiv.org/abs/2409.04428
作者: Alexandru Vasilache,Jann Krausse,Klaus Knobloch,Juergen Becker
关键词-EN: Intra-cortical brain-machine interfaces, perform daily activities, Intra-cortical brain-machine, brain-machine interfaces, daily activities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: This work has been accepted at the 2024 IEEE Biomedical Circuits and Systems Conference

点击查看摘要

Abstract:Intra-cortical brain-machine interfaces (iBMIs) have the potential to dramatically improve the lives of people with paraplegia by restoring their ability to perform daily activities. However, current iBMIs suffer from scalability and mobility limitations due to bulky hardware and wiring. Wireless iBMIs offer a solution but are constrained by a limited data rate. To overcome this challenge, we are investigating hybrid spiking neural networks for embedded neural decoding in wireless iBMIs. The networks consist of a temporal convolution-based compression followed by recurrent processing and a final interpolation back to the original sequence length. As recurrent units, we explore gated recurrent units (GRUs), leaky integrate-and-fire (LIF) neurons, and a combination of both - spiking GRUs (sGRUs) and analyze their differences in terms of accuracy, footprint, and activation sparsity. To that end, we train decoders on the “Nonhuman Primate Reaching with Multichannel Sensorimotor Cortex Electrophysiology” dataset and evaluate it using the NeuroBench framework, targeting both tracks of the IEEE BioCAS Grand Challenge on Neural Decoding. Our approach achieves high accuracy in predicting velocities of primate reaching movements from multichannel primary motor cortex recordings while maintaining a low number of synaptic operations, surpassing the current baseline models in the NeuroBench framework. This work highlights the potential of hybrid neural networks to facilitate wireless iBMIs with high decoding precision and a substantial increase in the number of monitored neurons, paving the way toward more advanced neuroprosthetic technologies.
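For readers unfamiliar with the leaky integrate-and-fire (LIF) units mentioned above, the following Python sketch simulates a single discrete-time LIF neuron and measures its activation sparsity. The decay factor, threshold, and hard reset to zero are assumptions chosen for illustration and are not the configuration used in the paper.

```python
from typing import List

def lif_run(currents: List[float], beta: float = 0.9,
            threshold: float = 1.0) -> List[int]:
    """Simulate one leaky integrate-and-fire neuron.

    At each step the membrane potential decays by `beta`, integrates the
    input current, and emits a spike (1) when it reaches `threshold`,
    after which it is reset to zero (hard reset, an assumption here).
    """
    v = 0.0
    spikes: List[int] = []
    for i in currents:
        v = beta * v + i          # leak, then integrate
        if v >= threshold:
            spikes.append(1)      # fire
            v = 0.0               # reset
        else:
            spikes.append(0)
    return spikes

def activation_sparsity(spikes: List[int]) -> float:
    """Fraction of time steps with no spike."""
    return spikes.count(0) / len(spikes)

if __name__ == "__main__":
    out = lif_run([0.6] * 5, beta=0.5)
    print(out, activation_sparsity(out))
```

Because downstream synaptic operations are only needed when a spike occurs, a high sparsity value is what keeps the operation count low in spiking decoders.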

[AI-3] RLPF: Reinforcement Learning from Prediction Feedback for User Summarization with LLMs

链接: https://arxiv.org/abs/2409.04421
作者: Jiaxing Wu,Lin Ning,Luyang Liu,Harrison Lee,Neo Wu,Chao Wang,Sushant Prakash,Shawn O’Banion,Bradley Green,Jun Xie
关键词-EN: Large Language Models, employ Large Language, Language Models, Large Language, systems employ Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM-powered personalization agent systems employ Large Language Models (LLMs) to predict users’ behavior from their past activities. However, their effectiveness often hinges on the ability to effectively leverage extensive, long user historical data due to its inherent noise and length of such data. Existing pretrained LLMs may generate summaries that are concise but lack the necessary context for downstream tasks, hindering their utility in personalization systems. To address these challenges, we introduce Reinforcement Learning from Prediction Feedback (RLPF). RLPF fine-tunes LLMs to generate concise, human-readable user summaries that are optimized for downstream task performance. By maximizing the usefulness of the generated summaries, RLPF effectively distills extensive user history data while preserving essential information for downstream tasks. Our empirical evaluation demonstrates significant improvements in both extrinsic downstream task utility and intrinsic summary quality, surpassing baseline methods by up to 22% on downstream task performance and achieving an up to 84.59% win rate on Factuality, Abstractiveness, and Readability. RLPF also achieves a remarkable 74% reduction in context length while improving performance on 16 out of 19 unseen tasks and/or datasets, showcasing its generalizability. This approach offers a promising solution for enhancing LLM personalization by effectively transforming long, noisy user histories into informative and human-readable representations.

[AI-4] Improved Parallel Algorithm for Non-Monotone Submodular Maximization under Knapsack Constraint IJCAI

链接: https://arxiv.org/abs/2409.04415
作者: Tan D. Tran,Canh V. Pham,Dung T. K. Ha,Phuong N.H. Pham
关键词-EN: knapsack constraint problem, non-monotone submodular maximization, set of size, efficient parallel algorithm, work proposes
类目: Artificial Intelligence (cs.AI)
备注: In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), Main Track

点击查看摘要

Abstract:This work proposes an efficient parallel algorithm for non-monotone submodular maximization under a knapsack constraint over a ground set of size n. Our algorithm improves the best approximation factor of the existing parallel one from 8+ε to 7+ε with O(log n) adaptive complexity. The key idea of our approach is to create a new alternate threshold algorithmic framework. This strategy alternately constructs two disjoint candidate solutions within a constant number of sequence rounds. Then, the algorithm boosts solution quality without sacrificing the adaptive complexity. Extensive experimental studies on three applications, Revenue Maximization, Image Summarization, and Maximum Weighted Cut, show that our algorithm not only significantly increases solution quality but also requires comparable adaptivity to state-of-the-art algorithms.

[AI-5] Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

链接: https://arxiv.org/abs/2409.04410
作者: Zhuoyan Luo,Fengyuan Shi,Yixiao Ge,Yujiu Yang,Limin Wang,Ying Shan
关键词-EN: auto-regressive image generation, image generation models, generation models ranging, replication of Google, auto-regressive image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B. The Open-MAGVIT2 project produces an open-source replication of Google’s MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., 2^18 codes), and achieves the state-of-the-art reconstruction performance (1.17 rFID) on ImageNet 256 × 256. Furthermore, we explore its application in plain auto-regressive models and validate scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce “next sub-token prediction” to enhance sub-token interaction for better generation quality. We release all models and codes to foster innovation and creativity in the field of auto-regressive visual generation.
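The idea of factorizing a 2^18-entry vocabulary into two sub-vocabularies of different sizes can be illustrated with simple bit arithmetic. In this Python sketch the 6-bit/12-bit split and the function names are assumptions made for illustration; the abstract states only that the two sub-vocabularies differ in size.

```python
from typing import Tuple

CODEBOOK_BITS = 18                       # from the abstract: 2^18 codes
LOW_BITS = 12                            # assumed split point (illustrative)
HIGH_BITS = CODEBOOK_BITS - LOW_BITS     # 6 bits -> 64-entry sub-vocabulary

def factorize(token_id: int) -> Tuple[int, int]:
    """Split one large-vocabulary token into (high, low) sub-tokens."""
    if not 0 <= token_id < (1 << CODEBOOK_BITS):
        raise ValueError("token_id outside the 2^18 codebook")
    high = token_id >> LOW_BITS               # index into the small sub-vocabulary
    low = token_id & ((1 << LOW_BITS) - 1)    # index into the large sub-vocabulary
    return high, low

def merge(high: int, low: int) -> int:
    """Inverse of factorize: rebuild the original token id."""
    return (high << LOW_BITS) | low

if __name__ == "__main__":
    h, l = factorize(123456)
    print(h, l, merge(h, l))
```

With this split a model predicts over 64 and 4,096 classes instead of a single 262,144-way softmax, which is the practical motivation for predicting sub-tokens one after another.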

[AI-6] HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR

链接: https://arxiv.org/abs/2409.04398
作者: Yudi Dai,Zhiyong Wang,Xiping Lin,Chenglu Wen,Lan Xu,Siqi Shen,Yuexin Ma,Cheng Wang
关键词-EN: rich human-human interactions, large-scale indoor-outdoor scenes, dynamic digital world, diverse human motions, Scene Capture method
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Multimedia (cs.MM)
备注: 17 pages, 10 figures, Journal

点击查看摘要

Abstract:We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method, aimed at accurately and efficiently creating a dynamic digital world, containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. By utilizing body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human motions in unconstrained space without the need for external devices and pre-built maps. This affords great flexibility and accessibility for human-centered interaction and 4D scene capturing in various environments. Taking into account that IMUs can capture spatially unrestricted human poses but are prone to drift over long periods of use, while LiDAR is stable for global localization but coarse for local positions and orientations, HiSC4D employs a joint optimization method, harmonizing all sensors and utilizing environment cues, yielding promising results for long-term capture in large scenes. To promote research of egocentric human interaction in large scenes and facilitate downstream tasks, we also present a dataset, containing 8 sequences in 4 large scenes (200 to 5,000 m²), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene mesh of the environment. A variety of scenarios, such as the basketball gym and commercial street, alongside challenging human motions, such as daily greeting, one-on-one basketball playing, and tour guiding, demonstrate the effectiveness and the generalization ability of HiSC4D. The dataset and code will be made publicly available at this http URL for research purposes.

[AI-7] Question-Answering Dense Video Events

链接: https://arxiv.org/abs/2409.04388
作者: Hangyu Qin,Junbin Xiao,Angela Yao
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown excellent performance in question-answering of single-event videos. In this paper, we present question-answering dense video events, a novel task that requires answering and grounding the dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events occurring over extended time periods. To facilitate the study, we construct DeVE-QA - a dataset featuring 78K questions about 26K events on 10.6K long videos. We then benchmark and show that existing MLLMs excelling at single-event QA struggle to perform well in DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense-events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1 percent and 3.7 percent for G(round)QA accuracy on DeVE-QA and NExT-GQA respectively.

[AI-8] Provable Hyperparameter Tuning for Structured Pfaffian Settings

链接: https://arxiv.org/abs/2409.04367
作者: Maria-Florina Balcan,Anh Tuan Nguyen,Dravyansh Sharma
关键词-EN: Data-driven algorithm design, specific application domains, Data-driven algorithm, achieving better performance, algorithm design
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Data-driven algorithm design automatically adapts algorithms to specific application domains, achieving better performance. In the context of parameterized algorithms, this approach involves tuning the algorithm parameters using problem instances drawn from the problem distribution of the target application domain. While empirical evidence supports the effectiveness of data-driven algorithm design, providing theoretical guarantees for several parameterized families remains challenging. This is due to the intricate behaviors of their corresponding utility functions, which typically admit piece-wise and discontinuity structures. In this work, we present refined frameworks for providing learning guarantees for parameterized data-driven algorithm design problems in both distributional and online learning settings. For the distributional learning setting, we introduce the Pfaffian GJ framework, an extension of the classical GJ framework, capable of providing learning guarantees for function classes for which the computation involves Pfaffian functions. Unlike the GJ framework, which is limited to function classes with computation characterized by rational functions, our proposed framework can deal with function classes involving Pfaffian functions, which are much more general and widely applicable. We then show that for many parameterized algorithms of interest, their utility function possesses a refined piece-wise structure, which automatically translates to learning guarantees using our proposed framework. For the online learning setting, we provide a new tool for verifying dispersion property of a sequence of loss functions. This sufficient condition allows no-regret learning for sequences of piece-wise structured loss functions where the piece-wise structure involves Pfaffian transition boundaries.

[AI-9] Connectivity-Inspired Network for Context-Aware Recognition ECCV2024

链接: https://arxiv.org/abs/2409.04360
作者: Gianluca Carloni,Sara Colantonio
关键词-EN: paper is threefold, Contextual Attention Block, human visual system, extensive literature review, motivated neural network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: ECCV 2024 - HCV Workshop, Accepted for presentation, Submitted Manuscript Version (adapted to include author names, Acknowledgements, and reference DOIs): the version of the manuscript improved after peer review will appear in the Proceedings later

点击查看摘要

Abstract:The aim of this paper is threefold. We inform the AI practitioner about the human visual system with an extensive literature review; we propose a novel biologically motivated neural network for image classification; and, finally, we present a new plug-and-play module to model context awareness. We focus on the effect of incorporating circuit motifs found in biological brains to address visual recognition. Our convolutional architecture is inspired by the connectivity of human cortical and subcortical streams, and we implement bottom-up and top-down modulations that mimic the extensive afferent and efferent connections between visual and cognitive areas. Our Contextual Attention Block is simple and effective and can be integrated with any feed-forward neural network. It infers weights that multiply the feature maps according to their causal influence on the scene, modeling the co-occurrence of different objects in the image. We place our module at different bottlenecks to infuse a hierarchical context awareness into the model. We validated our proposals through image classification experiments on benchmark data and found a consistent improvement in performance and the robustness of the produced explanations via class activation. Our code is available at this https URL.

[AI-10] Towards Fine-Grained Webpage Fingerprinting at Scale CCS

链接: https://arxiv.org/abs/2409.04341
作者: Xiyuan Zhao,Xinhao Deng,Qi Li,Yunpeng Liu,Zhuotao Liu,Kun Sun,Ke Xu
关键词-EN: visited by Tor, analyzing encrypted traffic, Tor clients, Website Fingerprinting, traffic patterns
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

点击查看摘要

Abstract:Website Fingerprinting (WF) attacks can effectively identify the websites visited by Tor clients via analyzing encrypted traffic patterns. Existing attacks focus on identifying different websites, but their accuracy dramatically decreases when applied to identify fine-grained webpages, especially when distinguishing among different subpages of the same website. WebPage Fingerprinting (WPF) attacks face the challenges of highly similar traffic patterns and a much larger scale of webpages. Furthermore, clients often visit multiple webpages concurrently, increasing the difficulty of extracting the traffic patterns of each webpage from the obfuscated traffic. In this paper, we propose Oscar, a WPF attack based on multi-label metric learning that identifies different webpages from obfuscated traffic by transforming the feature space. Oscar can extract the subtle differences among various webpages, even those with similar traffic patterns. In particular, Oscar combines proxy-based and sample-based metric learning losses to extract webpage features from obfuscated traffic and identify multiple webpages. We prototype Oscar and evaluate its performance using traffic collected from 1,000 monitored webpages and over 9,000 unmonitored webpages in the real world. Oscar demonstrates an 88.6% improvement in the multi-label metric Recall@5 compared to the state-of-the-art attacks.
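The multi-label metric Recall@5 reported above can be sketched as follows. This Python example assumes the common definition (the fraction of truly visited webpages that appear among the top-k scored candidates); the paper's exact formulation may differ, and the page names and scores are made up for illustration.

```python
from typing import Dict, Iterable

def recall_at_k(scores: Dict[str, float], true_labels: Iterable[str],
                k: int = 5) -> float:
    """Fraction of true labels that appear among the k highest-scored labels."""
    truth = set(true_labels)
    if not truth:
        raise ValueError("true_labels must not be empty")
    # Rank candidate labels by descending score and keep the top k.
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(truth.intersection(top_k)) / len(truth)

if __name__ == "__main__":
    # Hypothetical classifier scores for one obfuscated traffic trace in
    # which the client loaded two pages concurrently.
    scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7,
              "e": 0.6, "f": 0.5, "g": 0.4}
    print(recall_at_k(scores, {"a", "g"}))  # "a" is in the top 5, "g" is not
```

Averaging this value over all test traces gives a single score that rewards recovering every concurrently visited page rather than just the most likely one.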

[AI-11] AGR: Age Group fairness Reward for Bias Mitigation in LLMs

Link: https://arxiv.org/abs/2409.04340
Authors: Shuirong Cao, Ruoxi Cheng, Zhiqiang Wang
Keywords: exhibit age biases, resulting in unequal, unequal treatment, treatment of individuals, age
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The first two authors contributed equally to this work. Corresponding to Zhiqiang Wang. ACKNOWLEDGMENT: we would like to thank the computing resources support from the State Key Laboratory of New Computer Software Technologies at Nanjing University

Abstract:LLMs can exhibit age biases, resulting in unequal treatment of individuals across age groups. While much research has addressed racial and gender biases, age bias remains little explored. The scarcity of instruction-tuning and preference datasets for age bias hampers its detection and measurement, and existing fine-tuning methods seldom address age-related fairness. In this paper, we construct age bias preference datasets and instruction-tuning datasets for RLHF. We introduce AGR, an age fairness reward to reduce differences in the response quality of LLMs across different age groups. Extensive experiments demonstrate that this reward significantly improves response accuracy and reduces performance disparities across age groups. Our source code and datasets are available at this anonymous link: https://anonymous.4open.science/r/FairRLHF-D445/readme.md

[AI-12] Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs

Link: https://arxiv.org/abs/2409.04318
Authors: Aliakbar Nafar, Kristen Brent Venable, Parisa Kordjamshidi
Keywords: Generative Large Language, Large Language Models, Generative Large, Large Language, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Generative Large Language Models (LLMs) are capable of being in-context learners. However, the underlying mechanism of in-context learning (ICL) is still a major research question, and experimental research results about how models exploit ICL are not always consistent. In this work, we propose a framework for evaluating in-context learning mechanisms, which we claim are a combination of retrieving internal knowledge and learning from in-context examples by focusing on regression tasks. First, we show that LLMs can perform regression on real-world datasets and then design experiments to measure the extent to which the LLM retrieves its internal knowledge versus learning from in-context examples. We argue that this process lies on a spectrum between these two extremes. We provide an in-depth analysis of the degrees to which these mechanisms are triggered depending on various factors, such as prior knowledge about the tasks and the type and richness of the information provided by the in-context examples. We employ three LLMs and utilize multiple datasets to corroborate the robustness of our findings. Our results shed light on how to engineer prompts to leverage meta-learning from in-context examples and foster knowledge retrieval depending on the problem being addressed.

[AI-13] Safe and Efficient Path Planning under Uncertainty via Deep Collision Probability Fields

Link: https://arxiv.org/abs/2409.04306
Authors: Felix Herrmann, Sebastian Zach, Jacopo Banfi, Jan Peters, Georgia Chalvatzaki, Davide Tateo
Keywords: Estimating collision probabilities, Deep Collision Probability, Collision Probability Fields, Estimating collision, collision probabilities
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Preprint version of a paper accepted to the IEEE Robotics and Automation Letters

Abstract:Estimating collision probabilities between robots and environmental obstacles or other moving agents is crucial to ensure safety during path planning. This is an important building block of modern planning algorithms in many application scenarios such as autonomous driving, where noisy sensors perceive obstacles. While many approaches exist, they either provide too conservative estimates of the collision probabilities or are computationally intensive due to their sampling-based nature. To deal with these issues, we introduce Deep Collision Probability Fields, a neural-based approach for computing collision probabilities of arbitrary objects with arbitrary unimodal uncertainty distributions. Our approach relegates the computationally intensive estimation of collision probabilities via sampling to the training step, allowing for fast neural network inference of the constraints during planning. In extensive experiments, we show that Deep Collision Probability Fields can produce reasonably accurate collision probabilities (up to 10^-3) for planning and that our approach can be easily plugged into standard path planning approaches to plan safe paths on 2-D maps containing uncertain static and dynamic obstacles. Additional material, code, and videos are available at this https URL.
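The sampling-based estimation that the abstract calls computationally intensive can be illustrated with a plain Monte Carlo estimate. This is a minimal sketch under simplifying assumptions that are not from the paper: the robot and obstacle are discs, and the obstacle position follows an isotropic 2-D Gaussian. The function name and parameters are hypothetical.

```python
import math
import random


def collision_probability(robot_xy, robot_radius, obstacle_mean,
                          obstacle_sigma, obstacle_radius,
                          n_samples=10000, seed=0):
    """Monte Carlo estimate of P(robot disc overlaps obstacle disc).

    The obstacle centre is drawn from an isotropic Gaussian with the given
    mean and standard deviation (an assumption made for this sketch).
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    min_dist = robot_radius + obstacle_radius
    hits = 0
    for _ in range(n_samples):
        ox = rng.gauss(obstacle_mean[0], obstacle_sigma)
        oy = rng.gauss(obstacle_mean[1], obstacle_sigma)
        # The discs overlap when their centres are closer than the radii sum.
        if math.hypot(ox - robot_xy[0], oy - robot_xy[1]) < min_dist:
            hits += 1
    return hits / n_samples
```

Each query costs n_samples distance checks, which is the kind of cost the abstract proposes to pay once at training time instead of at every planning step.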

[AI-14] CoxKAN: Kolmogorov-Arnold Networks for Interpretable High-Performance Survival Analysis

Link: https://arxiv.org/abs/2409.04290
Authors: William Knottenbelt, Zeyu Gao, Rebecca Wray, Woody Zhidong Zhang, Jiashuai Liu, Mireia Crispin-Ortuzar
Keywords: specific event occurs, branch of statistics, modeling the time, specific event, event occurs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Survival analysis is a branch of statistics used for modeling the time until a specific event occurs and is widely used in medicine, engineering, finance, and many other fields. When choosing survival models, there is typically a trade-off between performance and interpretability, where the highest performance is achieved by black-box models based on deep learning. This is a major problem in fields such as medicine where practitioners are reluctant to blindly trust black-box models to make important patient decisions. Kolmogorov-Arnold Networks (KANs) were recently proposed as an interpretable and accurate alternative to multi-layer perceptrons (MLPs). We introduce CoxKAN, a Cox proportional hazards Kolmogorov-Arnold Network for interpretable, high-performance survival analysis. We evaluate the proposed CoxKAN on 4 synthetic datasets and 9 real medical datasets. The synthetic experiments demonstrate that CoxKAN accurately recovers interpretable symbolic formulae for the hazard function, and effectively performs automatic feature selection. Evaluation on the 9 real datasets show that CoxKAN consistently outperforms the Cox proportional hazards model and achieves performance that is superior or comparable to that of tuned MLPs. Furthermore, we find that CoxKAN identifies complex interactions between predictor variables that would be extremely difficult to recognise using existing survival methods, and automatically finds symbolic formulae which uncover the precise effect of important biomarkers on patient risk.
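For context, Cox proportional hazards models are usually fitted by minimizing the negative log partial likelihood of a per-patient risk score. The sketch below computes that standard loss for precomputed scores. It assumes no tied event times, and it leaves open how the scores are produced (a linear model, an MLP, or a KAN), since the abstract gives no training details. The function name is hypothetical.

```python
import math


def cox_neg_log_partial_likelihood(times, events, risk_scores):
    """Negative log partial likelihood of the Cox model (no tie handling).

    times: observed time per subject (event or censoring time).
    events: 1 if the event was observed, 0 if the subject was censored.
    risk_scores: model output per subject (log relative hazard).
    """
    # Visit subjects from the latest time to the earliest, so the running sum
    # always covers the risk set {j : times[j] >= times[i]}.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    risk_set_sum = 0.0
    loss = 0.0
    for i in order:
        risk_set_sum += math.exp(risk_scores[i])
        if events[i]:
            loss -= risk_scores[i] - math.log(risk_set_sum)
    return loss
```

Censored subjects contribute only through the risk-set sum. A production version would subtract the maximum score before exponentiating to avoid overflow.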

[AI-15] Using Large Language Models to Generate Authentic Multi-agent Knowledge Work Datasets

Link: https://arxiv.org/abs/2409.04286
Authors: Desiree Heim, Christian Jilek, Adrian Ulges, Andreas Dengel
Keywords: collections lack diversity, Current publicly, data collections lack, knowledge work, work data collections
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted and in press (INFORMATIK Festival, Wiesbaden, 2024)

Abstract:Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents. These issues hinder objective and comparable data-driven evaluations and optimizations of knowledge work assistance systems. Due to the considerable resources needed to collect such data in real-life settings and the necessity of data censorship, collecting such a dataset appears nearly impossible. For this reason, we propose a configurable, multi-agent knowledge work dataset generator. This system simulates collaborative knowledge work among agents producing Large Language Model-generated documents and accompanying data traces. Additionally, the generator captures all background information, given in its configuration or created during the simulation process, in a knowledge graph. Finally, the resulting dataset can be utilized and shared without privacy or confidentiality concerns. This paper introduces our approach’s design and vision and focuses on generating authentic knowledge work documents using Large Language Models. Our study involving human raters who assessed 53% of the generated and 74% of the real documents as realistic demonstrates the potential of our approach. Furthermore, we analyze the authenticity criteria mentioned in the participants’ comments and elaborate on potential improvements for identified common issues.

[AI-16] Cycle Pixel Difference Network for Crisp Edge Detection

Link: https://arxiv.org/abs/2409.04272
Authors: Changsong Liu, Wei Zhang, Yanyan Liu, Mingyang Li, Wenlin Li, Yimeng Fan, Xiangnan Bai, Liang Zhang
Keywords: garnered increasing attention, computer vision, increasing attention, fundamental task, task in computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Edge detection, as a fundamental task in computer vision, has garnered increasing attention. The advent of deep learning has significantly advanced this field. However, recent deep learning-based methods which rely on large-scale pre-trained weights cannot be trained from scratch, with very limited research addressing this issue. This paper proposes a novel cycle pixel difference convolution (CPDC), which effectively integrates image gradient information with modern convolution operations. Based on the CPDC, we develop a U-shape encoder-decoder model named CPD-Net, which is a purely end-to-end network. Additionally, to address the issue of edge thickness produced by most existing methods, we construct a multi-scale information enhancement module (MSEM) to enhance the discriminative ability of the model, thereby generating crisp and clean contour maps. Comprehensive experiments conducted on three standard benchmarks demonstrate that our method achieves competitive performance on the BSDS500 dataset (ODS=0.813), NYUD-V2 (ODS=0.760), and BIPED dataset (ODS=0.898). Our approach provides a novel perspective for addressing these challenges in edge detection.

[AI-17] An overview of domain-specific foundation model: key technologies, applications and challenges

Link: https://arxiv.org/abs/2409.04267
Authors: Haolong Chen, Hanzhi Chen, Zijian Zhao, Kaifeng Han, Guangxu Zhu, Yichen Zhao, Ying Du, Wei Xu, Qingjiang Shi
Keywords: human language understanding, domain-specific foundation models, products in human, application scenarios, foundation models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The impressive performance of ChatGPT and other foundation-model-based products in human language understanding has prompted both academia and industry to explore how these models can be tailored for specific industries and application scenarios. This process, known as the customization of domain-specific foundation models, addresses the limitations of general-purpose models, which may not fully capture the unique patterns and requirements of domain-specific data. Despite its importance, there is a notable lack of comprehensive overview papers on building domain-specific foundation models, while numerous resources exist for general-purpose models. To bridge this gap, this article provides a timely and thorough overview of the methodology for customizing domain-specific foundation models. It introduces basic concepts, outlines the general architecture, and surveys key methods for constructing domain-specific models. Furthermore, the article discusses various domains that can benefit from these specialized models and highlights the challenges ahead. Through this overview, we aim to offer valuable guidance and reference for researchers and practitioners from diverse fields to develop their own customized foundation models.

[AI-18] Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices

Link: https://arxiv.org/abs/2409.04249
Authors: Xueyuan Han, Zinuo Cai, Yichu Zhang, Chongxin Fan, Junhan Liu, Ruhui Ma, Rajkumar Buyya
Keywords: achieved numerous success, recent years, achieved numerous, numerous success, success in recent
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the 42nd IEEE International Conference on Computer Design (ICCD 2024)

Abstract:The application of Transformer-based large models has achieved numerous successes in recent years. However, the exponential growth in the parameters of large models introduces a formidable memory challenge for edge deployment. Prior works addressing this challenge mainly focus on optimizing the model structure and adopting memory swapping methods. However, the former reduces the inference accuracy, and the latter raises the inference latency. This paper introduces PIPELOAD, a novel memory-efficient pipeline execution mechanism. It reduces memory usage by incorporating dynamic memory management and minimizes inference latency by employing parallel model loading. Based on the PIPELOAD mechanism, we present Hermes, a framework optimized for large model inference on edge devices. We evaluate Hermes on Transformer-based models of different sizes. Our experiments illustrate that Hermes achieves up to a 4.24× increase in inference speed and 86.7% lower memory consumption than the state-of-the-art pipeline mechanism for BERT and ViT models, and a 2.58× increase in inference speed and 90.3% lower memory consumption for GPT-style models.

[AI-19] WarpAdam: A new Adam optimizer based on Meta-Learning approach

Link: https://arxiv.org/abs/2409.04244
Authors: Chengxi Pan, Junshang Chen, Jingrui Ye
Keywords: Adam optimizer, Adam, algorithms is crucial, optimizer, Meta Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Optimal selection of optimization algorithms is crucial for training deep learning models. The Adam optimizer has gained significant attention due to its efficiency and wide applicability. However, to enhance the adaptability of optimizers across diverse datasets, we propose an innovative optimization strategy by integrating the ‘warped gradient descent’ concept from Meta Learning into the Adam optimizer. In the conventional Adam optimizer, gradients are utilized to compute estimates of gradient mean and variance, subsequently updating model parameters. Our approach introduces a learnable distortion matrix, denoted as P, which is employed for linearly transforming gradients. This transformation slightly adjusts gradients during each iteration, enabling the optimizer to better adapt to distinct dataset characteristics. By learning an appropriate distortion matrix P, our method aims to adaptively adjust gradient information across different data distributions, thereby enhancing optimization performance. Our research showcases the potential of this novel approach through theoretical insights and empirical evaluations. Experimental results across various tasks and datasets validate the superiority of our optimizer that integrates the ‘warped gradient descent’ concept in terms of adaptability. Furthermore, we explore effective strategies for training the adaptation matrix P and identify scenarios where this method can yield optimal results. In summary, this study introduces an innovative approach that merges the ‘warped gradient descent’ concept from Meta Learning with the Adam optimizer. By introducing a learnable distortion matrix P within the optimizer, we aim to enhance the model’s generalization capability across diverse data distributions, thus opening up new possibilities in the field of deep learning optimization.
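The update described in the abstract can be sketched as a standard Adam step applied to a linearly transformed gradient P·g. This is a minimal illustration, not the authors' implementation: P is passed in as a fixed matrix because the abstract does not say how it is trained, and the function names and state layout are hypothetical.

```python
import math


def init_state(n):
    """Fresh Adam state for n parameters."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def matvec(P, g):
    """Multiply matrix P (list of rows) by vector g."""
    return [sum(p_ij * g_j for p_ij, g_j in zip(row, g)) for row in P]


def warp_adam_step(theta, grad, P, state,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using the warped gradient P @ grad."""
    warped = matvec(P, grad)  # linear transformation of the raw gradient
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (w, g) in enumerate(zip(theta, warped)):
        # Exponential moving averages of the warped gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias-corrected moment estimates, as in standard Adam.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_theta.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_theta
```

With P set to the identity matrix this reduces to plain Adam, which gives a convenient sanity check before any warping is introduced.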

[AI-20] SPACE: A Python-based Simulator for Evaluating Decentralized Multi-Robot Task Allocation Algorithms

Link: https://arxiv.org/abs/2409.04230
Authors: Inmo Jang
Keywords: achieve collective goals, Swarm robotics explores, collective goals, achieve collective, central focus
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Swarm robotics explores the coordination of multiple robots to achieve collective goals, with collective decision-making being a central focus. This process involves decentralized robots autonomously making local decisions and communicating them, which influences the overall emergent behavior. Testing such decentralized algorithms in real-world scenarios with hundreds or more robots is often impractical, underscoring the need for effective simulation tools. We propose SPACE (Swarm Planning and Control Evaluation), a Python-based simulator designed to support the research, evaluation, and comparison of decentralized Multi-Robot Task Allocation (MRTA) algorithms. SPACE streamlines core algorithmic development by allowing users to implement decision-making algorithms as Python plug-ins, easily construct agent behavior trees via an intuitive GUI, and leverage built-in support for inter-agent communication and local task awareness. To demonstrate its practical utility, we implement and evaluate CBBA and GRAPE within the simulator, comparing their performance across different metrics, particularly in scenarios with dynamically introduced tasks. This evaluation shows the usefulness of SPACE in conducting rigorous and standardized comparisons of MRTA algorithms, helping to support future research in the field.

[AI-21] Advancing Multi-Organ Disease Care: A Hierarchical Multi-Agent Reinforcement Learning Framework

Link: https://arxiv.org/abs/2409.04224
Authors: Daniel J. Tan, Qianyi Xu, Kay Choong See, Dilruk Perera, Mengling Feng
Keywords: significant challenges due, present significant challenges, diseases present significant, multiple organ systems, present significant
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Multi-organ diseases present significant challenges due to their simultaneous impact on multiple organ systems, necessitating complex and adaptive treatment strategies. Despite recent advancements in AI-powered healthcare decision support systems, existing solutions are limited to individual organ systems. They often ignore the intricate dependencies between organ systems and thereby fail to provide holistic treatment recommendations that are useful in practice. We propose a novel hierarchical multi-agent reinforcement learning (HMARL) framework to address these challenges. This framework uses dedicated agents for each organ system and models their dynamics through explicit inter-agent communication channels, enabling coordinated treatment strategies across organs. Furthermore, we introduce a dual-layer state representation technique to contextualize patient conditions at various hierarchical levels, enhancing the treatment accuracy and relevance. Through extensive qualitative and quantitative evaluations in managing sepsis (a complex multi-organ disease), our approach demonstrates its ability to learn effective treatment policies that significantly improve patient survival rates. This framework marks a substantial advancement in clinical decision support systems, pioneering a comprehensive approach for multi-organ treatment recommendations.

[AI-22] GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Link: https://arxiv.org/abs/2409.04196
Authors: Lorenza Prospero, Abdullah Hamdi, Joao F. Henriques, Christian Rupprecht
Keywords: Reconstructing realistic, human-computer interfaces, creative industries, significant applications, applications in creative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Reconstructing realistic 3D human models from monocular images has significant applications in creative industries, human-computer interfaces, and healthcare. We base our work on 3D Gaussian Splatting (3DGS), a scene representation composed of a mixture of Gaussians. Predicting such mixtures for a human from a single input image is challenging, as it is a non-uniform density (with a many-to-one relationship with input pixels) with strict physical constraints. At the same time, it needs to be flexible to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) can provide an adequate density and approximate initial position for Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, as well as the other Gaussians’ attributes and the SMPL parameters. We show empirically that this combination (using only multi-view supervision) can achieve fast inference of 3D human models from a single image without test-time optimization, expensive diffusion models, or 3D points supervision. We also show that it can improve 3D pose estimation by better fitting human models that account for clothes and other variations. The code is available on the project website this https URL.

[AI-23] Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models

Link: https://arxiv.org/abs/2409.04194
Authors: Malte Luttermann, Ralf Möller, Mattis Hartwig
Keywords: combine first-order logic, relational, provide a well-established, well-established formalism, formalism to combine
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted to the Proceedings of the 47th German Conference on Artificial Intelligence (KI 2024)

Abstract:Probabilistic relational models provide a well-established formalism to combine first-order logic and probabilistic models, thereby allowing to represent relationships between objects in a relational domain. At the same time, the field of artificial intelligence requires increasingly large amounts of relational training data for various machine learning tasks. Collecting real-world data, however, is often challenging due to privacy concerns, data protection regulations, high costs, and so on. To mitigate these challenges, the generation of synthetic data is a promising approach. In this paper, we solve the problem of generating synthetic relational data via probabilistic relational models. In particular, we propose a fully-fledged pipeline to go from relational database to probabilistic relational model, which can then be used to sample new synthetic relational data points from its underlying probability distribution. As part of our proposed pipeline, we introduce a learning algorithm to construct a probabilistic relational model from a given relational database.

[AI-24] GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding

Link: https://arxiv.org/abs/2409.04183
Authors: Ziyin Zhang, Hang Yu, Shijie Li, Peng Di, Jianguo Li, Rui Wang
Keywords: Programming languages possess, possess rich semantic, languages possess rich, rich semantic information, Programming languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Programming languages possess rich semantic information such as data flow that is represented by graphs and not available from the surface form of source code. Recent code language models have scaled to billions of parameters, but model source code solely as text tokens while ignoring any other structural information. Conversely, models that do encode structural information of code make modifications to the Transformer architecture, limiting their scale and compatibility with pretrained LLMs. In this work, we take the best of both worlds with GALLa - Graph Aligned Large Language Model. GALLa utilizes graph neural networks and cross-modal alignment technologies to inject the structural information of code into LLMs as an auxiliary task during finetuning. This framework is both model-agnostic and task-agnostic, as it can be applied to any code LLM for any code downstream task, and requires the structural graph data only at training time from a corpus unrelated to the finetuning data, while incurring no cost at inference time over the baseline LLM. Experiments on five code tasks with four different baseline LLMs ranging in size from 350M to 8B validate the effectiveness of GALLa, demonstrating consistent improvement over the baseline, even for powerful models such as LLaMA3.

[AI-25] The Prevalence of Neural Collapse in Neural Multivariate Regression

Link: https://arxiv.org/abs/2409.04180
Authors: George Andriopoulos, Zixuan Dong, Li Guo, Zifan Zhao, Keith Ross
Keywords: last-layer feature vectors, Neural Collapse, exhibit Neural Collapse, feature vectors collapse, Neural Regression Collapse
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recently it has been observed that neural networks exhibit Neural Collapse (NC) during the final stage of training for the classification problem. We empirically show that multivariate regression, as employed in imitation learning and other applications, exhibits Neural Regression Collapse (NRC), a new form of neural collapse: (NRC1) The last-layer feature vectors collapse to the subspace spanned by the n principal components of the feature vectors, where n is the dimension of the targets (for univariate regression, n=1 ); (NRC2) The last-layer feature vectors also collapse to the subspace spanned by the last-layer weight vectors; (NRC3) The Gram matrix for the weight vectors converges to a specific functional form that depends on the covariance matrix of the targets. After empirically establishing the prevalence of (NRC1)-(NRC3) for a variety of datasets and network architectures, we provide an explanation of these phenomena by modeling the regression task in the context of the Unconstrained Feature Model (UFM), in which the last layer feature vectors are treated as free variables when minimizing the loss function. We show that when the regularization parameters in the UFM model are strictly positive, then (NRC1)-(NRC3) also emerge as solutions in the UFM optimization problem. We also show that if the regularization parameters are equal to zero, then there is no collapse. To our knowledge, this is the first empirical and theoretical study of neural collapse in the context of regression. This extension is significant not only because it broadens the applicability of neural collapse to a new category of problems but also because it suggests that the phenomena of neural collapse could be a universal behavior in deep learning.

[AI-26] From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

Link: https://arxiv.org/abs/2409.04168
Authors: Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth
Keywords: large language models, LLM judges, large language, judges, study LLM judges
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring the correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the used judges are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance. We observe that judges tend to choose the model of higher quality even if its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we either swap or mask the candidate answers and observe that judges often keep the original judgment, providing evidence that judges incorporate writing style in their judgments. In summary, we find that regularities in the judgments are quantifiable using statistical measures and provide various angles on exploiting them.

[AI-27] Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers

Link: https://arxiv.org/abs/2409.04142
Authors: Gorka Abad, Stjepan Picek, Lorenzo Cavallaro, Aitor Urbieta
Keywords: pretrained models downloaded, practitioners commonly, untrusted sources, high cost, commonly use pretrained
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Due to the high cost of training, large model (LM) practitioners commonly use pretrained models downloaded from untrusted sources, which could lead to owning compromised models. In-context learning is the ability of LMs to perform multiple tasks depending on the prompt or context. This can enable new attacks, such as backdoor attacks with dynamic behavior depending on how models are prompted. In this paper, we leverage the ability of vision transformers (ViTs) to perform different tasks depending on the prompts. Then, through data poisoning, we investigate two new threats: i) task-specific backdoors where the attacker chooses a target task to attack, and only the selected task is compromised at test time under the presence of the trigger. At the same time, any other task is not affected, even if prompted with the trigger. We succeeded in attacking every tested model, achieving up to 89.90% degradation on the target task. ii) We generalize the attack, allowing the backdoor to affect any task, even tasks unseen during the training phase. Our attack was successful on every tested model, achieving a maximum of 13× degradation. Finally, we investigate the robustness of prompts and fine-tuning as techniques for removing the backdoors from the model. We found that these methods fall short and, in the best case, reduce the degradation from 89.90% to 73.46%.

[AI-28] Confidence-Aware Document OCR Error Detection

Link: https://arxiv.org/abs/2409.04117
Authors: Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier
Keywords: Optical Character Recognition, Optical Character, Character Recognition, impact subsequent applications, OCR confidence scores
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.

[AI-29] Multi-Programming Language Ensemble for Code Generation in Large Language Model

Link: https://arxiv.org/abs/2409.04114
Authors: Tengfei Xue, Xuefeng Li, Tahir Azim, Roman Smirnov, Jianhui Yu, Arash Sadrieh, Babak Pahlavan
Keywords: Large language models, Large language, significantly improved code, significantly improved, code generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code available at this https URL

Abstract:Large language models (LLMs) have significantly improved code generation, particularly in one-pass code generation. However, most existing approaches focus solely on generating code in a single programming language, overlooking the potential of leveraging the multi-language capabilities of LLMs. LLMs have varying patterns of errors across different languages, suggesting that a more robust approach could be developed by leveraging these multi-language outputs. In this study, we propose Multi-Programming Language Ensemble (MPLE), a novel ensemble-based method that utilizes code generation across multiple programming languages to enhance overall performance. By treating each language-specific code generation process as an individual “weak expert” and effectively integrating their outputs, our method mitigates language-specific errors and biases. This multi-language ensemble strategy leverages the complementary strengths of different programming languages, enabling the model to produce more accurate and robust code. Our approach can be seamlessly integrated with commonly used techniques such as the reflection algorithm and Monte Carlo tree search to improve code generation quality further. Experimental results show that our framework consistently enhances baseline performance by up to 17.92% on existing benchmarks (HumanEval and HumanEval-plus), with a standout result of 96.25% accuracy on the HumanEval benchmark, achieving new state-of-the-art results across various LLM models. The code will be released at this https URL

[AI-30] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100 NLP Researchers

Link: https://arxiv.org/abs/2409.04109
Authors: Chenglei Si,Diyi Yang,Tatsunori Hashimoto
Keywords: large language models, accelerate scientific discovery, Recent advancements, works proposing research, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: main paper is 20 pages

Abstract:Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.

[AI-31] MixNet: Joining Force of Classical and Modern Approaches Toward the Comprehensive Pipeline in Motor Imagery EEG Classification

Link: https://arxiv.org/abs/2409.04104
Authors: Phairot Autthasan,Rattanaphon Chaisaen,Huy Phan,Maarten De Vos,Theerawit Wilaiprasitporn
Keywords: impacted motor imagery, significantly impacted motor, Recent advances, based brain-computer interface, motor imagery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: Supplementary materials and source codes are available on-line at this https URL

Abstract:Recent advances in deep learning (DL) have significantly impacted motor imagery (MI)-based brain-computer interface (BCI) systems, enhancing the decoding of electroencephalography (EEG) signals. However, most studies struggle to identify discriminative patterns across subjects during MI tasks, limiting MI classification performance. In this article, we propose MixNet, a novel classification framework designed to overcome this limitation by utilizing spectral-spatial signals from MI data, along with a multitask learning architecture named MIN2Net, for classification. Here, the spectral-spatial signals are generated using the filter-bank common spatial patterns (FBCSPs) method on MI data. Since the multitask learning architecture is used for the classification task, the learning in each task may exhibit different generalization rates and potential overfitting across tasks. To address this issue, we implement adaptive gradient blending, simultaneously regulating multiple loss weights and adjusting the learning pace for each task based on its generalization/overfitting tendencies. Experimental results on six benchmark data sets of different data sizes demonstrate that MixNet consistently outperforms all state-of-the-art algorithms in subject-dependent and -independent settings. Finally, the low-density EEG MI classification results show that MixNet outperforms all state-of-the-art algorithms, offering promising implications for Internet of Thing (IoT) applications, such as lightweight and portable EEG wearable devices based on low-density montages.

[AI-32] The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models

Link: https://arxiv.org/abs/2409.04103
Authors: Alberto Cattaneo,Stephen Bonner,Thomas Martynec,Carlo Luschi,Ian P Barrett,Daniel Justus
Keywords: Knowledge Graph Completion, Knowledge Graph Embedding, Graph Embedding models, Knowledge Graph, Graph Completion
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models has been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions and a new suite of analysis tools we invite the community to build upon our work and continue improving the understanding of these crucial applications.

[AI-33] Intelligent tutoring systems by Bayesian networks with noisy gates

Link: https://arxiv.org/abs/2409.04102
Authors: Alessandro Antonucci,Francesca Mangili,Claudio Bonesana,Giorgia Adorni
Keywords: Directed graphical models, Directed graphical, implement intelligent tutoring, implement intelligent, purely automatic
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Directed graphical models such as Bayesian nets are often used to implement intelligent tutoring systems able to interact in real-time with learners in a purely automatic way. When coping with such models, keeping a bound on the number of parameters might be important for multiple reasons. First, as these models are typically based on expert knowledge, a huge number of parameters to elicit might discourage practitioners from adopting them. Moreover, the number of model parameters affects the complexity of the inferences, while a fast computation of the queries is needed for real-time feedback. We advocate logical gates with uncertainty for a compact parametrization of the conditional probability tables in the underlying Bayesian net used by tutoring systems. We discuss the semantics of the model parameters to elicit and the assumptions required to apply such approach in this domain. We also derive a dedicated inference scheme to speed up computations.
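
To make the compactness argument concrete, here is a minimal sketch of a noisy-OR gate, the textbook example of a logical gate with uncertainty. It is an illustration only: the abstract does not say which gates the paper uses, and the function names, the `leak` parameter, and the three-skill example are assumptions introduced here.

```python
from itertools import product


def noisy_or(active, inhibitors, leak=0.0):
    """Return P(Y = 1 | parents) under a noisy-OR gate.

    active[i] is True when parent i is on. inhibitors[i] is the probability
    that an active parent i fails to switch Y on. leak is the probability
    that Y is on when no parent is active.
    """
    p_off = 1.0 - leak
    for is_on, q in zip(active, inhibitors):
        if is_on:
            p_off *= q
    return 1.0 - p_off


def full_cpt(inhibitors, leak=0.0):
    """Expand the gate into the full table: 2**n rows from n (+1) parameters."""
    n = len(inhibitors)
    return {
        bits: noisy_or(bits, inhibitors, leak)
        for bits in product((False, True), repeat=n)
    }


# Hypothetical tutoring node: three skills, each able to produce a correct answer.
inhibitors = [0.2, 0.5, 0.1]
cpt = full_cpt(inhibitors)
print(len(inhibitors), "elicited parameters ->", len(cpt), "table rows")
```

With n parent skills, an expert elicits n inhibitor values instead of 2**n table entries, which is the kind of saving the abstract argues for.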

[AI-34] SDformerFlow: Spatiotemporal swin spikeformer for event-based optical flow estimation

Link: https://arxiv.org/abs/2409.04082
Authors: Yi Tian,Juan Andrade-Cetto
Keywords: event streams capturing, cameras generate asynchronous, Event cameras generate, sparse event streams, light intensity
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Event cameras generate asynchronous and sparse event streams capturing changes in light intensity. They offer significant advantages over conventional frame-based cameras, such as a higher dynamic range and an extremely faster data rate, making them particularly useful in scenarios involving fast motion or challenging lighting conditions. Spiking neural networks (SNNs) share similar asynchronous and sparse characteristics and are well-suited for processing data from event cameras. Inspired by the potential of transformers and spike-driven transformers (spikeformers) in other computer vision tasks, we propose two solutions for fast and robust optical flow estimation for event cameras: STTFlowNet and SDformerFlow. STTFlowNet adopts a U-shaped artificial neural network (ANN) architecture with spatiotemporal shifted window self-attention (swin) transformer encoders, while SDformerFlow presents its fully spiking counterpart, incorporating swin spikeformer encoders. Furthermore, we present two variants of the spiking version with different neuron models. Our work is the first to make use of spikeformers for dense optical flow estimation. We conduct end-to-end training for all models using supervised learning. Our results yield state-of-the-art performance among SNN-based event optical flow methods on both the DSEC and MVSEC datasets, and show significant reduction in power consumption compared to the equivalent ANNs.

[AI-35] UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Link: https://arxiv.org/abs/2409.04081
Authors: Yicheng Fu,Raviteja Anantha,Prabal Vashisht,Jianpeng Cheng,Etai Littwin
Keywords: Generating user intent, Generating user, intent, user intent, Generating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters, computing power, and high latency makes them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, “Intent in the Wild” (IIW) and “Intent in the Tame” (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across two datasets. Notably, UI-JEPA accomplishes the performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency in the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding.

[AI-36] AnyMatch – Efficient Zero-Shot Entity Matching with a Small Language Model

Link: https://arxiv.org/abs/2409.04073
Authors: Zeyu Zhang,Paul Groth,Iacer Calixto,Sebastian Schelter
Keywords: records refer, product catalogs, catalogs or address, Entity matching, zero-shot entity matching
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 12 pages excluding references, 3 figures, and 5 tables

Abstract:Entity matching (EM) is the problem of determining whether two records refer to same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, in a comparison to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall, and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude less parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens). 

[AI-37] An Argumentative Approach for Explaining Preemption in Soft-Constraint Based Norms ECAI2024

Link: https://arxiv.org/abs/2409.04065
Authors: Wachara Fungwacharakorn,Kanae Tsushima,Hiroshi Hosobe,Hideaki Takeda,Ken Satoh
Keywords: challenging to understand, norms, Abstract, based, preemption
Categories: Artificial Intelligence (cs.AI)
Comments: submitted to VECOMP/AICOM 2024 associated with 27th European Conference on Artificial Intelligence (ECAI2024)

Abstract:Although various aspects of soft-constraint based norms have been explored, it is still challenging to understand preemption. Preemption is a situation where higher-level norms override lower-level norms when new information emerges. To address this, we propose a derivation state argumentation framework (DSA-framework). DSA-framework incorporates derivation states to explain how preemption arises based on evolving situational knowledge. Based on DSA-framework, we present an argumentative approach for explaining preemption. We formally prove that, under local optimality, DSA-framework can provide explanations why one consequence is obligatory or forbidden by soft-constraint based norms represented as logical constraint hierarchies.

[AI-38] D4: Text-guided diffusion model-based domain adaptive data augmentation for vineyard shoot detection

Link: https://arxiv.org/abs/2409.04060
Authors: Kentaro Hirahara,Chikahito Nakane,Hajime Ebisawa,Tsuyoshi Kuroda,Yohei Iwaki,Tomoyoshi Utsumi,Yuichiro Nomura,Makoto Koike,Hiroshi Mineno
Keywords: training data, data augmentation method, plant phenotyping, gaining attention, generative data augmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In an agricultural field, plant phenotyping using object detection models is gaining attention. However, collecting the training data necessary to create generic and high-precision models is extremely challenging due to the difficulty of annotation and the diversity of domains. Furthermore, it is difficult to transfer training data across different crops, and although machine learning models effective for specific environments, conditions, or crops have been developed, they cannot be widely applied in actual fields. In this study, we propose a generative data augmentation method (D4) for vineyard shoot detection. D4 uses a pre-trained text-guided diffusion model based on a large number of original images culled from video data collected by unmanned ground vehicles or other means, and a small number of annotated datasets. The proposed method generates new annotated images with background information adapted to the target domain while retaining annotation information necessary for object detection. In addition, D4 overcomes the lack of training data in agriculture, including the difficulty of annotation and diversity of domains. We confirmed that this generative data augmentation method improved the mean average precision by up to 28.65% for the BBox detection task and the average precision by up to 13.73% for the keypoint detection task for vineyard shoot detection. Our generative data augmentation method D4 is expected to simultaneously solve the cost and domain diversity issues of training data generation in agriculture and improve the generalization performance of detection models.

[AI-39] Refining Wikidata Taxonomy using Large Language Models

Link: https://arxiv.org/abs/2409.04056
Authors: Yiwen Peng(IP Paris),Thomas Bonald(IP Paris),Mehwish Alam(IP Paris)
Keywords: Large Language Models, collaborative nature, taxonomic paths, presence of cycles, recurrent issues
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: ACM International Conference on Information and Knowledge Management, Oct 2024, Boise, Idaho, United States

Abstract:Due to its collaborative nature, Wikidata is known to have a complex taxonomy, with recurrent issues like the ambiguity between instances and classes, the inaccuracy of some taxonomic paths, the presence of cycles, and the high level of redundancy across classes. Manual efforts to clean up this taxonomy are time-consuming and prone to errors or subjective decisions. We present WiKC, a new version of Wikidata taxonomy cleaned automatically using a combination of Large Language Models (LLMs) and graph mining techniques. Operations on the taxonomy, such as cutting links or merging classes, are performed with the help of zero-shot prompting on an open-source LLM. The quality of the refined taxonomy is evaluated from both intrinsic and extrinsic perspectives, on a task of entity typing for the latter, showing the practical interest of WiKC.

[AI-40] A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage

Link: https://arxiv.org/abs/2409.04040
Authors: Huan Yang,Deyu Zhang,Yudong Zhao,Yuanchun Li,Yunxin Liu
Keywords: Running LLMs, attention recently due, on-device LLM inference, privacy preservation, LLM inference
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Running LLMs on end devices has garnered significant attention recently due to their advantages in privacy preservation. With the advent of lightweight LLM models and specially designed GPUs, on-device LLM inference has achieved the necessary accuracy and performance metrics. However, we have identified that LLM inference on GPUs can leak privacy-sensitive intermediate information, specifically the KV pairs. An attacker could exploit these KV pairs to reconstruct the entire user conversation, leading to significant vulnerabilities. Existing solutions, such as Fully Homomorphic Encryption (FHE) and Trusted Execution Environments (TEE), are either too computation-intensive or resource-limited. To address these issues, we designed KV-Shield, which operates in two phases. In the initialization phase, it permutes the weight matrices so that all KV pairs are correspondingly permuted. During the runtime phase, the attention vector is inversely permuted to ensure the correctness of the layer output. All permutation-related operations are executed within the TEE, ensuring that insecure GPUs cannot access the original KV pairs, thus preventing conversation reconstruction. Finally, we theoretically analyze the correctness of KV-Shield, along with its advantages and overhead.
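
The permute-then-invert idea can be sketched for a single attention head. This is a toy illustration under stated assumptions, not the paper's KV-Shield implementation: the choice to permute the columns of the query, key, and value projections with one shared permutation, the helper names, and the fixed `perm` are all hypothetical, and no security property is claimed for the sketch.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def permute_columns(m, perm):
    # Column j of the result is column perm[j] of m.
    return [[row[p] for p in perm] for row in m]


def inverse(perm):
    inv = [0] * len(perm)
    for j, p in enumerate(perm):
        inv[p] = j
    return inv


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head attention; also returns K and V as the compute device sees them."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), k, v


rng = random.Random(0)
d = 4


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(3, d)
w_q, w_k, w_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
perm = [2, 0, 3, 1]  # hypothetical secret permutation, chosen once at initialization

plain_out, plain_k, plain_v = attention(x, w_q, w_k, w_v)

# Initialization phase (assumed): permute the projection weights.
shielded_out, seen_k, seen_v = attention(
    x,
    permute_columns(w_q, perm),
    permute_columns(w_k, perm),
    permute_columns(w_v, perm),
)

# Runtime phase (assumed): undo the permutation on the attention output.
restored = permute_columns(shielded_out, inverse(perm))
```

Because queries and keys share the permutation, their dot products are unchanged, so the attention weights match the unpermuted run. Only the value dimension needs to be restored, which is what the final inverse permutation does.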

[AI-41] BFA-YOLO: Balanced multiscale object detection network for multi-view building facade attachments detection

Link: https://arxiv.org/abs/2409.04025
Authors: Yangguang Chen,Tong Wang,Guanzhou Chen,Kun Zhu,Xiaoliang Tan,Jiaqi Wang,Hong Xie,Wenlin Zhou,Jingyi Zhao,Qing Wang,Xiaolong Luo,Xiaodong Zhang
Keywords: air conditioner units, glass curtain walls, curtain walls plays, facade attachments detection, facade attachments
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages

Abstract:Detection of building facade attachments such as doors, windows, balconies, air conditioner units, billboards, and glass curtain walls plays a pivotal role in numerous applications. Building facade attachments detection aids in building information modeling (BIM) construction and meeting Level of Detail 3 (LOD3) standards. Yet, it faces challenges like uneven object distribution, small object detection difficulty, and background interference. To counter these, we propose BFA-YOLO, a model for detecting facade attachments in multi-view images. BFA-YOLO incorporates three novel innovations: the Feature Balanced Spindle Module (FBSM) for addressing uneven distribution, the Target Dynamic Alignment Task Detection Head (TDATH) aimed at improving small object detection, and the Position Memory Enhanced Self-Attention Mechanism (PMESA) to combat background interference, with each component specifically designed to solve its corresponding challenge. Detection efficacy of deep network models deeply depends on the dataset’s characteristics. Existing open source datasets related to building facades are limited by their single perspective, small image pool, and incomplete category coverage. We propose a novel method for building facade attachments detection dataset construction and construct the BFA-3D dataset for facade attachments detection. The BFA-3D dataset features multi-view, accurate labels, diverse categories, and detailed classification. BFA-YOLO surpasses YOLOv8 by 1.8% and 2.9% in mAP@0.5 on the multi-view BFA-3D and street-view Facade-WHU datasets, respectively. These results underscore BFA-YOLO’s superior performance in detecting facade attachments.

[AI-42] Searching for Effective Preprocessing Method and CNN-based Architecture with Efficient Channel Attention on Speech Emotion Recognition

Link: https://arxiv.org/abs/2409.04007
Authors: Byunggun Kim,Younghun Kwon
Keywords: classifies human emotions, SER, classifies human, Speech, model
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Speech emotion recognition (SER) classifies human emotions in speech with a computer model. Recently, performance in SER has steadily increased as deep learning techniques have adapted. However, unlike many domains that use speech data, data for training in the SER model is insufficient. This causes overfitting of training of the neural network, resulting in performance degradation. In fact, successful emotion recognition requires an effective preprocessing method and a model structure that efficiently uses the number of weight parameters. In this study, we propose using eight dataset versions with different frequency-time resolutions to search for an effective emotional speech preprocessing method. We propose a 6-layer convolutional neural network (CNN) model with efficient channel attention (ECA) to pursue an efficient model structure. In particular, the well-positioned ECA blocks can improve channel feature representation with only a few parameters. With the interactive emotional dyadic motion capture (IEMOCAP) dataset, increasing the frequency resolution in preprocessing emotional speech can improve emotion recognition performance. Also, ECA after the deep convolution layer can effectively increase channel feature representation. Consequently, the best result (79.37 UA, 79.68 WA) can be obtained, exceeding the performance of previous SER models. Furthermore, to compensate for the lack of emotional speech data, we experiment with multiple preprocessing data methods that augment trainable data preprocessed with all different settings from one sample. In the experiment, we can achieve the highest result (80.28 UA, 80.46 WA).
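
For readers unfamiliar with efficient channel attention, the sketch below shows the general ECA recipe in plain Python: pool each channel to one value, mix neighbouring channels with a small shared 1-D kernel, and gate each channel with a sigmoid. It is a generic illustration, not the paper's 6-layer CNN. The function name `eca`, the zero padding at the channel borders, and the toy feature maps are assumptions made here.

```python
import math


def eca(feature_maps, kernel):
    """Apply efficient channel attention to a list of C channels (each H x W).

    kernel is an odd-length list of 1-D convolution weights shared by all
    channels; its length is the block's entire parameter count.
    """
    # 1. Global average pooling: one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]

    # 2. 1-D convolution across neighbouring channels (zero padded at the ends).
    half = len(kernel) // 2
    mixed = []
    for i in range(len(pooled)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(pooled):
                acc += w * pooled[idx]
        mixed.append(acc)

    # 3. Sigmoid turns each mixed descriptor into a gate in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]

    # 4. Rescale every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(feature_maps, gates)]
    return scaled, gates


# Toy input: three constant 2 x 2 channels.
maps = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
scaled, gates = eca(maps, [0.0, 1.0, 0.0])
```

The kernel here has three weights regardless of how many channels the layer has, which is why such a block adds only a few parameters.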

[AI-43] Confidential Computing on NVIDIA H100 GPU: A Performance Benchmark Study

Link: https://arxiv.org/abs/2409.03992
Authors: Jianwei Zhu,Hang Yin,Shunfan Zhou
Keywords: Trusted Execution Environments, enabling Trusted Execution, Execution Environments, Trusted Execution, large language model
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Abstract:This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various models and token lengths, focusing on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results show that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily due to data transfer. For most typical LLM queries, the overhead remains below 5%, with larger models and longer sequences experiencing near-zero overhead.

[AI-44] FODA-PG for Enhanced Medical Imaging Narrative Generation: Adaptive Differentiation of Normal and Abnormal Attributes

Link: https://arxiv.org/abs/2409.03947
Authors: Kai Shu,Yuzhuo Jia,Ziyang Zhang,Jiechao Gao
Keywords: Automatic Medical Imaging, Medical Imaging Narrative, Imaging Narrative generation, Imaging Narrative, Narrative generation aims
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automatic Medical Imaging Narrative generation aims to alleviate the workload of radiologists by producing accurate clinical descriptions directly from radiological images. However, the subtle visual nuances and domain-specific terminology in medical images pose significant challenges compared to generic image captioning tasks. Existing approaches often neglect the vital distinction between normal and abnormal findings, leading to suboptimal performance. In this work, we propose FODA-PG, a novel Fine-grained Organ-Disease Adaptive Partitioning Graph framework that addresses these limitations through domain-adaptive learning. FODA-PG constructs a granular graphical representation of radiological findings by separating disease-related attributes into distinct “disease-specific” and “disease-free” categories based on their clinical significance and location. This adaptive partitioning enables our model to capture the nuanced differences between normal and pathological states, mitigating the impact of data biases. By integrating this fine-grained semantic knowledge into a powerful transformer-based architecture and providing rigorous mathematical justifications for its effectiveness, FODA-PG generates precise and clinically coherent reports with enhanced generalization capabilities. Extensive experiments on the IU-Xray and MIMIC-CXR benchmarks demonstrate the superiority of our approach over state-of-the-art methods, highlighting the importance of domain adaptation in medical report generation.

[AI-45] HUMOS: Human Motion Model Conditioned on Body Shape ECCV’24

Link: https://arxiv.org/abs/2409.03944
Authors: Shashank Tripathi,Omid Taheri,Christoph Lassner,Michael J. Black,Daniel Holden,Carsten Stoll
Keywords: Generating realistic human, graphics applications, Generating realistic, computer vision, vision and graphics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in ECCV’24. Project page: this https URL

Abstract:Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don’t match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it’s possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods. More details are available on our project page this https URL.

[AI-46] Harnessing LLMs for Cross-City OD Flow Prediction

Link: https://arxiv.org/abs/2409.03937
Authors: Chenyang Yu,Xinpeng Xie,Yan Huang,Chenxi Qiu
Keywords: Large Language Models, transportation management, planning and transportation, flow prediction, employing Large Language
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 18 figures

Abstract:Understanding and predicting Origin-Destination (OD) flows is crucial for urban planning and transportation management. Traditional OD prediction models, while effective within single cities, often face limitations when applied across different cities due to varied traffic conditions, urban layouts, and socio-economic factors. In this paper, by employing Large Language Models (LLMs), we introduce a new method for cross-city OD flow prediction. Our approach leverages the advanced semantic understanding and contextual learning capabilities of LLMs to bridge the gap between cities with different characteristics, providing a robust and adaptable solution for accurate OD flow prediction that can be transferred from one city to another. Our novel framework involves four major components: collecting OD training datasets from a source city, instruction-tuning the LLMs, predicting destination POIs in a target city, and identifying the locations that best match the predicted destination POIs. We introduce a new loss function that integrates POI semantics and trip distance during training. By extracting high-quality semantic features from human mobility and POI data, the model understands spatial and functional relationships within urban spaces and captures interactions between individuals and various POIs. Extensive experimental results demonstrate the superiority of our approach over the state-of-the-art learning-based methods in cross-city OD flow prediction.

[AI-47] The Role of Generative Systems in Historical Photography Management: A Case Study on Catalan Archives ECCV

Link: https://arxiv.org/abs/2409.03911
Authors: Èric Śanchez, Adrià Molina, Oriol Ramos Terrades
Keywords: automated photography management, heritage institutions, image analysis, analysis in automated, automated photography
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV workshop AI4DH

Abstract:The use of image analysis in automated photography management is an increasing trend in heritage institutions. Such tools reduce the human cost of manual, expensive annotation of new data sources while giving citizens fast access through online indexes and search engines. However, available tagging and description tools are usually designed around modern photographs in English, neglecting historical corpora in minoritized languages, each of which exhibits intrinsic particularities. The primary objective of this research is to study the quantitative contribution of generative systems to the description of historical sources. This is done by contextualizing the task of captioning historical photographs from the Catalan archives as a case study. Our findings provide practitioners with tools and directions on transfer learning for captioning models based on visual adaptation and linguistic proximity.

[AI-48] Multi-agent Path Finding for Mixed Autonomy Traffic Coordination

Link: https://arxiv.org/abs/2409.03881
Authors: Han Zheng, Zhongxia Yan, Cathy Wu
Keywords: Automated Vehicles, Connected and Automated, autonomous driving systems, Priority Based Search, Human-Driven Vehicles
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:In the evolving landscape of urban mobility, the prospective integration of Connected and Automated Vehicles (CAVs) with Human-Driven Vehicles (HDVs) presents a complex array of challenges and opportunities for autonomous driving systems. While recent advancements in robotics have yielded Multi-Agent Path Finding (MAPF) algorithms tailored for agent coordination tasks characterized by simplified kinematics and complete control over agent behaviors, these solutions are inapplicable in mixed-traffic environments where uncontrollable HDVs must coexist and interact with CAVs. Addressing this gap, we propose the Behavior Prediction Kinematic Priority Based Search (BK-PBS), which leverages an offline-trained conditional prediction model to forecast HDV responses to CAV maneuvers, integrating these insights into a Priority Based Search (PBS) where the A* search proceeds over motion primitives to accommodate kinematic constraints. We compare BK-PBS with CAV planning algorithms derived from rule-based car-following models and from reinforcement learning. Through comprehensive simulation of a highway merging scenario across diverse CAV penetration rates and traffic densities, BK-PBS outperforms these baselines in reducing collision rates and improving system-level travel delay. Our work is directly applicable to many scenarios of multi-human multi-robot coordination.

[AI-49] Cost-Control in Display Advertising: Theory vs Practice

Link: https://arxiv.org/abs/2409.03874
Authors: Anoop R Katti, Rui C. Gonçalves, Rinchin Iakovlev
Keywords: display advertising, dual variables, achieve a marketing, marketing objective, optimal bidding formula
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In display advertising, advertisers want to achieve a marketing objective with constraints on budget and cost-per-outcome. This is usually formulated as an optimization problem that maximizes the total utility under constraints. The optimization is carried out in an online fashion in the dual space: for an incoming ad auction, a bid is placed using an optimal bidding formula, assuming optimal values for the dual variables; based on the outcome of the previous auctions, the dual variables are updated in an online fashion. While this approach is theoretically sound, in practice, the dual variables are not optimal from the beginning, but rather converge over time. Specifically, for the cost-constraint, the convergence is asymptotic. As a result, we find that cost-control is ineffective. In this work, we analyse the shortcomings of the optimal bidding formula and propose a modification that deviates from the theoretical derivation. We simulate various practical scenarios and study the cost-control behaviors of the two algorithms. Through a large-scale evaluation on real-world data, we show that the proposed modification reduces the cost violations by 50%, thereby achieving better cost-control than the theoretical bidding formula.

[AI-50] MetaBGM: Dynamic Soundtrack Transformation For Continuous Multi-Scene Experiences With Ambient Awareness And Personalization

Link: https://arxiv.org/abs/2409.03844
Authors: Haoxuan Liu, Zihao Wang, Haorong Hong, Youwei Feng, Jiaxin Yu, Han Diao, Yunfei Xu, Kejun Zhang
Keywords: paper introduces MetaBGM, real-time user interactions, paper introduces, groundbreaking framework, framework for generating
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:This paper introduces MetaBGM, a groundbreaking framework for generating background music that adapts to dynamic scenes and real-time user interactions. We define multi-scene as variations in environmental contexts, such as transitions in game settings or movie scenes. To tackle the challenge of converting backend data into music description texts for audio generation models, MetaBGM employs a novel two-stage generation approach that transforms continuous scene and user state data into these texts, which are then fed into an audio generation model for real-time soundtrack creation. Experimental results demonstrate that MetaBGM effectively generates contextually relevant and dynamic background music for interactive applications.

[AI-51] PARCO: Learning Parallel Autoregressive Policies for Efficient Multi-Agent Combinatorial Optimization

Link: https://arxiv.org/abs/2409.03811
Authors: Federico Berto, Chuanbo Hua, Laurin Luttmann, Jiwoo Son, Junyoung Park, Kyuree Ahn, Changhyun Kwon, Lin Xie, Jinkyoo Park
Keywords: great practical relevance, present challenges due, NP-hard combinatorial nature, Multi-agent combinatorial optimization, multi-agent combinatorial problems
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)

Abstract:Multi-agent combinatorial optimization problems such as routing and scheduling have great practical relevance but present challenges due to their NP-hard combinatorial nature, hard constraints on the number of possible agents, and hard-to-optimize objective functions. This paper introduces PARCO (Parallel AutoRegressive Combinatorial Optimization), a novel approach that learns fast surrogate solvers for multi-agent combinatorial problems with reinforcement learning by employing parallel autoregressive decoding. We propose a model with a Multiple Pointer Mechanism to efficiently decode multiple decisions simultaneously by different agents, enhanced by a Priority-based Conflict Handling scheme. Moreover, we design specialized Communication Layers that enable effective agent collaboration, thus enriching decision-making. We evaluate PARCO in representative multi-agent combinatorial problems in routing and scheduling and demonstrate that our learned solvers offer competitive results against both classical and neural baselines in terms of both solution quality and speed. We make our code openly available at this https URL.

[AI-52] How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Link: https://arxiv.org/abs/2409.03810
Authors: Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
Keywords: growing interest, interest in studying, Recently, data, code
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs. Our models and dataset are released at this https URL

[AI-53] NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

Link: https://arxiv.org/abs/2409.03797
Authors: Kinjal Basu, Ibrahim Abdelaziz, Kelsey Bradford, Maxwell Crouse, Kiran Kate, Sadhana Kumaravel, Saurabh Goyal, Asim Munawar, Yara Rizk, Xin Wang, Luis Lastras, Pavan Kapanipathi
Keywords: Autonomous agent applications, complex real-world tasks, Application Programming Interfaces, agent applications powered, addressing complex real-world
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Autonomous agent applications powered by large language models (LLMs) have recently risen to prominence as effective tools for addressing complex real-world tasks. At their core, agentic workflows rely on LLMs to plan and execute the use of tools and external Application Programming Interfaces (APIs) in sequence to arrive at the answer to a user’s request. Various benchmarks and leaderboards have emerged to evaluate an LLM’s capabilities for tool and API use; however, most of these evaluations only track single or multiple isolated API calling capabilities. In this paper, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL has a total of 300 human annotated samples divided into two types - executable and non-executable. The executable samples are curated manually by crawling Rapid-APIs whereas the non-executable samples are hand picked by human annotators from data synthetically generated using an LLM. We evaluate state-of-the-art LLMs with function calling abilities on NESTFUL. Our results show that most models do not perform well on nested APIs in NESTFUL as compared to their performance on the simpler problem settings available in existing benchmarks.
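
To make "nested sequences of API calls" concrete, here is a minimal Python sketch of a call plan in which a later call consumes the output of an earlier one. Everything in it is an illustrative assumption: the two toy tools, the `$label.field` reference syntax, and the `execute` helper are hypothetical and are not taken from NESTFUL or its data format.

```python
# Hypothetical tools; neither name nor behavior comes from the benchmark.
def get_coordinates(city: str) -> dict:
    table = {"Paris": {"lat": 48.86, "lon": 2.35}}
    return table[city]


def get_timezone(lat: float, lon: float) -> str:
    inside = 41.0 < lat < 52.0 and -5.0 < lon < 10.0
    return "Europe/Paris" if inside else "UTC"


TOOLS = {"get_coordinates": get_coordinates, "get_timezone": get_timezone}

# A nested plan: the second call references fields of the first call's output.
# The "$label.field" syntax is an assumption made for this sketch.
plan = [
    {"name": "get_coordinates", "label": "var1", "arguments": {"city": "Paris"}},
    {
        "name": "get_timezone",
        "label": "var2",
        "arguments": {"lat": "$var1.lat", "lon": "$var1.lon"},
    },
]


def resolve(value, results):
    """Replace a '$label.field' reference with the stored output it points to."""
    if isinstance(value, str) and value.startswith("$"):
        label, _, field = value[1:].partition(".")
        output = results[label]
        return output[field] if field else output
    return value


def execute(plan, tools):
    """Run the calls in order, feeding earlier outputs into later arguments."""
    results = {}
    for step in plan:
        args = {k: resolve(v, results) for k, v in step["arguments"].items()}
        results[step["label"]] = tools[step["name"]](**args)
    return results


results = execute(plan, TOOLS)
print(results["var2"])  # Europe/Paris
```

A model that predicts each call in isolation can name the right functions and still fail here, because the second call is only correct if its arguments point at the right fields of the first result.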

[AI-54] Protecting Activity Sensing Data Privacy Using Hierarchical Information Dissociation

Link: https://arxiv.org/abs/2409.03796
Authors: Guangjing Wang, Hanqing Guo, Yuanda Wang, Bocheng Chen, Ce Zhou, Qiben Yan
Keywords: offering personalized services, Smartphones and wearable, sensing data, information, daily lives
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Smartphones and wearable devices have been integrated into our daily lives, offering personalized services. However, many apps become overprivileged as their collected sensing data contains unnecessary sensitive information. For example, mobile sensing data could reveal private attributes (e.g., gender and age) and unintended sensitive features (e.g., hand gestures when entering passwords). To prevent sensitive information leakage, existing methods must obtain private labels and users need to specify privacy policies. However, they only achieve limited control over information disclosure. In this work, we present Hippo to dissociate hierarchical information including private metadata and multi-grained activity information from the sensing data. Hippo achieves fine-grained control over the disclosure of sensitive information without requiring private labels. Specifically, we design a latent guidance-based diffusion model, which generates multi-grained versions of raw sensor data conditioned on hierarchical latent activity features. Hippo enables users to control the disclosure of sensitive information in sensing data, ensuring their privacy while preserving the necessary features to meet the utility requirements of applications. Hippo is the first unified model that achieves two goals: perturbing the sensitive attributes and controlling the disclosure of sensitive information in mobile sensing data. Extensive experiments show that Hippo can anonymize personal attributes and transform activity information at various resolutions across different types of sensing data.

[AI-55] Security Implications and Mitigation Strategies in MPLS Networks

Link: https://arxiv.org/abs/2409.03795
Authors: Ayush Thakur
Keywords: Multiprotocol Label Switching, high-performance telecommunications technology, short path labels, long network addresses, Multiprotocol Label
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Abstract:Multiprotocol Label Switching (MPLS) is a high-performance telecommunications technology that directs data from one network node to another based on short path labels rather than long network addresses. Its efficiency and scalability have made it a popular choice for large-scale and enterprise networks. However, as MPLS networks grow and evolve, they encounter various security challenges. This paper explores the security implications associated with MPLS networks, including risks such as label spoofing, traffic interception, and denial of service attacks. Additionally, it evaluates advanced mitigation strategies to address these vulnerabilities, leveraging mathematical models and security protocols to enhance MPLS network resilience. By integrating theoretical analysis with practical solutions, this paper aims to provide a comprehensive understanding of MPLS security and propose effective methods for safeguarding network infrastructure.

[AI-56] Safeguarding AI Agents: Developing and Analyzing Safety Architectures

Link: https://arxiv.org/abs/2409.03793
Authors: Ishaan Domkundwar, Mukunda N S
Keywords: large language models, demonstrated exceptional capabilities, specifically powered, language models, powered by large
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:AI agents, specifically powered by large language models, have demonstrated exceptional capabilities in various applications where precision and efficacy are necessary. However, these agents come with inherent risks, including the potential for unsafe or biased actions, vulnerability to adversarial attacks, lack of transparency, and tendency to generate hallucinations. As AI agents become more prevalent in critical sectors of the industry, the implementation of effective safety protocols becomes increasingly important. This paper addresses the critical need for safety measures in AI systems, especially ones that collaborate with human teams. We propose and evaluate three frameworks to enhance safety protocols in AI agent systems: an LLM-powered input-output filter, a safety agent integrated within the system, and a hierarchical delegation-based system with embedded safety checks. Our methodology involves implementing these frameworks and testing them against a set of unsafe agentic use cases, providing a comprehensive evaluation of their effectiveness in mitigating risks associated with AI agent deployment. We conclude that these frameworks can significantly strengthen the safety and security of AI agent systems, minimizing potential harmful actions or outputs. Our work contributes to the ongoing effort to create safe and reliable AI applications, particularly in automated operations, and provides a foundation for developing robust guardrails to ensure the responsible use of AI agents in real-world applications.

[AI-57] BreachSeek: A Multi-Agent Automated Penetration Tester

Link: https://arxiv.org/abs/2409.03789
Authors: Ibrahim Alshehri, Adnan Alshehri, Abdulrahman Almalki, Majed Bamardouf, Alaqsa Akbar
Keywords: modern digital environments, exposed significant gaps, penetration testing methods, emerging threats, Large Language Models
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures

Abstract:The increasing complexity and scale of modern digital environments have exposed significant gaps in traditional cybersecurity penetration testing methods, which are often time-consuming, labor-intensive, and unable to rapidly adapt to emerging threats. There is a critical need for an automated solution that can efficiently identify and exploit vulnerabilities across diverse systems without extensive human intervention. BreachSeek addresses this challenge by providing an AI-driven multi-agent software platform that leverages Large Language Models (LLMs) integrated through LangChain and LangGraph in Python. This system enables autonomous agents to conduct thorough penetration testing by identifying vulnerabilities, simulating a variety of cyberattacks, executing exploits, and generating comprehensive security reports. In preliminary evaluations, BreachSeek successfully exploited vulnerabilities in exploitable machines within local networks, demonstrating its practical effectiveness. Future developments aim to expand its capabilities, positioning it as an indispensable tool for cybersecurity professionals.

[AI-58] HSF: Defending against Jailbreak Attacks with Hidden State Filtering

Link: https://arxiv.org/abs/2409.03788
Authors: Cheng Qian, Hainan Zhang, Lei Sha, Zhiming Zheng
Keywords: ensure outputs align, avoid harmful content, LLM hidden state, content generation, jailbreak attacks
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:With the growing deployment of LLMs in daily applications like chatbots and content generation, efforts to ensure outputs align with human values and avoid harmful content have intensified. However, increasingly sophisticated jailbreak attacks threaten this alignment, aiming to induce unsafe outputs. Current defense efforts either focus on prompt rewriting or detection, which are limited in effectiveness due to the varied designs of jailbreak prompts, or on output control and detection, which are computationally expensive as they require LLM inference. Therefore, designing a pre-inference defense method that resists diverse jailbreak prompts is crucial for preventing LLM jailbreak attacks. We observe that jailbreak attacks, safe queries, and harmful queries exhibit different clustering patterns within the LLM’s hidden state representation space. This suggests that by leveraging the LLM’s hidden state representational capabilities, we can analyze the LLM’s forthcoming behavior and proactively intervene for defense. In this paper, we propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF), a lossless architectural defense mechanism that enables the model to preemptively identify and reject adversarial inputs before the inference process begins. We activate its defensive potential through an additional plugin module, effectively framing the defense task as a classification problem. Experimental results on two benchmark datasets, utilizing three different LLMs, show that HSF significantly enhances resilience against six cutting-edge jailbreak attacks. It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries, with negligible inference overhead, and outperforms defense baselines. Our code and data are available at https://anonymous.4open.science/r/Hidden-State-Filtering-8652/

[AI-59] VERA: Validation and Evaluation of Retrieval-Augmented Systems KDD2024

Link: https://arxiv.org/abs/2409.03759
Authors: Tianyu Ding, Adi Banerjee, Laurent Mombaerts, Yunhong Li, Tarik Borogovac, Juan Pablo De la Cruz Weinstein
Keywords: necessitates stringent protocols, RAG systems accuracy, ensure RAG systems, applications necessitates stringent, Retrieval-Augmented Generation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted in Workshop on Evaluation and Trustworthiness of Generative AI Models, KDD 2024

Abstract:The increasing use of Retrieval-Augmented Generation (RAG) systems in various applications necessitates stringent protocols to ensure RAG systems' accuracy, safety, and alignment with user intentions. In this paper, we introduce VERA (Validation and Evaluation of Retrieval-Augmented Systems), a framework designed to enhance the transparency and reliability of outputs from large language models (LLMs) that utilize retrieved information. VERA improves the way we evaluate RAG systems in two important ways: (1) it introduces a cross-encoder based mechanism that combines a set of multidimensional metrics into a single comprehensive ranking score, addressing the challenge of prioritizing individual metrics, and (2) it employs Bootstrap statistics on LLM-based metrics across the document repository to establish confidence bounds, ensuring the repository's topical coverage and improving the overall reliability of retrieval systems. Through several use cases, we demonstrate how VERA can strengthen decision-making processes and trust in AI applications. Our findings not only contribute to the theoretical understanding of LLM-based RAG evaluation metrics but also promote the practical implementation of responsible AI systems, marking a significant advancement in the development of reliable and transparent generative AI technologies.
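
The abstract mentions bootstrap statistics for putting confidence bounds on LLM-based metrics. As a rough illustration of the general technique only, the sketch below computes a percentile bootstrap interval for the mean of a list of per-document scores. The scores are made-up numbers, the function name `bootstrap_ci` is hypothetical, and nothing here reproduces VERA's actual procedure.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-document metric scores, for illustration only.
scores = [0.82, 0.91, 0.77, 0.88, 0.95, 0.69, 0.84, 0.90, 0.73, 0.86]
low, high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A narrow interval suggests the metric is stable across the sampled documents; a wide one suggests that more documents, or a less noisy metric, would be needed before trusting a comparison.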

[AI-60] The Impact of Scanner Domain Shift on Deep Learning Performance in Medical Imaging: an Experimental Study

Link: https://arxiv.org/abs/2409.04368
Authors: Gregory Szumel, Brian Guo, Darui Lu, Rongze Gui, Tingyu Wang, Nicholas Konz, Maciej A. Mazurowski
Keywords: scanner domain shift, protocols can differ, differ substantially, Purpose, scanner domain
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Purpose: Medical images acquired using different scanners and protocols can differ substantially in their appearance. This phenomenon, scanner domain shift, can result in a drop in the performance of deep neural networks which are trained on data acquired by one scanner and tested on another. This significant practical issue is well-acknowledged, however, no systematic study of the issue is available across different modalities and diagnostic tasks. Materials and Methods: In this paper, we present a broad experimental study evaluating the impact of scanner domain shift on convolutional neural network performance for different automated diagnostic tasks. We evaluate this phenomenon in common radiological modalities, including X-ray, CT, and MRI. Results: We find that network performance on data from a different scanner is almost always worse than on same-scanner data, and we quantify the degree of performance drop across different datasets. Notably, we find that this drop is most severe for MRI, moderate for X-ray, and quite small for CT, on average, which we attribute to the standardized nature of CT acquisition systems which is not present in MRI or X-ray. We also study how injecting varying amounts of target domain data into the training set, as well as adding noise to the training data, helps with generalization. Conclusion: Our results provide extensive experimental evidence and quantification of the extent of performance drop caused by scanner domain shift in deep learning across different modalities, with the goal of guiding the future development of robust deep learning models for medical image analysis.

[AI-61] A deep learning approach to wall-shear stress quantification: From numerical training to zero-shot experimental application

Link: https://arxiv.org/abs/2409.03933
Authors: Esther Lagemann, Julia Roeb, Steven L. Brunton, Christian Lagemann
Keywords: wall-shear stress dynamics, wall-shear stress, stress dynamics, applied research, spanning areas
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)

Abstract:The accurate quantification of wall-shear stress dynamics is of substantial importance for various applications in fundamental and applied research, spanning areas from human health to aircraft design and optimization. Despite significant progress in experimental measurement techniques and post-processing algorithms, temporally resolved wall-shear stress dynamics with adequate spatial resolution and within a suitable spatial domain remain an elusive goal. To address this gap, we introduce a deep learning architecture that ingests wall-parallel velocity fields from the logarithmic layer of turbulent wall-bounded flows and outputs the corresponding 2D wall-shear stress fields with identical spatial resolution and domain size. From a physical perspective, our framework acts as a surrogate model encapsulating the various mechanisms through which highly energetic outer-layer flow structures influence the governing wall-shear stress dynamics. The network is trained in a supervised fashion on a unified dataset comprising direct numerical simulations of statistically 1D turbulent channel and spatially developing turbulent boundary layer flows at friction Reynolds numbers ranging from 390 to 1,500. We demonstrate a zero-shot applicability to experimental velocity fields obtained from Particle-Image Velocimetry measurements and verify the physical accuracy of the wall-shear stress estimates with synchronized wall-shear stress measurements using the Micro-Pillar Shear-Stress Sensor for Reynolds numbers up to 2,000. In summary, the presented framework lays the groundwork for extracting inaccessible experimental wall-shear stress information from readily available velocity measurements and thus, facilitates advancements in a variety of experimental applications.

[AI-62] AI forecasting of higher-order wave modes of spinning binary black hole mergers

Link: https://arxiv.org/abs/2409.03833
Authors: Victoria Tiki, Kiet Pham, Eliu Huerta
Keywords: non-precessing binary black, binary black hole, higher-order wave modes, wave modes emitted, black hole mergers
Categories: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: 27 pages, 1 appendix, 10 figures

Abstract:We present a physics-inspired transformer model that predicts the non-linear dynamics of higher-order wave modes emitted by quasi-circular, spinning, non-precessing binary black hole mergers. The model forecasts the waveform evolution from the pre-merger phase through the ringdown, starting with an input time-series spanning $t \in [-5000\,\mathrm{M}, -100\,\mathrm{M})$. The merger event, defined as the peak amplitude of waveforms that include the $l = |m| = 2$ modes, occurs at $t = 0\,\mathrm{M}$. The transformer then generates predictions over the time range $t \in [-100\,\mathrm{M}, 130\,\mathrm{M}]$. We produced training, evaluation and test sets using the NRHybSur3dq8 model, considering a signal manifold defined by mass ratios $q \in [1, 8]$; spin components $s^z_{\{1,2\}} \in [-0.8, 0.8]$; modes up to $l \leq 4$, including the $(5,5)$ mode but excluding the $(4,0)$ and $(4,1)$ modes; and inclination angles $\theta \in [0, \pi]$. We trained the model on 14,440,761 waveforms, completing the training in 15 hours using 16 NVIDIA A100 GPUs in the Delta supercomputer. We used 4 H100 GPUs in the DeltaAI supercomputer to compute, within 7 hours, the overlap between ground truth and predicted waveforms using a test set of 840,000 waveforms, finding that the mean and median overlaps over the test set are 0.996 and 0.997, respectively. Additionally, we conducted interpretability studies to elucidate the waveform features utilized by our transformer model to produce accurate predictions. The scientific software used for this work is released with this manuscript.
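
The abstract reports overlaps between ground-truth and predicted waveforms but does not spell out the definition. A common simplified form is the normalized inner product of two complex time series, which the sketch below implements on toy data. This is an assumption for illustration: it uses flat noise weighting, no maximization over time shifts, and synthetic damped sinusoids in place of real waveforms, so it should not be read as the authors' exact metric.

```python
import cmath
import math


def inner(a, b):
    """Discrete inner product sum_i a_i * conj(b_i) (flat noise weighting assumed)."""
    return sum(x * y.conjugate() for x, y in zip(a, b))


def overlap(h_true, h_pred):
    """Normalized overlap in [0, 1]; abs() makes it blind to a constant phase offset."""
    norm = math.sqrt(inner(h_true, h_true).real * inner(h_pred, h_pred).real)
    return abs(inner(h_true, h_pred)) / norm


# Toy complex "waveforms": damped sinusoids sampled at 200 time steps.
times = [i * 0.1 for i in range(200)]
h_true = [cmath.exp(complex(-0.05 * t, 2.0 * t)) for t in times]
# Rescaled and phase-rotated copy: same shape, so the overlap should be 1.
h_same = [0.5 * cmath.exp(0.3j) * h for h in h_true]
# Slightly wrong frequency: the overlap drops below 1.
h_off = [cmath.exp(complex(-0.05 * t, 2.2 * t)) for t in times]

print(overlap(h_true, h_same))
print(overlap(h_true, h_off))
```

Because the measure is normalized, it rewards getting the waveform's shape and phasing right rather than its overall amplitude, which is why values close to 1 indicate a close match.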

[AI-63] Mpox Screen Lite: AI-Driven On-Device Offline Mpox Screening for Low-Resource African Mpox Emergency Response

Link: https://arxiv.org/abs/2409.03806
Authors: Yudara Kularathne, Prathapa Janitha, Sithira Ambepitiya
Keywords: highlighted critical gaps, Africa with clade, severe in Africa, Mpox, highlighted critical
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 Pages, 2 Figures, 3 Tables

Abstract:Background: The 2024 Mpox outbreak, particularly severe in Africa with clade 1b emergence, has highlighted critical gaps in diagnostic capabilities in resource-limited settings. This study aimed to develop and validate an artificial intelligence (AI)-driven, on-device screening tool for Mpox, designed to function offline in low-resource environments. Methods: We developed a YOLOv8n-based deep learning model trained on 2,700 images (900 each of Mpox, other skin conditions, and normal skin), including synthetic data. The model was validated on 360 images and tested on 540 images. A larger external validation was conducted using 1,500 independent images. Performance metrics included accuracy, precision, recall, F1-score, sensitivity, and specificity. Findings: The model demonstrated high accuracy (96%) in the final test set. For Mpox detection, it achieved 93% precision, 97% recall, and an F1-score of 95%. Sensitivity and specificity for Mpox detection were 97% and 96%, respectively. Performance remained consistent in the larger external validation, confirming the model’s robustness and generalizability. Interpretation: This AI-driven screening tool offers a rapid, accurate, and scalable solution for Mpox detection in resource-constrained settings. Its offline functionality and high performance across diverse datasets suggest significant potential for improving Mpox surveillance and management, particularly in areas lacking traditional diagnostic infrastructure. (Submission history: v1, Thu, 5 Sep 2024 11:18:34 UTC.)
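
As a quick consistency check on the reported figures, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the 93% precision and 97% recall quoted in the abstract. The helper name `f1_score` below is just a local function for this sketch, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract for Mpox detection.
f1 = f1_score(0.93, 0.97)
print(f"{f1:.4f}")  # 0.9496, which rounds to the reported 95%
```

Recall and sensitivity are the same quantity, which is why the abstract lists 97% for both.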

[AI-64] Exploratory Visual Analysis for Increasing Data Readiness in Artificial Intelligence Projects

Link: https://arxiv.org/abs/2409.03805
Authors: Mattias Tiger, Daniel Jakobsson, Anders Ynnerman, Fredrik Heintz, Daniel Jönsson
Keywords: visual analysis methods, data readiness, visual analysis, visual analysis techniques, data readiness level
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI)

Abstract:We present experiences and lessons learned from increasing data readiness of heterogeneous data for artificial intelligence projects using visual analysis methods. Increasing the data readiness level involves understanding both the data and the context in which it is used, challenges that are well suited to visual analysis. For this purpose, we contribute a mapping between data readiness aspects and visual analysis techniques suitable for different data types. We use the defined mapping to increase data readiness levels in use cases involving time-varying data, including numerical, categorical, and text. In addition to the mapping, we extend the data readiness concept to better take aspects of the task and solution into account and explicitly address distribution shifts during data collection time. We report on our experiences in using the presented visual analysis techniques to aid future artificial intelligence projects in raising the data readiness level.

Computer Vision

[CV-0] Synergy and Synchrony in Couple Dances

Link: https://arxiv.org/abs/2409.04440
Authors: Vongani Maluleke, Lea Müller, Jathushan Rajasegaran, Georgios Pavlakos, Shiry Ginosar, Angjoo Kanazawa, Jitendra Malik
Keywords: extent social interaction, social interaction influences, extent social, motion, interaction influences
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper asks to what extent social interaction influences one’s behavior. We study this in the setting of two dancers dancing as a couple. We first consider a baseline in which we predict a dancer’s future moves conditioned only on their past motion without regard to their partner. We then investigate the advantage of taking social information into account by conditioning also on the motion of their dancing partner. We focus our analysis on Swing, a dance genre with tight physical coupling for which we present an in-the-wild video dataset. We demonstrate that single-person future motion prediction in this context is challenging. Instead, we observe that prediction greatly benefits from considering the interaction partners’ behavior, resulting in surprisingly compelling couple dance synthesis results (see supp. video). Our contributions are a demonstration of the advantages of socially conditioned future motion prediction and an in-the-wild, couple dance video dataset to enable future research in this direction. Video results are available on the project website: this https URL

[CV-1] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Link: https://arxiv.org/abs/2409.04429
Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
Keywords: integrates Video, Language understanding, visual language understanding, Video, Unified foundation model
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 8 tables

Abstract: VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: first, the unified vision tower aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception; second, autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.

[CV-2] Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Link: https://arxiv.org/abs/2409.04410
Authors: Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
Keywords: auto-regressive image generation, image generation models, generation models ranging, replication of Google, auto-regressive image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project produces an open-source replication of Google’s MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., 2^18 codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet 256 × 256. Furthermore, we explore its application in plain auto-regressive models and validate scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce “next sub-token prediction” to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
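To make the idea of factorizing a 2^18-entry vocabulary into two sub-vocabularies of different sizes concrete, here is a minimal sketch that splits an 18-bit token index into a high part and a low part. The bit-splitting scheme and the 2^6 / 2^12 sizes are assumptions chosen for illustration; the abstract does not state how the factorization is actually carried out.

```python
CODEBOOK_BITS = 18                      # 2^18 codes, as stated in the abstract
LOW_BITS = 12                           # assumed: second sub-vocabulary has 2^12 entries
HIGH_BITS = CODEBOOK_BITS - LOW_BITS    # assumed: first sub-vocabulary has 2^6 entries


def factorize(token: int) -> tuple[int, int]:
    """Split one codebook index into (high, low) sub-token indices."""
    if not 0 <= token < (1 << CODEBOOK_BITS):
        raise ValueError("token index out of range")
    return token >> LOW_BITS, token & ((1 << LOW_BITS) - 1)


def merge(high: int, low: int) -> int:
    """Inverse of factorize: rebuild the original codebook index."""
    return (high << LOW_BITS) | low


high, low = factorize(200_000)
print(high, low, merge(high, low))
```

With this split, a model predicts over 64 and 4,096 classes in turn instead of a single 262,144-way softmax, which is the motivation the abstract gives for the factorization.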

[CV-3] Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation ECCV2024

Link: https://arxiv.org/abs/2409.04409
Authors: Björn Michele, Alexandre Boulch, Tuan-Hung Vu, Gilles Puy, Renaud Marlet, Nicolas Courty
Keywords: source-free unsupervised domain, unsupervised domain adaptation, semantic segmentation, tackle the challenging, source-free unsupervised
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024. Project repository: this http URL

Abstract: We tackle the challenging problem of source-free unsupervised domain adaptation (SFUDA) for 3D semantic segmentation. It amounts to performing domain adaptation on an unlabeled target domain without any access to source data; the available information is a model trained to achieve good performance on the source domain. A common issue with existing SFUDA approaches is that performance degrades after some training time, which is a by-product of an under-constrained and ill-posed problem. We discuss two strategies to alleviate this issue. First, we propose a sensible way to regularize the learning problem. Second, we introduce a novel criterion based on agreement with a reference model. It is used (1) to stop the training when appropriate and (2) as a validator to select hyperparameters without any knowledge of the target domain. Our contributions are easy to implement and readily applicable to all SFUDA methods, ensuring stable improvements over all baselines. We validate our findings on various 3D lidar settings, achieving state-of-the-art performance. The project repository (with code) is: this http URL.

[CV-4] HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR

Link: https://arxiv.org/abs/2409.04398
Authors: Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang
Keywords: rich human-human interactions, large-scale indoor-outdoor scenes, dynamic digital world, diverse human motions, Scene Capture method
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Multimedia (cs.MM)
Comments: 17 pages, 10 figures, Journal

Abstract: We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method, aimed at accurately and efficiently creating a dynamic digital world containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. By utilizing body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human motions in unconstrained space without the need for external devices and pre-built maps. This affords great flexibility and accessibility for human-centered interaction and 4D scene capturing in various environments. Because IMUs can capture spatially unrestricted human poses but are prone to drift over long periods of use, while LiDAR is stable for global localization but coarse for local positions and orientations, HiSC4D employs a joint optimization method, harmonizing all sensors and utilizing environment cues, yielding promising results for long-term capture in large scenes. To promote research of egocentric human interaction in large scenes and facilitate downstream tasks, we also present a dataset containing 8 sequences in 4 large scenes (200 to 5,000 m²), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene meshes of the environment. A variety of scenarios, such as the basketball gym and commercial street, alongside challenging human motions, such as daily greeting, one-on-one basketball playing, and tour guiding, demonstrate the effectiveness and the generalization ability of HiSC4D. The dataset and code will be made public at this http URL for research purposes.

[CV-5] Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences

Link: https://arxiv.org/abs/2409.04390
Authors: Rui Yu, Runkai Zhao, Cong Nie, Heng Wang, HuaiCheng Yan, Meng Wang
Keywords: Accurate and robust, comprehensive scene understanding, essential for comprehensive, understanding in autonomous, Accurate
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Accurate and robust LiDAR 3D object detection is essential for comprehensive scene understanding in autonomous driving. Despite its importance, LiDAR detection performance is limited by inherent constraints of point cloud data, particularly under conditions of extended distances and occlusions. Recently, temporal aggregation has been shown to significantly enhance detection accuracy by fusing multi-frame viewpoint information and enriching the spatial representation of objects. In this work, we introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We aim to improve the spatial-temporal interpretation capabilities of the LiDAR detector by incorporating a dynamic prior, generated from a non-learnable motion estimation model. Specifically, Motion-Guided Feature Aggregation (MGFA) is proposed to utilize the object trajectory from previous and future motion states to model spatial-temporal correlations as a Gaussian heatmap over a driving sequence. This motion-based heatmap then guides the temporal feature fusion, enriching the proposed object features. Moreover, we design a Dual Correlation Weighting Module (DCWM) that effectively facilitates the interaction between past and prospective frames through scene- and channel-wise feature abstraction. Finally, a cascade cross-attention-based decoder is employed to refine the 3D prediction. We have conducted experiments on the Waymo and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance with effective spatial-temporal feature learning.

[CV-6] Question-Answering Dense Video Events

Link: https://arxiv.org/abs/2409.04388
Authors: Hangyu Qin, Junbin Xiao, Angela Yao
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Abstract:Multimodal Large Language Models (MLLMs) have shown excellent performance in question-answering of single-event videos. In this paper, we present question-answering dense video events, a novel task that requires answering and grounding the dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events occurring over extended time periods. To facilitate the study, we construct DeVE-QA - a dataset featuring 78K questions about 26K events on 10.6K long videos. We then benchmark and show that existing MLLMs excelling at single-event QA struggle to perform well in DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense-events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1 percent and 3.7 percent for G(round)QA accuracy on DeVE-QA and NExT-GQA respectively.

[CV-7] Empirical Bayesian image restoration by Langevin sampling with a denoising diffusion implicit prior

Link: https://arxiv.org/abs/2409.04384
Authors: Charlesquin Kemajou Mbakam, Jean-Francois Giovannelli, Marcelo Pereyra
Keywords: Score-based diffusion methods, Score-based diffusion, pre-trained foundational prior, diffusion methods provide, provide a powerful
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
Comments: 24 pages

Abstract:Score-based diffusion methods provide a powerful strategy to solve image restoration tasks by flexibly combining a pre-trained foundational prior model with a likelihood function specified during test time. Such methods are predominantly derived from two stochastic processes: reversing Ornstein-Uhlenbeck, which underpins the celebrated denoising diffusion probabilistic models (DDPM) and denoising diffusion implicit models (DDIM), and the Langevin diffusion process. The solutions delivered by DDPM and DDIM are often remarkably realistic, but they are not always consistent with measurements because of likelihood intractability issues and the associated required approximations. Alternatively, using a Langevin process circumvents the intractable likelihood issue, but usually leads to restoration results of inferior quality and longer computing times. This paper presents a novel and highly computationally efficient image restoration method that carefully embeds a foundational DDPM denoiser within an empirical Bayesian Langevin algorithm, which jointly calibrates key model hyper-parameters as it estimates the model’s posterior mean. Extensive experimental results on three canonical tasks (image deblurring, super-resolution, and inpainting) demonstrate that the proposed approach improves on state-of-the-art strategies both in image estimation accuracy and computing time.

[CV-8] Enhancing Skin Lesion Diagnosis with Ensemble Learning

Link: https://arxiv.org/abs/2409.04381
Authors: Xiaoyi Liu, Zhou Yu, Lianghao Tan, Yafeng Yan, Ge Shi
Keywords: significant medical concern, increasingly significant medical, medical concern, varying widely, benign to cancerous
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Skin lesions are an increasingly significant medical concern, varying widely in severity from benign to cancerous. Accurate diagnosis is essential for ensuring timely and appropriate treatment. This study examines the implementation of deep learning methods to assist in the diagnosis of skin lesions using the HAM10000 dataset, which contains seven distinct types of lesions. First, we evaluated three pre-trained models: MobileNetV2, ResNet18, and VGG11, achieving accuracies of 0.798, 0.802, and 0.805, respectively. To further enhance classification accuracy, we developed ensemble models employing max voting, average voting, and stacking, resulting in accuracies of 0.803, 0.82, and 0.83. Building on the best-performing ensemble learning model, stacking, we developed our proposed model, SkinNet, which incorporates a customized architecture and fine-tuning, achieving an accuracy of 0.867 and an AUC of 0.96. This substantial improvement over individual models demonstrates the effectiveness of ensemble learning in improving skin lesion classification.
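The max voting and average voting ensembles named in the abstract above can be sketched in a few lines. This is a generic illustration of the two rules over per-model class probabilities, not the authors' implementation; the function names and the toy probabilities are assumptions, and stacking (which trains a second-stage model on the base models' outputs) is not shown.

```python
from collections import Counter


def max_voting(prob_lists: list[list[float]]) -> int:
    """Each model votes for its most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_lists]
    return Counter(votes).most_common(1)[0][0]


def average_voting(prob_lists: list[list[float]]) -> int:
    """Average the class probabilities across models, then take the argmax."""
    n_classes = len(prob_lists[0])
    mean = [sum(p[c] for p in prob_lists) / len(prob_lists) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Toy outputs of three models on one image (three classes, hypothetical values).
probs = [
    [0.55, 0.45, 0.00],   # model A: weakly prefers class 0
    [0.55, 0.45, 0.00],   # model B: weakly prefers class 0
    [0.05, 0.95, 0.00],   # model C: strongly prefers class 1
]
print(max_voting(probs), average_voting(probs))
```

The toy case shows how the two rules can disagree: two weak votes outnumber one strong vote under max voting, while averaging lets the confident model dominate.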

[CV-9] RCNet: Deep Recurrent Collaborative Network for Multi-View Low-Light Image Enhancement

Link: https://arxiv.org/abs/2409.04363
Authors: Hao Luo, Baoliang Chen, Lingyu Zhu, Peilin Chen, Shiqi Wang
Keywords: comprehensive visual experience, visual experience, perspectives would bring, comprehensive visual, Scene observation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures, under review

Abstract: Scene observation from multiple perspectives would bring a more comprehensive visual experience. However, in the context of acquiring multiple views in the dark, the highly correlated views are seriously alienated, making it challenging to improve scene understanding with auxiliary views. Recent single image-based enhancement methods may not be able to provide consistently desirable restoration performance for all views because they ignore potential feature correspondence among different views. To alleviate this issue, we make the first attempt to investigate multi-view low-light image enhancement. First, we construct a new dataset called Multi-View Low-light Triplets (MVLT), including 1,860 pairs of triple images with large illumination ranges and wide noise distribution. Each triplet is equipped with three different viewpoints towards the same scene. Second, we propose a deep multi-view enhancement framework based on the Recurrent Collaborative Network (RCNet). Specifically, in order to benefit from similar texture correspondence across different views, we design the recurrent feature enhancement, alignment and fusion (ReEAF) module, in which intra-view feature enhancement (Intra-view EN) followed by inter-view feature alignment and fusion (Inter-view AF) is performed to model the intra-view and inter-view feature propagation sequentially via multi-view collaboration. In addition, two different modules from enhancement to alignment (E2A) and from alignment to enhancement (A2E) are developed to enable the interactions between Intra-view EN and Inter-view AF, which explicitly utilize attentive feature weighting and sampling for enhancement and alignment, respectively. Experimental results demonstrate that our RCNet significantly outperforms other state-of-the-art methods. Our dataset, code, and model will be available at this https URL.

[CV-10] Connectivity-Inspired Network for Context-Aware Recognition ECCV2024

Link: https://arxiv.org/abs/2409.04360
Authors: Gianluca Carloni, Sara Colantonio
Keywords: paper is threefold, Contextual Attention Block, human visual system, extensive literature review, motivated neural network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: ECCV 2024 - HCV Workshop, Accepted for presentation, Submitted Manuscript Version (adapted to include author names, Acknowledgements, and reference DOIs): the version of the manuscript improved after peer review will appear in the Proceedings later

Abstract:The aim of this paper is threefold. We inform the AI practitioner about the human visual system with an extensive literature review; we propose a novel biologically motivated neural network for image classification; and, finally, we present a new plug-and-play module to model context awareness. We focus on the effect of incorporating circuit motifs found in biological brains to address visual recognition. Our convolutional architecture is inspired by the connectivity of human cortical and subcortical streams, and we implement bottom-up and top-down modulations that mimic the extensive afferent and efferent connections between visual and cognitive areas. Our Contextual Attention Block is simple and effective and can be integrated with any feed-forward neural network. It infers weights that multiply the feature maps according to their causal influence on the scene, modeling the co-occurrence of different objects in the image. We place our module at different bottlenecks to infuse a hierarchical context awareness into the model. We validated our proposals through image classification experiments on benchmark data and found a consistent improvement in performance and the robustness of the produced explanations via class activation. Our code is available at this https URL.

[CV-11] Serp-Mamba: Advancing High-Resolution Retinal Vessel Segmentation with Selective State-Space Model

Link: https://arxiv.org/abs/2409.04356
Authors: Hongqiu Wang, Yixian Chen, Wu Chen, Huihui Xu, Haoyu Zhao, Bin Sheng, Huazhu Fu, Guang Yang, Lei Zhu
Keywords: Scanning Laser Ophthalmoscopy, Scanning Laser, Laser Ophthalmoscopy, State Space Model, spanning degrees
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) images capture high-resolution views of the retina, typically spanning 200 degrees. Accurate segmentation of vessels in UWF-SLO images is essential for detecting and diagnosing fundus disease. Recent studies have revealed that the selective State Space Model (SSM) in Mamba performs well in modeling long-range dependencies, which is crucial for capturing the continuity of elongated vessel structures. Inspired by this, we propose the first Serpentine Mamba (Serp-Mamba) network to address this challenging task. Specifically, we recognize the intricate, varied, and delicate nature of the tubular structure of vessels. Furthermore, the high resolution of UWF-SLO images exacerbates the imbalance between the vessel and background categories. Based on the above observations, we first devise a Serpentine Interwoven Adaptive (SIA) scan mechanism, which scans UWF-SLO images along curved vessel structures in a snake-like crawling manner. This approach, consistent with vascular texture transformations, ensures the effective and continuous capture of curved vascular structure features. Second, we propose an Ambiguity-Driven Dual Recalibration (ADDR) module to address the category imbalance problem intensified by high-resolution images. Our ADDR module delineates pixels by two learnable thresholds and refines ambiguous pixels through a dual-driven strategy, thereby accurately distinguishing vessels and background regions. Experimental results on three datasets demonstrate the superior performance of our Serp-Mamba on high-resolution vessel segmentation. We also conduct a series of ablation studies to verify the impact of our designs. Our code will be released upon publication of this work.

[CV-12] Computer-Generated Sand Mixtures and Sand-based Images

Link: https://arxiv.org/abs/2409.04345
Authors: Ryan A. Subong, Alma Jean D. Subong
Keywords: creating computer-generated images, computer-generated sand-based images, verify image reproduction, computer-generated sand-based, aims to verify
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 12 pages, 8 figures, 2nd International Research Conference on Computer Engineering and Technology Education

Abstract: This paper aims to verify the effectiveness of the software implementation of the proposed algorithm in creating computer-generated images of sand mixtures using a photograph of sand as an input, and its effectiveness in converting digital pictures into sand-based images from the mixtures it generated. The method of this paper is to visually compare the photographed image of the actual mixtures with its computer-generated counterpart to verify that the mixture generation produces results as expected, and to compare the computer-generated sand-based images with their source to verify that image reproduction maintains the same image content. The results of the mixture comparison show that the actual and the computer-generated mixtures have similar overall shade and color. Still, the generated one has a rougher texture and higher contrast due to the method of inheriting visual features by pixel, not by individual sand particles. The comparison of the sand-based image and its source has demonstrated the software’s ability to maintain the essence of its contents during conversion while replacing its texture with the visual properties of the generated sand mixture. The results have shown that the software implementation of the proposed algorithm can effectively use the images of sand to generate images of its mixtures and use those mixture images to convert a digital picture into a computer-generated sand-based image.

[CV-13] How to Identify Good Superpixels for Deforestation Detection on Tropical Rainforests

Link: https://arxiv.org/abs/2409.04330
Authors: Isabela Borlido, Eduardo Bouhid, Victor Sundermann, Hugo Resende, Alvaro Luiz Fazenda, Fabio Faria, Silvio Jamil F. Guimarães
Keywords: ecological relevance due, global ecosystem, topic of significant, significant social, social and ecological
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, paper accepted for publication at the IEEE GRSL

Abstract:The conservation of tropical forests is a topic of significant social and ecological relevance due to their crucial role in the global ecosystem. Unfortunately, deforestation and degradation impact millions of hectares annually, requiring government or private initiatives for effective forest monitoring. However, identifying deforested regions in satellite images is challenging due to data imbalance, image resolution, low-contrast regions, and occlusion. Superpixel segmentation can overcome these drawbacks, reducing workload and preserving important image boundaries. However, most works for remote sensing images do not exploit recent superpixel methods. In this work, we evaluate 16 superpixel methods in satellite images to support a deforestation detection system in tropical forests. We also assess the performance of superpixel methods for the target task, establishing a relationship with segmentation methodological evaluation. According to our results, ERS, GMMSP, and DISF perform best on UE, BR, and SIRS, respectively, whereas ERS has the best trade-off with CO and Reg. In classification, SH, DISF, and ISF perform best on RGB, UMDA, and PCA compositions, respectively. According to our experiments, superpixel methods with better trade-offs between delineation, homogeneity, compactness, and regularity are more suitable for identifying good superpixels for deforestation detection tasks.

[CV-14] Advancing SEM Based Nano-Scale Defect Analysis in Semiconductor Manufacturing for Advanced IC Nodes ECCV2024

Link: https://arxiv.org/abs/2409.04310
Authors: Bappaditya Dey, Matthias Monden, Victor Blanco, Sandip Halder, Stefan De Gendt
Keywords: Automated Defect, segmenting multiple instances, defect detection module, introduce a unified, segmenting multiple
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ECCV 2024 2nd workshop on Vision-based InduStrial InspectiON (VISION)

Abstract: In this research, we introduce a unified end-to-end Automated Defect Classification-Detection-Segmentation (ADCDS) framework for classifying, detecting, and segmenting multiple instances of semiconductor defects for advanced nodes. This framework consists of two modules: (a) a defect detection module, followed by (b) a defect segmentation module. The defect detection module employs Deformable DETR to aid in the classification and detection of nano-scale defects, while the segmentation module utilizes BoxSnake. BoxSnake facilitates box-supervised instance segmentation of nano-scale defects, supported by the former module. This simplifies the process by eliminating the laborious requirement for ground-truth pixel-wise mask annotation by human experts, which is typically associated with training conventional segmentation models. We have evaluated the performance of our ADCDS framework using two distinct process datasets from real wafers, namely ADI and AEI, specifically focusing on line-space patterns. We have demonstrated the applicability and significance of our proposed methodology, particularly in the nano-scale segmentation and generation of binary defect masks, using the challenging ADI SEM dataset where ground-truth pixel-wise segmentation annotations were unavailable. Furthermore, we have presented a comparative analysis of our proposed framework against previous approaches to demonstrate its effectiveness. Our proposed framework achieved an overall mAP@IoU0.5 of 72.19 for detection and 78.86 for segmentation on the ADI dataset. Similarly, for the AEI dataset, these metrics were 90.38 for detection and 95.48 for segmentation. Thus, our proposed framework effectively fulfils the requirements of advanced defect analysis while addressing significant constraints.
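The mAP@IoU0.5 figures in the abstract above rest on an intersection-over-union test between predicted and ground-truth regions. The sketch below shows that test for axis-aligned boxes only; the function names are illustrative, and the segmentation scores would use the same ratio computed over masks rather than boxes.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_iou(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as a match when its IoU reaches the threshold."""
    return box_iou(pred, truth) >= threshold


print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(matches_at_iou((0, 0, 10, 10), (2, 0, 12, 10)))
```

Average precision is then computed per class from the ranked list of matched and unmatched predictions, and mAP is the mean over classes.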

[CV-15] FS-MedSAM2: Exploring the Potential of SAM2 for Few-Shot Medical Image Segmentation without Fine-tuning

Link: https://arxiv.org/abs/2409.04298
Authors: Yunhao Bai, Qinji Yu, Boxiang Yun, Dakai Jin, Yingda Xia, Yan Wang
Keywords: recently demonstrated exceptional, demonstrated exceptional performance, recently demonstrated, demonstrated exceptional, exceptional performance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures

Abstract: The Segment Anything Model 2 (SAM2) has recently demonstrated exceptional performance in zero-shot prompt segmentation for natural images and videos. However, it faces significant challenges when applied to medical images. Since its release, many attempts have been made to adapt SAM2’s segmentation capabilities to the medical imaging domain. These efforts typically involve using a substantial amount of labeled data to fine-tune the model’s weights. In this paper, we explore SAM2 from a different perspective by making full use of its trained memory attention module and its ability to process mask prompts. We introduce FS-MedSAM2, a simple yet effective framework that enables SAM2 to achieve superior medical image segmentation in a few-shot setting, without the need for fine-tuning. Our framework outperforms the current state of the art on two publicly available medical image datasets. The code is available at this https URL.

[CV-16] Cycle Pixel Difference Network for Crisp Edge Detection

Link: https://arxiv.org/abs/2409.04272
Authors: Changsong Liu, Wei Zhang, Yanyan Liu, Mingyang Li, Wenlin Li, Yimeng Fan, Xiangnan Bai, Liang Zhangd
Keywords: garnered increasing attention, computer vision, increasing attention, fundamental task, task in computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Edge detection, as a fundamental task in computer vision, has garnered increasing attention. The advent of deep learning has significantly advanced this field. However, recent deep learning-based methods which rely on large-scale pre-trained weights cannot be trained from scratch, with very limited research addressing this issue. This paper proposes a novel cycle pixel difference convolution (CPDC), which effectively integrates image gradient information with modern convolution operations. Based on the CPDC, we develop a U-shape encoder-decoder model named CPD-Net, which is a purely end-to-end network. Additionally, to address the issue of edge thickness produced by most existing methods, we construct a multi-scale information enhancement module (MSEM) to enhance the discriminative ability of the model, thereby generating crisp and clean contour maps. Comprehensive experiments conducted on three standard benchmarks demonstrate that our method achieves competitive performance on the BSDS500 dataset (ODS=0.813), NYUD-V2 (ODS=0.760), and BIPED dataset (ODS=0.898). Our approach provides a novel perspective for addressing these challenges in edge detection.

[CV-17] Hybrid Cost Volume for Memory-Efficient Optical Flow

Link: https://arxiv.org/abs/2409.04243
Authors: Yang Zhao, Gangwei Xu, Gang Wu
Keywords: cost volumes, cost, Hybrid Cost Volume, Current, all-pairs cost volumes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract: Current state-of-the-art flow methods are mostly based on dense all-pairs cost volumes. However, as image resolution increases, the computational and spatial complexity of constructing these cost volumes grows at a quartic rate, making these methods impractical for high-resolution images. In this paper, we propose a novel Hybrid Cost Volume for memory-efficient optical flow, named HCV. To construct HCV, we first propose a Top-k strategy to separate the 4D cost volume into two global 3D cost volumes. These volumes significantly reduce memory usage while retaining a substantial amount of matching information. We further introduce a local 4D cost volume with a local search space to supplement the local information for HCV. Based on HCV, we design a memory-efficient optical flow network, named HCVFlow. Compared to the recurrent flow methods based on all-pairs cost volumes, our HCVFlow significantly reduces memory consumption while ensuring high accuracy. We validate the effectiveness and efficiency of our method on the Sintel and KITTI datasets and real-world 4K (2160×3840) resolution images. Extensive experiments show that our HCVFlow has very low memory usage and outperforms other memory-efficient methods in terms of accuracy. The code is publicly available at this https URL.
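The quartic growth mentioned in the abstract above follows from an all-pairs cost volume storing one value for every pair of feature-map positions, that is (H·W)² entries. The sketch below estimates the size for a 4K frame. The 1/8 feature resolution and 4-byte floats are assumptions in the style of common recurrent flow networks, not figures from this paper.

```python
def all_pairs_entries(height: int, width: int, downsample: int = 8) -> int:
    """Number of entries in a dense all-pairs cost volume at reduced resolution."""
    h, w = height // downsample, width // downsample
    return (h * w) ** 2


def cost_volume_gib(height: int, width: int, downsample: int = 8,
                    bytes_per_entry: int = 4) -> float:
    """Memory of the cost volume in GiB (assumes float32 entries by default)."""
    return all_pairs_entries(height, width, downsample) * bytes_per_entry / 2**30


hd = all_pairs_entries(1080, 1920)    # Full HD frame
uhd = all_pairs_entries(2160, 3840)   # 4K frame, as in the abstract
print(f"4K all-pairs volume: {cost_volume_gib(2160, 3840):.1f} GiB")
print(f"growth from HD to 4K: x{uhd // hd}")
```

Doubling both image dimensions multiplies the entry count by 16, which is why dense all-pairs volumes become impractical at 4K and why the abstract's Top-k split into 3D volumes targets memory first.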

[CV-18] Calibration of Network Confidence for Unsupervised Domain Adaptation Using Estimated Accuracy

Link: https://arxiv.org/abs/2409.04241
Authors: Coby Penso, Jacob Goldberger
Keywords: target domain, target domain makes, calibrating network confidence, target, domain
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Abstract:This study addresses the problem of calibrating network confidence while adapting a model that was originally trained on a source domain to a target domain using unlabeled samples from the target domain. The absence of labels from the target domain makes it impossible to directly calibrate the adapted network on the target domain. To tackle this challenge, we introduce a calibration procedure that relies on estimating the network’s accuracy on the target domain. The network accuracy is first computed on the labeled source data and then is modified to represent the actual accuracy of the model on the target domain. The proposed algorithm calibrates the prediction confidence directly in the target domain by minimizing the disparity between the estimated accuracy and the computed confidence. The experimental results show that our method significantly outperforms existing methods, which rely on importance weighting, across several standard datasets.
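
The idea of matching computed confidence to an estimated accuracy can be sketched as follows. This is not the authors' algorithm: it assumes temperature scaling as the calibration map, takes the estimated target accuracy as a given number, and uses made-up logits; every function name is hypothetical. The sketch searches for the temperature at which the mean maximum softmax probability on unlabeled target samples equals the estimated accuracy.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_confidence(batch_logits, temperature):
    """Average of the top softmax probability over a batch."""
    return sum(max(softmax(l, temperature)) for l in batch_logits) / len(batch_logits)


def fit_temperature(batch_logits, estimated_accuracy, lo=0.05, hi=100.0, steps=60):
    """Bisection in log space; mean confidence decreases as temperature grows."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)
        if mean_confidence(batch_logits, mid) > estimated_accuracy:
            lo = mid  # still over-confident: raise the temperature
        else:
            hi = mid
    return math.sqrt(lo * hi)


# Made-up logits for four unlabeled target samples (illustrative only).
target_logits = [[4.0, 0.5, 0.1], [3.0, 2.5, 0.2], [0.3, 5.0, 0.4], [1.0, 1.2, 3.5]]
estimated_accuracy = 0.70  # assumed output of some accuracy estimator

temperature = fit_temperature(target_logits, estimated_accuracy)
calibrated = mean_confidence(target_logits, temperature)
print(f"temperature={temperature:.3f}, mean confidence={calibrated:.3f}")
```

No target labels are used anywhere in the fit, which is the point of the setting described in the abstract; the quality of the result depends entirely on how good the accuracy estimate is.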

[CV-19] UniDet3D: Multi-dataset Indoor 3D Object Detection

Link: https://arxiv.org/abs/2409.04234
Authors: Maksim Kolodiazhnyi, Anna Vorontsova, Matvey Skripkin, Danila Rukhovich, Anton Konushin
Keywords: Growing customer demand, attracted considerable attention, Growing customer, object detection, object detection model
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Growing customer demand for smart solutions in robotics and augmented reality has attracted considerable attention to 3D object detection from point clouds. Yet, existing indoor datasets taken individually are too small and insufficiently diverse to train a powerful and general 3D object detection model. In the meantime, more general approaches utilizing foundation models are still inferior in quality to those based on supervised training for a specific task. In this work, we propose UniDet3D, a simple yet effective 3D object detection model, which is trained on a mixture of indoor datasets and is capable of working in various indoor environments. By unifying different label spaces, UniDet3D enables learning a strong representation across multiple datasets through a supervised joint training scheme. The proposed network architecture is built upon a vanilla transformer encoder, making it easy to run, customize and extend the prediction pipeline for practical use. Extensive experiments demonstrate that UniDet3D obtains significant gains over existing 3D object detection methods in 6 indoor benchmarks: ScanNet (+1.1 mAP50), ARKitScenes (+19.4 mAP25), S3DIS (+9.1 mAP50), MultiScan (+9.3 mAP50), 3RScan (+3.2 mAP50), and ScanNet++ (+2.7 mAP50). Code is available at this https URL.

[CV-20] MpoxMamba: A Grouped Mamba-based Lightweight Hybrid Network for Mpox Detection

Link: https://arxiv.org/abs/2409.04218
Authors: Yubiao Yue, Jun Xue, Haihuang Liang, Zhenzhang Li, Yufeng Wang
Keywords: World Health Organization, Health Organization, World Health, public health emergency, health emergency
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Due to the lack of effective mpox detection tools, the mpox virus continues to spread worldwide and has once again been declared a public health emergency of international concern by the World Health Organization. Deep learning-based mpox detection tools are crucial for alleviating mpox outbreaks. However, existing methods have difficulty in achieving a good trade-off between detection performance, parameter size, and model complexity, which is crucial for practical applications and widespread deployment, especially in resource-limited scenarios. Given the success of Mamba in modeling long-range dependencies and its linear complexity, we propose a lightweight hybrid architecture called MpoxMamba. MpoxMamba utilizes deep separable convolutions to extract local feature representations in mpox skin lesions, and greatly enhances the model’s ability to model the global contextual information by grouped Mamba modules. Experimental results on two widely recognized mpox datasets demonstrate that MpoxMamba outperforms existing mpox detection methods and state-of-the-art lightweight models. We also developed a web-based online application to provide free mpox detection services to the public in the epidemic areas (this http URL). The source codes of MpoxMamba are available at this https URL.

[CV-21] Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver

Link: https://arxiv.org/abs/2409.04214
Authors: Zeren Zhang, Jo-Ku Cheng, Jingyang Deng, Lu Tian, Jinwen Ma, Ziran Qin, Xiaokai Zhang, Na Zhu, Tuo Leng
Keywords: Mathematical reasoning remains, Mathematical reasoning, Geometry Problem Solver, Enhanced Geometry Problem, geometry problem
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Mathematical reasoning remains an ongoing challenge for AI models, especially for geometry problems that require both linguistic and visual signals. As the vision encoders of most MLLMs are trained on natural scenes, they often struggle to understand geometric diagrams, performing no better in geometry problem solving than LLMs that only process text. This limitation is amplified by the lack of effective methods for representing geometric relationships. To address these issues, we introduce the Diagram Formalization Enhanced Geometry Problem Solver (DFE-GPS), a new framework that integrates visual features, geometric formal language, and natural language representations. We propose a novel synthetic data approach and create a large-scale geometric dataset, SynthGeo228K, annotated with both formal and natural language captions, designed to enhance the vision encoder for a better understanding of geometric structures. Our framework improves MLLMs’ ability to process geometric diagrams and extends their application to open-ended tasks on the formalgeo7k dataset.

[CV-22] Learning to Learn Transferable Generative Attack for Person Re-Identification

Link: https://arxiv.org/abs/2409.04208
Authors: Yuan Bian, Min Liu, Xueping Wang, Yunfeng Ma, Yaonan Wang
Keywords: Deep learning-based person, learning-based person re-identification, Deep learning-based, deep networks, person re-identification
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning-based person re-identification (re-id) models are widely employed in surveillance systems and inevitably inherit the vulnerability of deep networks to adversarial attacks. Existing attacks merely consider cross-dataset and cross-model transferability, ignoring the cross-test capability to perturb models trained in different domains. To powerfully examine the robustness of real-world re-id models, the Meta Transferable Generative Attack (MTGA) method is proposed, which adopts meta-learning optimization so that the generative attacker produces highly transferable adversarial examples by learning comprehensively simulated transfer-based cross-model&dataset&test black-box meta attack tasks. Specifically, cross-model&dataset black-box attack tasks are first mimicked by selecting different re-id models and datasets for meta-train and meta-test attack processes. As different models may focus on different feature regions, the Perturbation Random Erasing module is further devised to prevent the attacker from learning to only corrupt model-specific features. To help the attacker acquire cross-test transferability, the Normalization Mix strategy is introduced to imitate diverse feature embedding spaces by mixing multi-domain statistics of target models. Extensive experiments show the superiority of MTGA: in cross-model&dataset and cross-model&dataset&test attacks, MTGA outperforms the SOTA methods by 21.5% and 11.3% on mean mAP drop rate, respectively. The code of MTGA will be released after the paper is accepted.

[CV-23] Introducing Gating and Context into Temporal Action Detection ECCV2024

Link: https://arxiv.org/abs/2409.04205
Authors: Aglind Reka, Diana Laura Borza, Dominick Reilly, Michal Balazia, Francois Bremond
Keywords: Temporal Action Detection, variable action durations, Action Detection, action durations, remains challenging due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the ECCV 2024 ABAW Workshop

Abstract:Temporal Action Detection (TAD), the task of localizing and classifying actions in untrimmed video, remains challenging due to action overlaps and variable action durations. Recent findings suggest that TAD performance is dependent on the structural design of transformers rather than on the self-attention mechanism. Building on this insight, we propose a refined feature extraction process through lightweight, yet effective operations. First, we use a local branch that employs parallel convolutions with varying window sizes to capture both fine-grained and coarse-grained temporal features. This branch incorporates a gating mechanism to select the most relevant features. Second, we introduce a context branch that uses boundary frames as key-value pairs to analyze their relationship with the central frame through cross-attention. The proposed method captures temporal dependencies and improves contextual understanding. Evaluations of the gating mechanism and context branch on challenging datasets (THUMOS14 and EPIC-KITCHENS-100) show a consistent improvement over the baseline and existing methods.

[CV-24] GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Link: https://arxiv.org/abs/2409.04196
Authors: Lorenza Prospero, Abdullah Hamdi, Joao F. Henriques, Christian Rupprecht
Keywords: Reconstructing realistic, human-computer interfaces, creative industries, significant applications, applications in creative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Reconstructing realistic 3D human models from monocular images has significant applications in creative industries, human-computer interfaces, and healthcare. We base our work on 3D Gaussian Splatting (3DGS), a scene representation composed of a mixture of Gaussians. Predicting such mixtures for a human from a single input image is challenging, as it is a non-uniform density (with a many-to-one relationship with input pixels) with strict physical constraints. At the same time, it needs to be flexible to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) can provide an adequate density and approximate initial position for Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, as well as the other Gaussians’ attributes and the SMPL parameters. We show empirically that this combination (using only multi-view supervision) can achieve fast inference of 3D human models from a single image without test-time optimization, expensive diffusion models, or 3D points supervision. We also show that it can improve 3D pose estimation by better fitting human models that account for clothes and other variations. The code is available on the project website this https URL .

[CV-25] LITE: A Paradigm Shift in Multi-Object Tracking with Efficient ReID Feature Integration ICONIP-2024

Link: https://arxiv.org/abs/2409.04187
Authors: Jumabek Alikhanov, Dilshod Obidov, Hakil Kim
Keywords: Lightweight Integrated Tracking-Feature, Integrated Tracking-Feature Extraction, Lightweight Integrated, Integrated Tracking-Feature, paradigm is introduced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures, to be published in ICONIP-2024

Abstract:The Lightweight Integrated Tracking-Feature Extraction (LITE) paradigm is introduced as a novel multi-object tracking (MOT) approach. It enhances ReID-based trackers by eliminating inference, pre-processing, post-processing, and ReID model training costs. LITE uses real-time appearance features without compromising speed. By integrating appearance feature extraction directly into the tracking pipeline using standard CNN-based detectors such as YOLOv8m, LITE demonstrates significant performance improvements. The simplest implementation of LITE on top of classic DeepSORT achieves a HOTA score of 43.03% at 28.3 FPS on the MOT17 benchmark, making it twice as fast as DeepSORT on MOT17 and four times faster on the more crowded MOT20 dataset, while maintaining similar accuracy. Additionally, a new evaluation framework for tracking-by-detection approaches reveals that conventional trackers like DeepSORT remain competitive with modern state-of-the-art trackers when evaluated under fair conditions. The code will be available post-publication at this https URL.

[CV-26] Reprojection Errors as Prompts for Efficient Scene Coordinate Regression ECCV2024

Link: https://arxiv.org/abs/2409.04178
Authors: Ting-Ru Liu, Hsuan-Kung Yang, Jou-Min Liu, Chun-Wei Huang, Tsung-Chih Chiang, Quan Kong, Norimasa Kobori, Chun-Yi Lee
Keywords: Scene coordinate regression, accurate visual localization, Scene coordinate, existing SCR approaches, coordinate regression
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV2024

Abstract:Scene coordinate regression (SCR) methods have emerged as a promising area of research due to their potential for accurate visual localization. However, many existing SCR approaches train on samples from all image regions, including dynamic objects and texture-less areas. Utilizing these areas for optimization during training can potentially hamper the overall performance and efficiency of the model. In this study, we first perform an in-depth analysis to validate the adverse impacts of these areas. Drawing inspiration from our analysis, we then introduce an error-guided feature selection (EGFS) mechanism, in tandem with the use of the Segment Anything Model (SAM). This mechanism seeds low reprojection areas as prompts and expands them into error-guided masks, and then utilizes these masks to sample points and filter out problematic areas in an iterative manner. The experiments demonstrate that our method outperforms existing SCR approaches that do not rely on 3D information on the Cambridge Landmarks and Indoor6 datasets.

[CV-27] Secure Traffic Sign Recognition: An Attention-Enabled Universal Image Inpainting Mechanism against Light Patch Attacks

Link: https://arxiv.org/abs/2409.04133
Authors: Hangcheng Cao, Longzhi Yuan, Guowen Xu, Ziyang He, Zhengru Fang, Yuguang Fang
Keywords: make informed decisions, sign recognition systems, recognition systems play, sign recognition, play a crucial
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Abstract:Traffic sign recognition systems play a crucial role in assisting drivers to make informed decisions while driving. However, due to the heavy reliance on deep learning technologies, particularly for future connected and autonomous driving, these systems are susceptible to adversarial attacks that pose significant safety risks to both personal and public transportation. Notably, researchers recently identified a new attack vector to deceive sign recognition systems: projecting well-designed adversarial light patches onto traffic signs. In comparison with traditional adversarial stickers or graffiti, these emerging light patches exhibit heightened aggression due to their ease of implementation and outstanding stealthiness. To effectively counter this security threat, we propose a universal image inpainting mechanism, namely, SafeSign. It relies on attention-enabled multi-view image fusion to repair traffic signs contaminated by adversarial light patches, thereby ensuring accurate sign recognition. Here, we initially explore the fundamental impact of malicious light patches on the local and global feature spaces of authentic traffic signs. Then, we design a binary mask-based U-Net image generation pipeline outputting diverse contaminated sign patterns, to provide our image inpainting model with needed training data. Following this, we develop an attention mechanism-enabled neural network to jointly utilize the complementary information from multi-view images to repair contaminated signs. Finally, extensive experiments are conducted to evaluate SafeSign’s effectiveness in resisting potential light patch-based attacks, bringing an average accuracy improvement of 54.8% in three widely-used sign recognition models.

[CV-28] Confidence-Aware Document OCR Error Detection

Link: https://arxiv.org/abs/2409.04117
Authors: Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier
Keywords: Optical Character Recognition, Optical Character, Character Recognition, impact subsequent applications, OCR confidence scores
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.

[CV-29] Smooth-edged Perturbations Improve Perturbation-based Image Explanations

Link: https://arxiv.org/abs/2409.04116
Authors: Gustav Grund Pihlgren, Kary Främling
Keywords: Perturbation-based post-hoc image, explain image prediction, image prediction models, Perturbation-based post-hoc, post-hoc image explanation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This manuscript has been submitted to NLDL 2025

Abstract:Perturbation-based post-hoc image explanation methods are commonly used to explain image prediction models by perturbing parts of the input to measure how those parts affect the output. Due to the intractability of perturbing each pixel individually, images are typically attributed to larger segments. The Randomized Input Sampling for Explanations (RISE) method solved this issue by using smooth perturbation masks. While this method has proven effective and popular, it has not been investigated which parts of the method are responsible for its success. This work tests many combinations of mask sampling, segmentation techniques, smoothing, and attribution calculation. The results show that the RISE-style pixel attribution is beneficial to all evaluated methods. Furthermore, it is shown that attribution calculation is the least impactful parameter. The implementation of this work is available online: this https URL.
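
To illustrate what "perturbing parts of the input to measure how those parts affect the output" means, here is a minimal segment-occlusion sketch. It is deliberately the simple hard-edged variant, not RISE and not the paper's implementation: RISE instead samples many random masks and smooths them. The toy model, the square-block segmentation, and all names are hypothetical.

```python
def toy_model(image):
    """Hypothetical classifier score: mean brightness of the top-left 2x2 block."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


def occlusion_attribution(image, model, block=2, fill=0.0):
    """Attribute the score to square segments by blanking one segment at a time."""
    height, width = len(image), len(image[0])
    base_score = model(image)
    attribution = [[0.0] * width for _ in range(height)]
    for r0 in range(0, height, block):
        for c0 in range(0, width, block):
            rows = range(r0, min(r0 + block, height))
            cols = range(c0, min(c0 + block, width))
            perturbed = [row[:] for row in image]  # copy, then blank one segment
            for r in rows:
                for c in cols:
                    perturbed[r][c] = fill
            drop = base_score - model(perturbed)  # large drop = important segment
            for r in rows:
                for c in cols:
                    attribution[r][c] = drop
    return attribution


image = [[1.0] * 4 for _ in range(4)]
attribution = occlusion_attribution(image, toy_model)
for row in attribution:
    print(row)
```

Every pixel in a segment receives the same value here, which is the coarse, blocky attribution that smooth masks are meant to refine into per-pixel scores.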

[CV-30] MixNet: Joining Force of Classical and Modern Approaches Toward the Comprehensive Pipeline in Motor Imagery EEG Classification

Link: https://arxiv.org/abs/2409.04104
Authors: Phairot Autthasan, Rattanaphon Chaisaen, Huy Phan, Maarten De Vos, Theerawit Wilaiprasitporn
Keywords: impacted motor imagery, significantly impacted motor, Recent advances, based brain-computer interface, motor imagery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: Supplementary materials and source codes are available on-line at this https URL

Abstract:Recent advances in deep learning (DL) have significantly impacted motor imagery (MI)-based brain-computer interface (BCI) systems, enhancing the decoding of electroencephalography (EEG) signals. However, most studies struggle to identify discriminative patterns across subjects during MI tasks, limiting MI classification performance. In this article, we propose MixNet, a novel classification framework designed to overcome this limitation by utilizing spectral-spatial signals from MI data, along with a multitask learning architecture named MIN2Net, for classification. Here, the spectral-spatial signals are generated using the filter-bank common spatial patterns (FBCSPs) method on MI data. Since the multitask learning architecture is used for the classification task, the learning in each task may exhibit different generalization rates and potential overfitting across tasks. To address this issue, we implement adaptive gradient blending, simultaneously regulating multiple loss weights and adjusting the learning pace for each task based on its generalization/overfitting tendencies. Experimental results on six benchmark data sets of different data sizes demonstrate that MixNet consistently outperforms all state-of-the-art algorithms in subject-dependent and -independent settings. Finally, the low-density EEG MI classification results show that MixNet outperforms all state-of-the-art algorithms, offering promising implications for Internet of Things (IoT) applications, such as lightweight and portable EEG wearable devices based on low-density montages.

[CV-31] UNIT: Unifying Image and Text Recognition in One Vision Encoder

Link: https://arxiv.org/abs/2409.04095
Authors: Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu
Keywords: simultaneously support text, Vision Transformers, human visual recognition, support text recognition, typically excel
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a novel training framework aimed at UNifying Image and Text recognition within a single model. Starting with a vision encoder pre-trained with image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities. The training process comprises two stages: intra-scale pretraining and inter-scale finetuning. During intra-scale pretraining, UNIT learns unified representations from multi-scale inputs, where images and documents are at their commonly used resolution, to enable fundamental recognition capability. In the inter-scale finetuning stage, the model introduces scale-exchanged data, featuring images and documents at resolutions different from the most commonly used ones, to enhance its scale robustness. Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment. Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and DocQA) while maintaining the performances on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities.

[CV-32] Introducing a Class-Aware Metric for Monocular Depth Estimation: An Automotive Perspective ECCV

Link: https://arxiv.org/abs/2409.04086
Authors: Tim Bader, Leon Eisemann, Adrian Pogorzelski, Namrata Jangid, Attila-Balazs Kis
Keywords: increasing accuracy reports, depth estimation models, monocular depth estimation, estimation models lead, metric monocular depth
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at the European Conference on Computer Vision (ECCV) 2024 Workshop on Out Of Distribution Generalization in Computer Vision

Abstract:The increasing accuracy reports of metric monocular depth estimation models lead to a growing interest from the automotive domain. Current model evaluations do not provide deeper insights into the models’ performance, also in relation to safety-critical or unseen classes. Within this paper, we present a novel approach for the evaluation of depth estimation models. Our proposed metric leverages three components, a class-wise component, an edge and corner image feature component, and a global consistency retaining component. Classes are further weighted on their distance in the scene and on criticality for automotive applications. In the evaluation, we present the benefits of our metric through comparison to classical metrics, class-wise analytics, and the retrieval of critical situations. The results show that our metric provides deeper insights into model results while fulfilling safety-critical requirements. We release the code and weights on the following repository: this https URL.

[CV-33] SDformerFlow: Spatiotemporal swin spikeformer for event-based optical flow estimation

Link: https://arxiv.org/abs/2409.04082
Authors: Yi Tian, Juan Andrade-Cetto
Keywords: event streams capturing, cameras generate asynchronous, Event cameras generate, sparse event streams, light intensity
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Event cameras generate asynchronous and sparse event streams capturing changes in light intensity. They offer significant advantages over conventional frame-based cameras, such as a higher dynamic range and a much faster data rate, making them particularly useful in scenarios involving fast motion or challenging lighting conditions. Spiking neural networks (SNNs) share similar asynchronous and sparse characteristics and are well-suited for processing data from event cameras. Inspired by the potential of transformers and spike-driven transformers (spikeformers) in other computer vision tasks, we propose two solutions for fast and robust optical flow estimation for event cameras: STTFlowNet and SDformerFlow. STTFlowNet adopts a U-shaped artificial neural network (ANN) architecture with spatiotemporal shifted window self-attention (swin) transformer encoders, while SDformerFlow presents its fully spiking counterpart, incorporating swin spikeformer encoders. Furthermore, we present two variants of the spiking version with different neuron models. Our work is the first to make use of spikeformers for dense optical flow estimation. We conduct end-to-end training for all models using supervised learning. Our results yield state-of-the-art performance among SNN-based event optical flow methods on both the DSEC and MVSEC datasets, and show significant reduction in power consumption compared to the equivalent ANNs.

[CV-34] Site-Specific Color Features of Green Coffee Beans

Link: https://arxiv.org/abs/2409.04068
Authors: Shu-Min Tan, Shih-Hsun Hung, Je-Chiang Tsai
Keywords: valuable primary commodities, green coffee beans, primary commodities, valuable primary, beans
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures

Abstract:Coffee is one of the most valuable primary commodities. Despite this, the common selection technique for green coffee beans relies on visual inspection by personnel, which is labor-intensive and subjective. Therefore, an efficient way to evaluate the quality of beans is needed. In this paper, we demonstrate a site-independent approach to find site-specific color features of the seed coat in qualified green coffee beans. We then propose two evaluation schemes for green coffee beans based on this site-specific color feature of qualified beans. Due to the site-specific properties of these color features, machine learning classifiers indicate that compared with the existing evaluation schemes of beans, our evaluation schemes have the advantages of being simple, having lower computational cost, and having universal applicability. Finally, this site-specific color feature can distinguish qualified beans from different growing sites. Moreover, this function can prevent cheating in the coffee business and is unique to our evaluation scheme of beans.

[CV-35] D4: Text-guided diffusion model-based domain adaptive data augmentation for vineyard shoot detection

Link: https://arxiv.org/abs/2409.04060
Authors: Kentaro Hirahara, Chikahito Nakane, Hajime Ebisawa, Tsuyoshi Kuroda, Yohei Iwaki, Tomoyoshi Utsumi, Yuichiro Nomura, Makoto Koike, Hiroshi Mineno
Keywords: training data, data augmentation method, plant phenotyping, gaining attention, generative data augmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In an agricultural field, plant phenotyping using object detection models is gaining attention. However, collecting the training data necessary to create generic and high-precision models is extremely challenging due to the difficulty of annotation and the diversity of domains. Furthermore, it is difficult to transfer training data across different crops, and although machine learning models effective for specific environments, conditions, or crops have been developed, they cannot be widely applied in actual fields. In this study, we propose a generative data augmentation method (D4) for vineyard shoot detection. D4 uses a pre-trained text-guided diffusion model based on a large number of original images culled from video data collected by unmanned ground vehicles or other means, and a small number of annotated datasets. The proposed method generates new annotated images with background information adapted to the target domain while retaining annotation information necessary for object detection. In addition, D4 overcomes the lack of training data in agriculture, including the difficulty of annotation and diversity of domains. We confirmed that this generative data augmentation method improved the mean average precision by up to 28.65% for the BBox detection task and the average precision by up to 13.73% for the keypoint detection task for vineyard shoot detection. Our generative data augmentation method D4 is expected to simultaneously solve the cost and domain diversity issues of training data generation in agriculture and improve the generalization performance of detection models.

[CV-36] COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes AAAI-25

Link: https://arxiv.org/abs/2409.04053
Authors: Koen Kraaijveld, Yifan Jiang, Kaixin Ma, Filip Ilievski
Keywords: reasoning techniques, catalyzed the development, development of reasoning, focused on vertical, VQA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 10 figures, submitted to AAAI-25

Abstract:While visual question-answering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem-solving also necessitates lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formulate visual lateral thinking as a multiple-choice question-answering task and describe a three-step taxonomy-driven methodology for instantiating task examples. Then, we develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compounds and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four answer candidates. While the SotA vision-language models (VLMs) achieve decent performance, our evaluation demonstrates a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to self-generate such representations at the right level of abstraction.

[CV-37] On Evaluation of Vision Datasets and Models using Human Competency Frameworks

Link: https://arxiv.org/abs/2409.04041
Authors: Rahul Ramachandran, Tejal Kulkarni, Charchit Sharma, Deepak Vijaykeerthy, Vineeth N Balasubramanian
Keywords: leaderboards relying solely, Evaluating models, challenging task, Item Response Theory, computer vision remains
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Evaluating models and datasets in computer vision remains a challenging task, with most leaderboards relying solely on accuracy. While accuracy is a popular metric for model evaluation, it provides only a coarse assessment by considering a single model’s score on all dataset items. This paper explores Item Response Theory (IRT), a framework that infers interpretable latent parameters for an ensemble of models and each dataset item, enabling richer evaluation and analysis beyond the single accuracy number. Leveraging IRT, we assess model calibration, select informative data subsets, and demonstrate the usefulness of its latent parameters for analyzing and comparing models and datasets in computer vision.
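
As a rough illustration of the kind of latent parameters IRT works with, the sketch below uses the two-parameter logistic (2PL) form, where the chance that a model with ability `theta` answers an item correctly depends on the item's difficulty and discrimination. The abstract does not say which IRT variant the paper fits, so the 2PL choice, the function names, and the item values are all assumptions made for illustration.

```python
import math

def p_correct(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """2PL IRT: probability that a model with ability `theta` gets an item right."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def item_information(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Fisher information of an item at ability `theta` under the 2PL form."""
    p = p_correct(theta, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)

# Hypothetical dataset items: name -> (difficulty, discrimination)
items = {"easy": (-1.0, 1.0), "hard": (2.0, 1.5)}
for name, (b, a) in items.items():
    print(name, round(p_correct(0.5, b, a), 3), round(item_information(0.5, b, a), 3))
```

Items with high information near the abilities of the models being compared are the ones worth keeping when selecting an informative subset, which is one way to read the "select informative data subsets" claim in the abstract.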

[CV-38] PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation

Link: https://arxiv.org/abs/2409.04038
Authors: Tianqi Wei, Zhi Chen, Xin Yu, Scott Chapman, Paul Melloy, Zi Huang
Keywords: pose significant threats, Plant, diseases pose significant, pose significant, significant threats
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Plant diseases pose significant threats to agriculture. It necessitates proper diagnosis and effective treatment to safeguard crop yields. To automate the diagnosis process, image segmentation is usually adopted for precisely identifying diseased regions, thereby advancing precision agriculture. Developing robust image segmentation models for plant diseases demands high-quality annotations across numerous images. However, existing plant disease datasets typically lack segmentation labels and are often confined to controlled laboratory settings, which do not adequately reflect the complexity of natural environments. Motivated by this fact, we established PlantSeg, a large-scale segmentation dataset for plant diseases. PlantSeg distinguishes itself from existing datasets in three key aspects. (1) Annotation type: Unlike the majority of existing datasets that only contain class labels or bounding boxes, each image in PlantSeg includes detailed and high-quality segmentation masks, associated with plant types and disease names. (2) Image source: Unlike typical datasets that contain images from laboratory settings, PlantSeg primarily comprises in-the-wild plant disease images. This choice enhances the practical applicability, as the trained models can be applied for integrated disease management. (3) Scale: PlantSeg is extensive, featuring 11,400 images with disease segmentation masks and an additional 8,000 healthy plant images categorized by plant type. Extensive technical experiments validate the high quality of PlantSeg’s annotations. This dataset not only allows researchers to evaluate their image classification methods but also provides a critical foundation for developing and benchmarking advanced plant disease segmentation algorithms.

[CV-39] MultiCounter: Multiple Action Agnostic Repetition Counting in Untrimmed Videos ECAI2024

Link: https://arxiv.org/abs/2409.04035
Authors: Yin Tang, Wei Luo, Jinrui Zhang, Wei Huang, Ruihai Jing, Deyu Zhang
Keywords: Multi-instance Repetitive Action, repetitive actions performed, Repetitive Action Counting, Multi-instance Repetitive, repetitive actions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECAI 2024

Abstract:Multi-instance Repetitive Action Counting (MRAC) aims to estimate the number of repetitive actions performed by multiple instances in untrimmed videos, commonly found in human-centric domains like sports and exercise. In this paper, we propose MultiCounter, a fully end-to-end deep learning framework that enables simultaneous detection, tracking, and counting of repetitive actions of multiple human instances. Specifically, MultiCounter incorporates two novel modules: 1) mixed spatiotemporal interaction for efficient context correlation across consecutive frames, and 2) task-specific heads for accurate perception of periodic boundaries and generalization for action-agnostic human instances. We train MultiCounter on a synthetic dataset called MultiRep generated from annotated real-world videos. Experiments on the MultiRep dataset validate the fundamental challenge of MRAC tasks and showcase the superiority of our proposed model. Compared to ByteTrack+RepNet, a solution that combines an advanced tracker with a single repetition counter, MultiCounter substantially improves Period-mAP by 41.0%, reduces AvgMAE by 58.6%, and increases AvgOBO 1.48 times. This sets a new benchmark in the field of MRAC. Moreover, MultiCounter runs in real-time on a commodity GPU server and is insensitive to the number of human instances in a video.
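
The abstract reports AvgMAE and AvgOBO without defining them. The sketch below assumes the definitions commonly used in repetition counting: MAE as the mean absolute count error normalized by the ground-truth count, and OBO (off-by-one accuracy) as the fraction of predictions within one repetition of the ground truth. How the paper averages these over instances is not stated, so the function names and the sample counts are hypothetical.

```python
def mae(preds, gts):
    """Mean absolute error of predicted counts, normalized by the true count."""
    return sum(abs(p - g) / g for p, g in zip(preds, gts)) / len(gts)

def obo(preds, gts):
    """Off-by-one accuracy: share of predictions within +/-1 of the true count."""
    return sum(abs(p - g) <= 1 for p, g in zip(preds, gts)) / len(gts)

# Hypothetical per-person counts in one video
preds = [10, 7, 3]
gts = [10, 8, 6]
print(round(mae(preds, gts), 4), round(obo(preds, gts), 4))
```

Lower MAE and higher OBO are better, which matches the direction of the improvements quoted in the abstract.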

[CV-40] Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics ECCV

Link: https://arxiv.org/abs/2409.04033
Authors: Woojin Cho, Jihyun Lee, Minjae Yi, Minje Kim, Taeyun Woo, Donghwan Kim, Taewook Ha, Hyokeun Lee, Je-Hwan Ryu, Woontack Woo, Tae-Kyun Kim
Keywords: Existing datasets, hand-object interaction, dataset, hand, Existing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages except for references. It will be published at European Conference on Computer Vision (ECCV) 2024

Abstract:Existing datasets for 3D hand-object interaction are limited either in the data cardinality, data variations in interaction scenarios, or the quality of annotations. In this work, we present a comprehensive new training dataset for hand-object interaction called HOGraspNet. It is the only real dataset that captures full grasp taxonomies, providing grasp annotation and wide intraclass variations. Using grasp taxonomies as atomic actions, their space and time combinatorial can represent complex hand activities around objects. We select 22 rigid objects from the YCB dataset and 8 other compound objects using shape and size taxonomies, ensuring coverage of all hand grasp configurations. The dataset includes diverse hand shapes from 99 participants aged 10 to 74, continuous video frames, and a 1.5M RGB-Depth of sparse frames with annotations. It offers labels for 3D hand and object meshes, 3D keypoints, contact maps, and grasp labels. Accurate hand and object 3D meshes are obtained by fitting the hand parametric model (MANO) and the hand implicit function (HALO) to multi-view RGBD frames, with the MoCap system only for objects. Note that HALO fitting does not require any parameter tuning, enabling scalability to the dataset’s size with comparable accuracy to MANO. We evaluate HOGraspNet on relevant tasks: grasp classification and 3D hand pose estimation. The result shows performance variations based on grasp type and object class, indicating the potential importance of the interaction space captured by our dataset. The provided data aims at learning universal shape priors or foundation models for 3D hand-object interaction. Our dataset and code are available at this https URL.

[CV-41] BFA-YOLO: Balanced multiscale object detection network for multi-view building facade attachments detection

Link: https://arxiv.org/abs/2409.04025
Authors: Yangguang Chen, Tong Wang, Guanzhou Chen, Kun Zhu, Xiaoliang Tan, Jiaqi Wang, Hong Xie, Wenlin Zhou, Jingyi Zhao, Qing Wang, Xiaolong Luo, Xiaodong Zhang
Keywords: air conditioner units, glass curtain walls, curtain walls plays, facade attachments detection, facade attachments
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages

Abstract:Detection of building facade attachments such as doors, windows, balconies, air conditioner units, billboards, and glass curtain walls plays a pivotal role in numerous applications. Building facade attachments detection aids in building information modeling (BIM) construction and meeting Level of Detail 3 (LOD3) standards. Yet, it faces challenges like uneven object distribution, small object detection difficulty, and background interference. To counter these, we propose BFA-YOLO, a model for detecting facade attachments in multi-view images. BFA-YOLO incorporates three novel innovations: the Feature Balanced Spindle Module (FBSM) for addressing uneven distribution, the Target Dynamic Alignment Task Detection Head (TDATH) aimed at improving small object detection, and the Position Memory Enhanced Self-Attention Mechanism (PMESA) to combat background interference, with each component specifically designed to solve its corresponding challenge. Detection efficacy of deep network models deeply depends on the dataset’s characteristics. Existing open source datasets related to building facades are limited by their single perspective, small image pool, and incomplete category coverage. We propose a novel method for building facade attachments detection dataset construction and construct the BFA-3D dataset for facade attachments detection. The BFA-3D dataset features multi-view, accurate labels, diverse categories, and detailed classification. BFA-YOLO surpasses YOLOv8 by 1.8% and 2.9% in mAP@0.5 on the multi-view BFA-3D and street-view Facade-WHU datasets, respectively. These results underscore BFA-YOLO’s superior performance in detecting facade attachments.

[CV-42] Towards Energy-Efficiency by Navigating the Trilemma of Energy Latency and Accuracy

Link: https://arxiv.org/abs/2409.04018
Authors: Boyuan Tian, Yihan Pang, Muhammad Huzaifa, Shenlong Wang, Sarita Adve
Keywords: Extended Reality, resource constraints, enables immersive experiences, untethered headsets, headsets but suffers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ISMAR 2024

Abstract:Extended Reality (XR) enables immersive experiences through untethered headsets but suffers from stringent battery and resource constraints. Energy-efficient design is crucial to ensure both longevity and high performance in XR devices. However, latency and accuracy are often prioritized over energy, leading to a gap in achieving energy efficiency. This paper examines scene reconstruction, a key building block for immersive XR experiences, and demonstrates how energy efficiency can be achieved by navigating the trilemma of energy, latency, and accuracy. We explore three classes of energy-oriented optimizations, covering the algorithm, execution, and data, that reveal a broad design space through configurable parameters. Our resulting 72 designs expose a wide range of latency and energy trade-offs, with a smaller range of accuracy loss. We identify a Pareto-optimal curve and show that the designs on the curve are achievable only through synergistic co-optimization of all three optimization classes and by considering the latency and accuracy needs of downstream scene reconstruction consumers. Our analysis covering various use cases and measurements on an embedded class system shows that, relative to the baseline, our designs offer energy benefits of up to 60X with potential latency range of 4X slowdown to 2X speedup. Detailed exploration of a use case across representative data sequences from ScanNet showed about 25X energy savings with 1.5X latency reduction and negligible reconstruction quality loss.
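
To make the "Pareto-optimal curve" idea concrete, here is a minimal sketch of how non-dominated designs could be picked out of a set of candidates scored on energy, latency, and accuracy loss. The `Design` type, the design names, and every number are made up for illustration; they are not the paper's 72 designs or its measurements.

```python
from typing import NamedTuple

class Design(NamedTuple):
    name: str
    energy: float   # lower is better
    latency: float  # lower is better
    error: float    # lower is better (e.g. reconstruction quality loss)

def dominates(a: Design, b: Design) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    no_worse = a.energy <= b.energy and a.latency <= b.latency and a.error <= b.error
    better = a.energy < b.energy or a.latency < b.latency or a.error < b.error
    return no_worse and better

def pareto_front(designs):
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]

# Hypothetical candidates
designs = [
    Design("baseline", 60.0, 1.0, 0.02),
    Design("A", 10.0, 2.0, 0.03),
    Design("B", 12.0, 2.5, 0.04),  # worse than A on all three objectives
    Design("C", 1.0, 4.0, 0.05),
]
front_names = [d.name for d in pareto_front(designs)]
print(front_names)
```

A downstream consumer with a latency or accuracy budget would then choose among the designs left on the front, which is the selection step the abstract alludes to.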

[CV-43] 3D-GP-LMVIC: Learning-based Multi-View Image Coding with 3D Gaussian Geometric Priors

Link: https://arxiv.org/abs/2409.04013
Authors: Yujun Huang, Bin Chen, Niu Lian, Baoyi An, Shu-Tao Xia
Keywords: Multi-view image, Gaussian Splatting, Gaussian geometric priors, views, Multi-view
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Multimedia (cs.MM)
Comments: 19 pages, 8 figures, conference

Abstract:Multi-view image compression is vital for 3D-related applications. To effectively model correlations between views, existing methods typically predict disparity between two views on a 2D plane, which works well for small disparities, such as in stereo images, but struggles with larger disparities caused by significant view changes. To address this, we propose a novel approach: learning-based multi-view image coding with 3D Gaussian geometric priors (3D-GP-LMVIC). Our method leverages 3D Gaussian Splatting to derive geometric priors of the 3D scene, enabling more accurate disparity estimation across views within the compression model. Additionally, we introduce a depth map compression model to reduce redundancy in geometric information between views. A multi-view sequence ordering method is also proposed to enhance correlations between adjacent views. Experimental results demonstrate that 3D-GP-LMVIC surpasses both traditional and learning-based methods in performance, while maintaining fast encoding and decoding speed.

[CV-44] Hybrid Mask Generation for Infrared Small Target Detection with Single-Point Supervision

Link: https://arxiv.org/abs/2409.04011
Authors: Weijie He, Mushui Liu, Yunlong Yu, Zheming Lu, Xi Li
Keywords: Single-frame infrared small, significant challenge due, infrared background clutter, complex infrared background, amidst complex infrared
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures

Abstract:Single-frame infrared small target (SIRST) detection poses a significant challenge due to the requirement to discern minute targets amidst complex infrared background clutter. Recently, deep learning approaches have shown promising results in this domain. However, these methods heavily rely on extensive manual annotations, which are particularly cumbersome and resource-intensive for infrared small targets owing to their minute sizes. To address this limitation, we introduce a Hybrid Mask Generation (HMG) approach that recovers high-quality masks for each target from only a single-point label for network training. Specifically, our HMG approach consists of a handcrafted Points-to-Mask Generation strategy coupled with a pseudo mask updating strategy to recover and refine pseudo masks from point labels. The Points-to-Mask Generation strategy is divided into two distinct stages: Points-to-Box conversion, where individual point labels are transformed into bounding boxes, and subsequently, Box-to-Mask prediction, where these bounding boxes are elaborated into precise masks. The mask updating strategy integrates the complementary strengths of handcrafted and deep-learning algorithms to iteratively refine the initial pseudo masks. Experimental results across three datasets demonstrate that our method outperforms the existing methods for infrared small target detection with single-point supervision.

[CV-45] Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task

Link: https://arxiv.org/abs/2409.04005
Authors: Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang
Keywords: involves redundant computation, redundant computation due, transformers involves redundant, diffusion transformers involves, Token Diffusion Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy Token Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, in each transformer block, we randomly sample one token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 48% reduction compared to DiT and a 35% reduction compared to Pixart-alpha). Our source code is available at this https URL.
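
The proxy-token steps described in the abstract (sample one token per window, run self-attention among the proxies, inject the result into all tokens via cross-attention) can be sketched in a few lines of plain Python. This is a toy under strong assumptions: windows are 1D rather than spatial-temporal, queries, keys, and values use no learned projections, the injection is a simple residual add, and the window and shifted-window attention branch is left out. The function names are hypothetical and this is not the released PT-DiT code.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))])
    return out

def proxy_token_block(tokens, window, rng):
    # 1) Randomly sample one proxy token from each window of `window` tokens.
    proxies = [rng.choice(tokens[i:i + window]) for i in range(0, len(tokens), window)]
    # 2) Self-attention among the (few) proxy tokens captures global context.
    proxies = attend(proxies, proxies, proxies)
    # 3) Cross-attention: every token queries the proxies; add the result back
    #    (the residual add is an assumption of this sketch).
    injected = attend(tokens, proxies, proxies)
    mixed = [[t + g for t, g in zip(tok, inj)] for tok, inj in zip(tokens, injected)]
    return mixed, len(proxies)

rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
out, n_proxies = proxy_token_block(tokens, window=4, rng=rng)
print(len(out), len(out[0]), n_proxies)  # 16 4 4
```

With 16 tokens and a window of 4, the two attention calls score 4×4 + 16×4 = 80 query-key pairs instead of the 256 that full self-attention would need, which is the source of the savings the abstract describes.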

[CV-46] One-Shot Diffusion Mimicker for Handwritten Text Generation ECCV2024

Link: https://arxiv.org/abs/2409.04004
Authors: Gang Dai, Yifan Zhang, Quhui Ke, Qiangya Guo, Shuangping Huang
Keywords: Existing handwritten text, Existing handwritten, Existing, One-shot Diffusion Mimicker, sample
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in ECCV 2024

Abstract:Existing handwritten text generation methods often require more than ten handwriting samples as style references. However, in practical applications, users tend to prefer a handwriting generation model that operates with just a single reference sample for its convenience and efficiency. This approach, known as “one-shot generation”, significantly simplifies the process but poses a significant challenge due to the difficulty of accurately capturing a writer’s style from a single sample, especially when extracting fine details from the characters’ edges amidst sparse foreground and undesired background noise. To address this problem, we propose a One-shot Diffusion Mimicker (One-DM) to generate handwritten text that can mimic any calligraphic style with only one reference sample. Inspired by the fact that high-frequency information of the individual sample often contains distinct style patterns (e.g., character slant and letter joining), we develop a novel style-enhanced module to improve the style extraction by incorporating high-frequency components from a single sample. We then fuse the style features with the text content as a merged condition for guiding the diffusion model to produce high-quality handwritten text images. Extensive experiments demonstrate that our method can successfully generate handwriting scripts with just one sample reference in multiple languages, even outperforming previous methods using over ten samples. Our source code is available at this https URL.

[CV-47] DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

Link: https://arxiv.org/abs/2409.04003
Authors: Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Tiantian Wei, Min Dou, Botian Shi, Yong Liu
Keywords: facilitated downstream perception, Recent advances, planning tasks, advances in diffusion, facilitated downstream
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Second place solution for W-CODA-Track2

Abstract:Recent advances in diffusion models have significantly enhanced the controllable generation of streetscapes and facilitated downstream perception and planning tasks. However, challenges such as maintaining temporal coherence, generating long videos, and accurately modeling driving scenes persist. Accordingly, we propose DreamForge, an advanced diffusion-based autoregressive video generation model designed for the long-term generation of 3D-controllable and extensible video. In terms of controllability, our DreamForge supports flexible conditions such as text descriptions, camera poses, 3D bounding boxes, and road layouts, while also providing perspective guidance to produce driving scenes that are both geometrically and contextually accurate. For consistency, we ensure inter-view consistency through cross-view attention and temporal coherence via an autoregressive architecture enhanced with motion cues. Codes will be available at this https URL.

[CV-48] Boundary feature fusion network for tooth image segmentation MICCAI

Link: https://arxiv.org/abs/2409.03982
Authors: Dongping Zhang, Zheng Li, Fangao Zeng, Yutong Wei
Keywords: human body identification, dental pathology assessment, medical image segmentation, pathology assessment, critical technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI workshop, see this https URL

Abstract:Tooth segmentation is a critical technology in the field of medical image segmentation, with applications ranging from orthodontic treatment to human body identification and dental pathology assessment. Despite the development of numerous tooth image segmentation models by researchers, a common shortcoming is the failure to account for the challenges of blurred tooth boundaries. Dental diagnostics require precise delineation of tooth boundaries. This paper introduces an innovative tooth segmentation network that integrates boundary information to address the issue of indistinct boundaries between teeth and adjacent tissues. This network’s core is its boundary feature extraction module, which is designed to extract detailed boundary information from high-level features. Concurrently, the feature cross-fusion module merges detailed boundary and global semantic information in a synergistic way, allowing for stepwise layer transfer of feature information. This method results in precise tooth segmentation. In the most recent STS Data Challenge, our methodology was rigorously tested and received a commendable overall score of 0.91. When compared to other existing approaches, this score demonstrates our method’s significant superiority in segmenting tooth boundaries.

[CV-49] Generating Faithful and Salient Text from Multimodal Data

Link: https://arxiv.org/abs/2409.03961
Authors: Tahsina Hashem, Weiqing Wang, Derry Tanti Wijaya, Mohammed Eunus Ali, Yuan-Fang Li
Keywords: obtained strong performance, multimodal tasks, large multimodal models, large multimodal, obtained strong
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented in knowledge graphs or tables). Specifically, we train a small vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs’ generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination.

[CV-50] FODA-PG for Enhanced Medical Imaging Narrative Generation: Adaptive Differentiation of Normal and Abnormal Attributes

Link: https://arxiv.org/abs/2409.03947
Authors: Kai Shu, Yuzhuo Jia, Ziyang Zhang, Jiechao Gao
Keywords: Automatic Medical Imaging, Medical Imaging Narrative, Imaging Narrative generation, Imaging Narrative, Narrative generation aims
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Automatic Medical Imaging Narrative generation aims to alleviate the workload of radiologists by producing accurate clinical descriptions directly from radiological images. However, the subtle visual nuances and domain-specific terminology in medical images pose significant challenges compared to generic image captioning tasks. Existing approaches often neglect the vital distinction between normal and abnormal findings, leading to suboptimal performance. In this work, we propose FODA-PG, a novel Fine-grained Organ-Disease Adaptive Partitioning Graph framework that addresses these limitations through domain-adaptive learning. FODA-PG constructs a granular graphical representation of radiological findings by separating disease-related attributes into distinct “disease-specific” and “disease-free” categories based on their clinical significance and location. This adaptive partitioning enables our model to capture the nuanced differences between normal and pathological states, mitigating the impact of data biases. By integrating this fine-grained semantic knowledge into a powerful transformer-based architecture and providing rigorous mathematical justifications for its effectiveness, FODA-PG generates precise and clinically coherent reports with enhanced generalization capabilities. Extensive experiments on the IU-Xray and MIMIC-CXR benchmarks demonstrate the superiority of our approach over state-of-the-art methods, highlighting the importance of domain adaptation in medical report generation.

[CV-51] TropNNC: Structured Neural Network Compression Using Tropical Geometry

Link: https://arxiv.org/abs/2409.03945
Authors: Konstantinos Fotopoulos, Petros Maragos, Panagiotis Misiakos
Keywords: structured pruning framework, ReLU activations, structured pruning, present TropNNC, convolutional layers
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present TropNNC, a structured pruning framework for compressing neural networks with linear and convolutional layers and ReLU activations. Our approximation is based on a geometrical approach to machine/deep learning, using tropical geometry and extending the work of Misiakos et al. (2022). We use the Hausdorff distance of zonotopes in its standard continuous form to achieve a tighter approximation bound for tropical polynomials compared to Misiakos et al. (2022). This enhancement allows for superior functional approximations of neural networks, leading to a more effective compression algorithm. Our method is significantly easier to implement compared to other frameworks, and does not depend on the availability of training data samples. We validate our framework through extensive empirical evaluations on the MNIST, CIFAR, and ImageNet datasets. Our results demonstrate that TropNNC achieves performance on par with the state-of-the-art method ThiNet, even surpassing it in compressing linear layers, and to the best of our knowledge, it is the first method that achieves this using tropical geometry.
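
For readers unfamiliar with the Hausdorff distance mentioned in the abstract, the sketch below computes it between two finite point sets. This is only the textbook discrete definition; the paper works with zonotopes in continuous form, which this toy does not attempt to reproduce, and the point sets are invented for illustration.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets (Euclidean metric)."""
    def directed(X, Y):
        # Largest distance from a point of X to its nearest neighbour in Y.
        return max(min(math.dist(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))

# Hypothetical point sets in the plane
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(hausdorff(A, B))  # 2.0
```

A small Hausdorff distance means every point of one set has a close counterpart in the other, which is why it serves as a natural yardstick when one geometric object is used to approximate another.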

[CV-52] HUMOS: Human Motion Model Conditioned on Body Shape ECCV’24

Link: https://arxiv.org/abs/2409.03944
Authors: Shashank Tripathi, Omid Taheri, Christoph Lassner, Michael J. Black, Daniel Holden, Carsten Stoll
Keywords: Generating realistic human, graphics applications, Generating realistic, computer vision, vision and graphics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in ECCV’24. Project page: this https URL

Abstract:Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don’t match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it’s possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods. More details are available on our project page this https URL.

[CV-53] Deep Clustering of Remote Sensing Scenes through Heterogeneous Transfer Learning

Link: https://arxiv.org/abs/2409.03938
Authors: Isaac Ray, Alexei Skurikhin
Keywords: unsupervised whole-image clustering, remote sensing, remote sensing scenes, paper proposes, unsupervised whole-image
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)

Abstract:This paper proposes a method for unsupervised whole-image clustering of a target dataset of remote sensing scenes with no labels. The method consists of three main steps: (1) finetuning a pretrained deep neural network (DINOv2) on a labelled source remote sensing imagery dataset and using it to extract a feature vector from each image in the target dataset, (2) reducing the dimension of these deep features via manifold projection into a low-dimensional Euclidean space, and (3) clustering the embedded features using a Bayesian nonparametric technique to infer the number and membership of clusters simultaneously. The method takes advantage of heterogeneous transfer learning to cluster unseen data with different feature and label distributions. We demonstrate the performance of this approach outperforming state-of-the-art zero-shot classification methods on several remote sensing scene classification datasets.

[CV-54] Data-Efficient Generation for Dataset Distillation

Link: https://arxiv.org/abs/2409.03929
Authors: Zhe Li, Weitong Zhang, Sarah Cechnicka, Bernhard Kainz
Keywords: exponentially increased data, increased data storage, deep learning techniques, techniques have proven, proven successful
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures

Abstract:While deep learning techniques have proven successful in image-related tasks, the exponentially increased data storage and computation costs become a significant challenge. Dataset distillation addresses these challenges by synthesizing only a few images for each class that encapsulate all essential information. Most current methods focus on matching. The problems lie in the synthetic images not being human-readable and the dataset performance being insufficient for downstream learning tasks. Moreover, the distillation time can quickly get out of bounds when the number of synthetic images per class increases even slightly. To address this, we train a class conditional latent diffusion model capable of generating realistic synthetic images with labels. The sampling time can be reduced to several tens of images per second. We demonstrate that models can be effectively trained using only a small set of synthetic images and evaluated on a large real test set. Our approach achieved rank (1) in The First Dataset Distillation Challenge at ECCV 2024 on the CIFAR100 and TinyImageNet datasets.

[CV-55] Image Recognition for Garbage Classification Based on Pixel Distribution Learning

Link: https://arxiv.org/abs/2409.03913
Authors: Jenil Kanani
Keywords: waste production due, necessitates efficient waste, efficient waste management, waste management strategies, industrial development necessitates
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The exponential growth in waste production due to rapid economic and industrial development necessitates efficient waste management strategies to mitigate environmental pollution and resource depletion. Leveraging advancements in computer vision, this study proposes a novel approach inspired by pixel distribution learning techniques to enhance automated garbage classification. The method aims to address limitations of conventional convolutional neural network (CNN)-based approaches, including computational complexity and vulnerability to image variations. We will conduct experiments using the Kaggle Garbage Classification dataset, comparing our approach with existing models to demonstrate the strength and efficiency of pixel distribution learning in automated garbage classification technologies.

[CV-56] The Role of Generative Systems in Historical Photography Management: A Case Study on Catalan Archives ECCV

Link: https://arxiv.org/abs/2409.03911
Authors: Èric Śanchez, Adrià Molina, Oriol Ramos Terrades
Keywords: automated photography management, heritage institutions, image analysis, analysis in automated, automated photography
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV workshop AI4DH

Abstract:The use of image analysis in automated photography management is an increasing trend in heritage institutions. Such tools alleviate the human cost associated with the manual and expensive annotation of new data sources while facilitating fast access to the citizenship through online indexes and search engines. However, available tagging and description tools are usually designed around modern photographs in English, neglecting historical corpora in minoritized languages, each of which exhibits intrinsic particularities. The primary objective of this research is to study the quantitative contribution of generative systems in the description of historical sources. This is done by contextualizing the task of captioning historical photographs from the Catalan archives as a case study. Our findings provide practitioners with tools and directions on transfer learning for captioning models based on visual adaptation and linguistic proximity.

[CV-57] On-board Satellite Image Classification for Earth Observation: A Comparative Study of Pre-Trained Vision Transformer Models

Link: https://arxiv.org/abs/2409.03901
Authors: Thanh-Dung Le, Vu Nguyen Ha, Ti Ti Nguyen, Geoffrey Eappen, Prabhu Thiruvasagam, Luis M. Garces-Socarras, Hong-fu Chou, Jorge L. Gonzalez-Rios, Juan Carlos Merlano-Duncan, Symeon Chatzinotas
Keywords: convolutional neural networks, deep learning techniques, traditionally dominated, neural networks, learning techniques
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:Remote sensing image classification is a critical component of Earth observation (EO) systems, traditionally dominated by convolutional neural networks (CNNs) and other deep learning techniques. However, the advent of Transformer-based architectures and large-scale pre-trained models has significantly shifted, offering enhanced performance and efficiency. This study focuses on identifying the most effective pre-trained model for land use classification in onboard satellite processing, emphasizing achieving high accuracy, computational efficiency, and robustness against noisy data conditions commonly encountered during satellite-based inference. Through extensive experimentation, we compared traditional CNN-based models, ResNet-based models, and various pre-trained vision Transformer models. Our findings demonstrate that pre-trained Transformer models, particularly MobileViTV2 and EfficientViT-M2, outperform models trained from scratch in accuracy and efficiency. These models achieve high performance with reduced computational requirements and exhibit greater resilience during inference under noisy conditions. While MobileViTV2 excelled on clean validation data, EfficientViT-M2 proved more robust when handling noise, making it the most suitable model for onboard satellite Earth observation tasks. In conclusion, EfficientViT-M2 is the optimal choice for reliable and efficient remote sensing image classification in satellite operations, achieving 98.76% accuracy, precision, and recall. Specifically, EfficientViT-M2 delivered the highest performance across all metrics, excelled in training efficiency (1,000s) and inference time (10s), and demonstrated greater robustness (overall robustness score at 0.79).

[CV-58] MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition

Link: https://arxiv.org/abs/2409.03890
Authors: Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
Keywords: Video Transformer Network, hand gesture recognition, Multiscale Video Transformer, gesture recognition, dynamic hand gesture
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:In this paper, we introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition, since multiscale features can extract features with variable size, pose, and shape of hand which is a challenge in hand gesture recognition. The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures which enhances the model’s ability. This multiscale hierarchy is obtained by extracting different dimensions of attention in different transformer stages with initial stages to model high-resolution features and later stages to model low-resolution features. Our approach also leverages multimodal data, utilizing depth maps, infrared data, and surface normals along with RGB images from NVGesture and Briareo datasets. Experiments show that the proposed MVTN achieves state-of-the-art results with less computational complexity and parameters. The source code is available at this https URL.

[CV-59] The Influence of Faulty Labels in Data Sets on Human Pose Estimation

Link: https://arxiv.org/abs/2409.03887
Authors: Arnold Schwarz, Levente Hernadi, Felix Bießmann, Kristian Hildebrand
Keywords: Human Pose Estimation, Pose Estimation, Human Pose, provide empirical evidence, empirical evidence demonstrating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures, 5 tables

Abstract:In this study we provide empirical evidence demonstrating that the quality of training data impacts model performance in Human Pose Estimation (HPE). Inaccurate labels in widely used data sets, ranging from minor errors to severe mislabeling, can negatively influence learning and distort performance metrics. We perform an in-depth analysis of popular HPE data sets to show the extent and nature of label inaccuracies. Our findings suggest that accounting for the impact of faulty labels will facilitate the development of more robust and accurate HPE models for a variety of real-world applications. We show improved performance with cleansed data.

[CV-60] Multi-Camera Industrial Open-Set Person Re-Identification and Tracking ECCV2024

Link: https://arxiv.org/abs/2409.03879
Authors: Federico Cunico, Marco Cristani
Keywords: person re-identification led, deep learning approaches, recent years, impressive results, development of deep
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at T-CAP workshop at ECCV 2024

Abstract:In recent years, the development of deep learning approaches for the task of person re-identification led to impressive results. However, this comes with a limitation for industrial and practical real-world applications. Firstly, most of the existing works operate on closed-world scenarios, in which the people to re-identify (probes) are compared to a closed-set (gallery). Real-world scenarios often are open-set problems in which the gallery is not known a priori, but the number of open-set approaches in the literature is significantly lower. Secondly, challenges such as multi-camera setups, occlusions, real-time requirements, etc., further constrain the applicability of off-the-shelf methods. This work presents MICRO-TRACK, a Modular Industrial multi-Camera Re-identification and Open-set Tracking system that is real-time, scalable, and easy to integrate into existing industrial surveillance scenarios. Furthermore, we release a novel Re-ID and tracking dataset acquired in an industrial manufacturing facility, dubbed Facility-ReID, consisting of 18-minute videos captured by 8 surveillance cameras.
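
The closed-set versus open-set distinction in this abstract can be illustrated with a small sketch. This is not the MICRO-TRACK pipeline: the cosine-similarity matching, the `threshold` value, and the names `cosine` and `assign_identity` are all assumptions made for illustration. The point is only that in an open-set setting the gallery starts empty and grows whenever a probe matches nobody well enough.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_identity(embedding, gallery, threshold=0.8):
    """Return an identity for `embedding`, enrolling a new one if nothing matches.

    `gallery` maps identity id -> reference embedding and is updated in place.
    The threshold is a hypothetical value, not one taken from the paper.
    """
    best_id, best_sim = None, -1.0
    for identity, reference in gallery.items():
        sim = cosine(embedding, reference)
        if sim > best_sim:
            best_id, best_sim = identity, sim
    # Closed-set systems would always return best_id; the open-set variant
    # rejects weak matches and treats the probe as a previously unseen person.
    if best_id is not None and best_sim >= threshold:
        return best_id
    new_id = len(gallery)
    gallery[new_id] = list(embedding)
    return new_id
```

Starting from an empty `gallery = {}`, the first probe is always enrolled as identity 0, and later probes either reuse an existing identity or extend the gallery.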

[CV-61] Ground-roll Separation From Land Seismic Records Based on Convolutional Neural Network

Link: https://arxiv.org/abs/2409.03878
Authors: Zhuang Jia, Wenkai Lu, Meng Zhang, Yongkang Miao
Keywords: common coherent noise, land field seismic, field seismic data, Rayleigh-type surface wave, common coherent
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Geophysics (physics.geo-ph)
Comments:

Abstract:Ground-roll wave is a common coherent noise in land field seismic data. This Rayleigh-type surface wave usually has low frequency, low apparent velocity, and high amplitude, therefore obscures the reflection events of seismic shot gathers. Commonly used techniques focus on the differences of ground-roll and reflection in transformed domain such as f-k domain, wavelet domain, or curvelet domain. These approaches use a series of fixed atoms or bases to transform the data in time-space domain into transformed domain to separate different waveforms, thus tend to suffer from the complexity for a delicate design of the parameters of the transform domain filter. To deal with these problems, a novel way is proposed to separate ground-roll from reflections using convolutional neural network (CNN) model based method to learn to extract the features of ground-roll and reflections automatically based on training data. In the proposed method, low-pass filtered seismic data which is contaminated by ground-roll wave is used as input of CNN, and then outputs both ground-roll component and low-frequency part of reflection component simultaneously. Discriminative loss is applied together with similarity loss in the training process to enhance the similarity to their train labels as well as the difference between the two outputs. Experiments are conducted on both synthetic and real data, showing that CNN based method can separate ground roll from reflections effectively, and has generalization ability to a certain extent.

[CV-62] Few-shot Adaptation of Medical Vision-Language Models MICCAI2024

Link: https://arxiv.org/abs/2409.03868
Authors: Fereshteh Shakeri, Yunshi Huang, Julio Silva-Rodríguez, Houda Bahig, An Tang, Jose Dolz, Ismail Ben Ayed
Keywords: medical imaging research, Integrating image, imaging research, data through multi-modal, multi-modal learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024 (Spotlight) - Code is available at this https URL

Abstract:Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable efforts have been dedicated to establishing medical foundation models and their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Following on from the currently strong emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields competitive performances in comparison to convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We made our benchmark and code publicly available to trigger further developments in this emergent subject: this https URL.
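
One possible reading of the "text-informed linear probe" described above is sketched below. It assumes that each class weight is the visual prototype plus a class-wise multiplier times the text embedding, and that logits are plain dot products with the image feature. The function names and this exact formulation are assumptions, and the multipliers are passed in as fixed numbers here rather than learned.

```python
def blended_logits(feature, prototypes, text_embeddings, multipliers):
    """Score an image feature against class weights w_k = prototype_k + alpha_k * text_k."""
    logits = []
    for proto, text, alpha in zip(prototypes, text_embeddings, multipliers):
        # Blend the visual prototype with the text embedding for this class.
        weight = [p + alpha * t for p, t in zip(proto, text)]
        logits.append(sum(f * w for f, w in zip(feature, weight)))
    return logits

def predict(feature, prototypes, text_embeddings, multipliers):
    """Return the index of the class with the highest blended logit."""
    logits = blended_logits(feature, prototypes, text_embeddings, multipliers)
    return max(range(len(logits)), key=logits.__getitem__)
```

With all multipliers set to zero the sketch reduces to a prototype-only linear probe, so the multipliers control how much the text side is allowed to shift each class boundary.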

[CV-63] Assessing the Uncertainty and Robustness of Object Detection Models for Detecting Stickers on Laptops

Link: https://arxiv.org/abs/2409.03782
Authors: Chengjie Lu, Jiahui Wu, Shaukat Ali, Mikkel Labori Olsen
Keywords: reducing electronic waste, Danish Technological Institute, Refurbishing laptops extends, including laptop refurbishing, electronic waste
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 6 figures, 4 tables

Abstract:Refurbishing laptops extends their lives while contributing to reducing electronic waste, which promotes building a sustainable future. To this end, the Danish Technological Institute (DTI) focuses on the research and development of several applications, including laptop refurbishing. This has several steps, including cleaning, which involves identifying and removing stickers from laptop surfaces. DTI trained six sticker detection models (SDMs) based on open-source object detection models to identify such stickers precisely so these stickers can be removed automatically. However, given the diversity in types of stickers (e.g., shapes, colors, locations), identification of the stickers is highly uncertain, thereby requiring explicit quantification of uncertainty associated with the identified stickers. Such uncertainty quantification can help reduce risks in removing stickers, which, for example, could otherwise result in damaging laptop surfaces. For uncertainty quantification, we adopted the Monte Carlo Dropout method to evaluate the six SDMs from DTI using three datasets: the original image dataset from DTI and two datasets generated with vision language models, i.e., DALL-E-3 and Stable Diffusion-3. In addition, we presented novel robustness metrics concerning detection accuracy and uncertainty to assess the robustness of the SDMs based on adversarial datasets generated from the three datasets using a dense adversary method. Our evaluation results show that different SDMs perform differently regarding different metrics. Based on the results, we provide SDM selection guidelines and lessons learned from various perspectives.
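
The Monte Carlo Dropout idea mentioned in this abstract can be sketched on a toy model. The code below is not the paper's sticker detectors: it uses a hypothetical one-layer linear scorer, and the names `forward_with_dropout` and `mc_dropout_predict` are made up for illustration. What it shows is the general recipe of keeping dropout active at inference, running several stochastic passes, and reading the spread of the outputs as an uncertainty estimate.

```python
import random
import statistics

def forward_with_dropout(x, weights, drop_prob, rng):
    """One stochastic forward pass of a toy linear scorer with dropout kept on."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        # Inverted dropout: keep each input with probability `keep` and
        # rescale survivors so the expected output is unchanged.
        if rng.random() < keep:
            total += xi * wi / keep
    return total

def mc_dropout_predict(x, weights, drop_prob=0.5, passes=100, seed=0):
    """Return (mean, variance) of the score over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.fmean(samples), statistics.pvariance(samples)
```

Setting `drop_prob=0.0` makes every pass identical, so the variance collapses to zero; a larger variance at a nonzero dropout rate is what would be interpreted as higher predictive uncertainty.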

[CV-64] A Greedy Hierarchical Approach to Whole-Network Filter-Pruning in CNNs

Link: https://arxiv.org/abs/2409.03777
Authors: Kiran Purohit, Anurag Reddy Parvathgari, Sourangshu Bhattacharya
Keywords: Deep convolutional neural, convolutional neural networks, achieved impressive performance, Deep convolutional, computer vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in TMLR 2024

Abstract:Deep convolutional neural networks (CNNs) have achieved impressive performance in many computer vision tasks. However, their large model sizes require heavy computational resources, making pruning redundant filters from existing pre-trained CNNs an essential task in developing efficient models for resource-constrained devices. Whole-network filter pruning algorithms prune varying fractions of filters from each layer, hence providing greater flexibility. Current whole-network pruning methods are either computationally expensive due to the need to calculate the loss for each pruned filter using a training dataset, or use various heuristic / learned criteria for determining the pruning fractions for each layer. This paper proposes a two-level hierarchical approach for whole-network filter pruning which is efficient and uses the classification loss as the final criterion. The lower-level algorithm (called filter-pruning) uses a sparse-approximation formulation based on linear approximation of filter weights. We explore two algorithms: orthogonal matching pursuit-based greedy selection and a greedy backward pruning approach. The backward pruning algorithm uses a novel closed-form error criterion for efficiently selecting the optimal filter at each stage, thus making the whole algorithm much faster. The higher-level algorithm (called layer-selection) greedily selects the best-pruned layer (pruning using the filter-selection algorithm) using a global pruning criterion. We propose algorithms for two different global-pruning criteria: (1) layer-wise relative error (HBGS), and (2) final classification error (HBGTS). Our suite of algorithms outperforms state-of-the-art pruning methods on ResNet18, ResNet32, ResNet56, VGG16, and ResNext101. Our method reduces the RAM requirement for ResNext101 from 7.6 GB to 1.5 GB and achieves a 94% reduction in FLOPS without losing accuracy on CIFAR-10.

[CV-65] EMCNet: Graph-Nets for Electron Micrographs Classification KDD2022

Link: https://arxiv.org/abs/2409.03767
Authors: Sakhinana Sagar Srinivas, Rajat Kumar Sarkar, Venkataramana Runkana
Keywords: materials processing industries, Characterization of materials, processing industries, materials processing, important and challenging
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures, Accepted in an ACM SIGKDD 2022 Workshop on Machine Learning for Materials

Abstract:Characterization of materials via electron micrographs is an important and challenging task in several materials processing industries. Classification of electron micrographs is complex due to the high intra-class dissimilarity, high inter-class similarity, and multi-spatial scales of patterns. However, existing methods are ineffective in learning complex image patterns. We propose an effective end-to-end electron micrograph representation learning-based framework for nanomaterial identification to overcome the challenges. We demonstrate that our framework outperforms the popular baselines on the open-source datasets in nanomaterials-based identification tasks. The ablation studies are reported in great detail to support the efficacy of our approach.

[CV-66] OpenCap markerless motion capture estimation of lower extremity kinematics and dynamics in cycling

Link: https://arxiv.org/abs/2409.03766
Authors: Reza Kakavand, Reza Ahmadi, Atousa Parsaei, W. Brent Edwards, Amin Komeili
Keywords: Markerless motion capture, traditional marker-based systems, motion capture offers, misplacement and artifacts, offers several benefits
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Markerless motion capture offers several benefits over traditional marker-based systems by eliminating the need for physical markers, which are prone to misplacement and artifacts. Utilizing computer vision and deep learning algorithms, markerless systems can directly detect human body landmarks, reducing manual processing and errors associated with marker placement. These systems are adaptable, able to track user-defined features, and practical for real-world applications using consumer-grade devices such as smartphone cameras. This study compares the performance of OpenCap, a markerless motion capture system, with traditional marker-based systems in assessing cycling biomechanics. Ten healthy adults participated in experiments to capture sagittal hip, knee, and ankle kinematics and dynamics using both methods. OpenCap used videos from smartphones and integrated computer vision and musculoskeletal simulations to estimate 3D kinematics. Results showed high agreement between the two systems, with no significant differences in kinematic and kinetic measurements for the hip, knee, and ankle. The correlation coefficients exceeded 0.98, indicating very strong consistency. Errors were minimal, with kinematic errors under 4 degrees and kinetic errors below 5 Nm. This study concludes that OpenCap is a viable alternative to marker-based motion capture, offering comparable precision without extensive setup for hip (flexion/extension), knee (flexion/extension), and ankle (dorsiflexion/plantarflexion) joints. Future work should aim to enhance the accuracy of ankle joint measurements and extend analyses to 3D kinematics and kinetics for comprehensive biomechanical assessments.

[CV-67] AI and Entrepreneurship: Facial Recognition Technology Detects Entrepreneurs Outperforming Human Experts

Link: https://arxiv.org/abs/2409.03765
Authors: Martin Obschonka, Christian Fisch, Tharindu Fernando, Clinton Fookes
Keywords: generally considered personal, considered personal information, autonomy to disclose, generally considered, considered personal
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 46 pages, 2 tables, 11 figures

Abstract:Occupational outcomes like entrepreneurship are generally considered personal information that individuals should have the autonomy to disclose. With the advancing capability of artificial intelligence (AI) to infer private details from widely available human-centric data, such as social media, it is crucial to investigate whether AI can accurately extract private occupational information from such data. In this study, we demonstrate that deep neural networks can classify individuals as entrepreneurs based on a single facial image with high accuracy in data sourced from Crunchbase, a premier source for entrepreneurship data. Utilizing a dataset comprising facial images of 40,728 individuals, including both entrepreneurs and non-entrepreneurs, we trained a Convolutional Neural Network (CNN) and evaluated its classification performance. While human experts (n=650) and trained participants (n=133) were unable to classify entrepreneurs with accuracy above chance levels (50%), the AI model achieved a classification accuracy of 79.51%. Several robustness tests show that this high level of accuracy is maintained under various conditions.

[CV-68] Modeling Human Strategy for Flattening Wrinkled Cloth Using Neural Networks

Link: https://arxiv.org/abs/2409.03764
Authors: Nilay Kant, Ashrut Aryal, Rajiv Ranganathan, Ranjan Mukherjee, Charles Owen
Keywords: flattening wrinkled cloth, wrinkled cloth learning, paper explores, human actions, cloth
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 6 Pages

Abstract:This paper explores a novel approach to model strategies for flattening wrinkled cloth learning from humans. A human participant study was conducted where the participants were presented with various wrinkle types and tasked with flattening the cloth using the fewest actions possible. A camera and Aruco marker were used to capture images of the cloth and finger movements, respectively. The human strategies for flattening the cloth were modeled using a supervised regression neural network, where the cloth images served as input and the human actions as output. Before training the neural network, a series of image processing techniques were applied, followed by Principal Component Analysis (PCA) to extract relevant features from each image and reduce the input dimensionality. This reduction decreased the model’s complexity and computational cost. The actions predicted by the neural network closely matched the actual human actions on an independent data set, demonstrating the effectiveness of neural networks in modeling human actions for flattening wrinkled cloth.

[CV-69] A Dataset for Mechanical Mechanisms

Link: https://arxiv.org/abs/2409.03763
Authors: Farshid Ghezelbash, Amir Hossein Eskandari, Amir J Bidhendi
Keywords: consisting of approximately, aimed at supporting, study introduces, supporting research, Stable Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study introduces a dataset consisting of approximately 9,000 images of mechanical mechanisms and their corresponding descriptions, aimed at supporting research in mechanism design. The dataset consists of a diverse collection of 2D and 3D sketches, meticulously curated to ensure relevance and quality. We demonstrate the application of this dataset by fine-tuning two models: 1) Stable Diffusion (for generating new mechanical designs), and 2) BLIP-2 (for captioning these designs). While the results from Stable Diffusion show promise, particularly in generating coherent 3D sketches, the model struggles with 2D sketches and occasionally produces nonsensical outputs. These limitations underscore the need for further development, particularly in expanding the dataset and refining model architectures. Nonetheless, this work serves as a step towards leveraging generative AI in mechanical design, highlighting both the potential and current limitations of these approaches.

[CV-70] Efficient Scene Appearance Aggregation for Level-of-Detail Rendering

Link: https://arxiv.org/abs/2409.03761
Authors: Yang Zhou, Tao Huang, Ravi Ramamoorthi, Pradeep Sen, Ling-Qi Yan
Keywords: Creating an appearance-preserving, challenging problem, Scattering Distribution Function, Aggregated Bidirectional Scattering, Bidirectional Scattering Distribution
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Creating an appearance-preserving level-of-detail (LoD) representation for arbitrary 3D scenes is a challenging problem. The appearance of a scene is an intricate combination of both geometry and material models, and is further complicated by correlation due to the spatial configuration of scene elements. We present a novel volumetric representation for the aggregated appearance of complex scenes and an efficient pipeline for LoD generation and rendering. The core of our representation is the Aggregated Bidirectional Scattering Distribution Function (ABSDF) that summarizes the far-field appearance of all surfaces inside a voxel. We propose a closed-form factorization of the ABSDF that accounts for spatially varying and orientation-varying material parameters. We tackle the challenge of capturing the correlation existing locally within a voxel and globally across different parts of the scene. Our method faithfully reproduces appearance and achieves higher quality than existing scene filtering methods while being inherently efficient to render. The memory footprint and rendering cost of our representation are independent of the original scene complexity.

[CV-71] Exploring Foundation Models for Synthetic Medical Imaging: A Study on Chest X-Rays and Fine-Tuning Techniques

Link: https://arxiv.org/abs/2409.04424
Authors: Davide Clode da Silva, Marina Musse Bernardes, Nathalia Giacomini Ceretta, Gabriel Vaz de Souza, Gabriel Fonseca Silva, Rafael Heitor Bordini, Soraia Raupp Musse
Keywords: significantly advanced healthcare, Machine learning, treatment identification, learning has significantly, significantly advanced
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Machine learning has significantly advanced healthcare by aiding in disease prevention and treatment identification. However, accessing patient data can be challenging due to privacy concerns and strict regulations. Generating synthetic, realistic data offers a potential solution for overcoming these limitations, and recent studies suggest that fine-tuning foundation models can produce such data effectively. In this study, we explore the potential of foundation models for generating realistic medical images, particularly chest x-rays, and assess how their performance improves with fine-tuning. We propose using a Latent Diffusion Model, starting with a pre-trained foundation model and refining it through various configurations. Additionally, we performed experiments with input from a medical professional to assess the realism of the images produced by each trained model.

[CV-72] The Impact of Scanner Domain Shift on Deep Learning Performance in Medical Imaging: an Experimental Study

Link: https://arxiv.org/abs/2409.04368
Authors: Gregory Szumel, Brian Guo, Darui Lu, Rongze Gui, Tingyu Wang, Nicholas Konz, Maciej A. Mazurowski
Keywords: scanner domain shift, protocols can differ, differ substantially, Purpose, scanner domain
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Purpose: Medical images acquired using different scanners and protocols can differ substantially in their appearance. This phenomenon, scanner domain shift, can result in a drop in the performance of deep neural networks which are trained on data acquired by one scanner and tested on another. This significant practical issue is well-acknowledged, however, no systematic study of the issue is available across different modalities and diagnostic tasks. Materials and Methods: In this paper, we present a broad experimental study evaluating the impact of scanner domain shift on convolutional neural network performance for different automated diagnostic tasks. We evaluate this phenomenon in common radiological modalities, including X-ray, CT, and MRI. Results: We find that network performance on data from a different scanner is almost always worse than on same-scanner data, and we quantify the degree of performance drop across different datasets. Notably, we find that this drop is most severe for MRI, moderate for X-ray, and quite small for CT, on average, which we attribute to the standardized nature of CT acquisition systems which is not present in MRI or X-ray. We also study how injecting varying amounts of target domain data into the training set, as well as adding noise to the training data, helps with generalization. Conclusion: Our results provide extensive experimental evidence and quantification of the extent of performance drop caused by scanner domain shift in deep learning across different modalities, with the goal of guiding the future development of robust deep learning models for medical image analysis.

[CV-73] CISCA and CytoDArk0: a Cell Instance Segmentation and Classification method for histo(patho)logical image Analyses and a new open Nissl-stained dataset for brain cytoarchitecture studies

Link: https://arxiv.org/abs/2409.04175
Authors: Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Giulia Vadori, Livio Finos, Enrico Grisan
Keywords: complex task, biological investigations, Delineating and classifying, pivotal endeavor, medical and biological
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Delineating and classifying individual cells in microscopy tissue images is a complex task, yet it is a pivotal endeavor in various medical and biological investigations. We propose a new deep learning framework (CISCA) for automatic cell instance segmentation and classification in histological slices to support detailed morphological and structural analysis or straightforward cell counting in digital pathology workflows and brain cytoarchitecture studies. At the core of CISCA lies a network architecture featuring a lightweight U-Net with three heads in the decoder. The first head classifies pixels into boundaries between neighboring cells, cell bodies, and background, while the second head regresses four distance maps along four directions. The network outputs from the first and second heads are integrated through a tailored post-processing step, which ultimately yields the segmentation of individual cells. A third head enables simultaneous classification of cells into relevant classes, if required. We showcase the effectiveness of our method using four datasets, including CoNIC, PanNuke, and MoNuSeg, which are publicly available H&E datasets. Additionally, we introduce CytoDArk0, a novel dataset consisting of Nissl-stained images of the cortex, cerebellum, and hippocampus from mammals belonging to the orders Cetartiodactyla and Primates. We evaluate CISCA in comparison to other state-of-the-art methods, demonstrating CISCA’s robustness and accuracy in segmenting and classifying cells across diverse tissue types, magnifications, and staining techniques.

[CV-74] Optical Coherence Tomography Angiography-OCTA dataset for the study of Diabetic Retinopathy

Link: https://arxiv.org/abs/2409.04137
Authors: Pooja Bidwai, Shilpa Gite, Biswajeet Pradhan, Aditi Gupta, Kishore Pahuja
Keywords: Natasha Eye Care, Institute in Pune, Natasha Eye, Eye Care, Care and Research
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:This study presents a dataset consisting of 268 retinal images from 179 individuals, including 133 left-eye and 135 right-eye images, collected from Natasha Eye Care and Research Institute in Pune, Maharashtra, India. The images were captured using a nonmydriatic Optical Coherence Tomography Angiography (OCTA) device, specifically the Optovue Avanti Edition machine as per the protocol mentioned in this paper. Two ophthalmologists then annotated the images. This dataset can be used by researchers and doctors to develop automated diagnostic tools for early detection of diabetic retinopathy (DR).

[CV-75] EigenSR: Eigenimage-Bridged Pre-Trained RGB Learners for Single Hyperspectral Image Super-Resolution AAAI2025

Link: https://arxiv.org/abs/2409.04050
Authors: Xi Su,Xiangfei Shen,Mingyang Wan,Jing Nie,Lihui Chen,Haijun Liu,Xichuan Zhou
Keywords: single input low-resolution, Single hyperspectral image, pre-trained RGB model, input low-resolution HSI, hyperspectral image super-resolution
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to AAAI 2025

Abstract:Single hyperspectral image super-resolution (single-HSI-SR) aims to improve the resolution of a single input low-resolution HSI. Due to the bottleneck of data scarcity, the development of single-HSI-SR lags far behind that of RGB natural images. In recent years, research on RGB SR has shown that models pre-trained on large-scale benchmark datasets can greatly improve performance on unseen data, which may stand as a remedy for HSI. But how can we transfer the pre-trained RGB model to HSI, to overcome the data-scarcity bottleneck? Because of the significant difference in the number of channels between the pre-trained RGB model and the HSI, the model cannot focus on the correlation along the spectral dimension, which limits its applicability to HSI. Inspired by the HSI spatial-spectral decoupling, we propose a new framework that first fine-tunes the pre-trained model with the spatial components (known as eigenimages), and then infers on unseen HSI using an iterative spectral regularization (ISR) to maintain the spectral correlation. The advantages of our method are that 1) we effectively inject the spatial texture processing capabilities of the pre-trained RGB model into HSI while keeping spectral fidelity, 2) learning in the spectral-decorrelated domain can improve the generalizability to spectral-agnostic data, and 3) our inference in the eigenimage domain naturally exploits the spectral low-rank property of HSI, thereby reducing the complexity. This work bridges the gap between pre-trained RGB models and HSI via eigenimages, addressing the issue of limited HSI training data, hence the name EigenSR. Extensive experiments show that EigenSR outperforms the state-of-the-art (SOTA) methods in both spatial and spectral metrics. Our code will be released.

[CV-76] Bi-modality Images Transfer with a Discrete Process Matching Method

Link: https://arxiv.org/abs/2409.03977
Authors: Zhe Xiong,Qiaoqiao Ding,Xiaoqun Zhang
Keywords: medical image synthesis, image synthesis gains, rapid development, image synthesis, medical image
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recently, medical image synthesis has gained increasing popularity, along with the rapid development of generative models. Medical image synthesis aims to generate an unacquired image modality, often from other observed data modalities. Synthesized images can be used for clinical diagnostic assistance, data augmentation for model training and validation, or image quality improvement. Meanwhile, flow-based models are among the successful generative models owing to their ability to generate realistic and high-quality synthetic images. However, most flow-based models need to compute the evolution steps of a flow ordinary differential equation (ODE) during the transfer process, so their performance is significantly limited by heavy computation time due to the large number of time iterations. In this paper, we propose a novel flow-based model, namely Discrete Process Matching (DPM), to accomplish bi-modality image transfer tasks. Unlike other flow-matching-based models, we propose to utilize both forward and backward ODE flows and enhance the consistency of the intermediate images at a few discrete time steps, resulting in a transfer process with far fewer iteration steps while maintaining high-quality generation for both modalities. Our experiments on three datasets of MRI T1/T2 and CT/MRI demonstrate that DPM outperforms other state-of-the-art flow-based methods for bi-modality image synthesis, achieving higher image quality at a lower computation time cost.

[CV-77] Recon-all-clinical: Cortical surface reconstruction and analysis of heterogeneous clinical brain MRI

Link: https://arxiv.org/abs/2409.03889
Authors: Karthik Gopinath,Douglas N. Greve,Colin Magdamo,Steve Arnold,Sudeshna Das,Oula Puonti,Juan Eugenio Iglesias
Keywords: Surface-based analysis, MRI, cerebral cortex, cortex is ubiquitous, ubiquitous in human
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages in the manuscript with 11 page supplementary material

Abstract:Surface-based analysis of the cerebral cortex is ubiquitous in human neuroimaging with MRI. It is crucial for cortical registration, parcellation, and thickness estimation. Traditionally, these analyses require high-resolution, isotropic scans with good gray-white matter contrast, typically a 1mm T1-weighted scan. This excludes most clinical MRI scans, which are often anisotropic and lack the necessary T1 contrast. To enable large-scale neuroimaging studies using vast clinical data, we introduce recon-all-clinical, a novel method for cortical reconstruction, registration, parcellation, and thickness estimation in brain MRI scans of any resolution and contrast. Our approach employs a hybrid analysis method that combines a convolutional neural network (CNN) trained with domain randomization to predict signed distance functions (SDFs) and classical geometry processing for accurate surface placement while maintaining topological and geometric constraints. The method does not require retraining for different acquisitions, thus simplifying the analysis of heterogeneous clinical datasets. We tested recon-all-clinical on multiple datasets, including over 19,000 clinical scans. The method consistently produced precise cortical reconstructions and high parcellation accuracy across varied MRI contrasts and resolutions. Cortical thickness estimates are precise enough to capture aging effects independently of MRI contrast, although accuracy varies with slice thickness. Our method is publicly available at this https URL, enabling researchers to perform detailed cortical analysis on the huge amounts of already existing clinical MRI scans. This advancement may be particularly valuable for studying rare diseases and underrepresented populations where research-grade MRI data is scarce.

[CV-78] Mpox Screen Lite: AI-Driven On-Device Offline Mpox Screening for Low-Resource African Mpox Emergency Response

Link: https://arxiv.org/abs/2409.03806
Authors: Yudara Kularathne,Prathapa Janitha,Sithira Ambepitiya
Keywords: highlighted critical gaps, Africa with clade, severe in Africa, Mpox, highlighted critical
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 Pages, 2 Figures, 3 Tables

Abstract:Background: The 2024 Mpox outbreak, particularly severe in Africa with clade 1b emergence, has highlighted critical gaps in diagnostic capabilities in resource-limited settings. This study aimed to develop and validate an artificial intelligence (AI)-driven, on-device screening tool for Mpox, designed to function offline in low-resource environments. Methods: We developed a YOLOv8n-based deep learning model trained on 2,700 images (900 each of Mpox, other skin conditions, and normal skin), including synthetic data. The model was validated on 360 images and tested on 540 images. A larger external validation was conducted using 1,500 independent images. Performance metrics included accuracy, precision, recall, F1-score, sensitivity, and specificity. Findings: The model demonstrated high accuracy (96%) in the final test set. For Mpox detection, it achieved 93% precision, 97% recall, and an F1-score of 95%. Sensitivity and specificity for Mpox detection were 97% and 96%, respectively. Performance remained consistent in the larger external validation, confirming the model’s robustness and generalizability. Interpretation: This AI-driven screening tool offers a rapid, accurate, and scalable solution for Mpox detection in resource-constrained settings. Its offline functionality and high performance across diverse datasets suggest significant potential for improving Mpox surveillance and management, particularly in areas lacking traditional diagnostic infrastructure.
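
As a quick consistency check on the reported Mpox figures, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded 93% precision and 97% recall quoted in the abstract; the function name `f1_score` is ours and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract for Mpox detection.
f1 = f1_score(0.93, 0.97)
print(f"F1 = {f1:.3f}")
```

The result rounds to 95%, which agrees with the F1-score stated in the abstract.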

[CV-79] Evaluating Machine Learning-based Skin Cancer Diagnosis

Link: https://arxiv.org/abs/2409.03794
Authors: Tanish Jain
Keywords: skin cancer detection, deep learning models, cancer detection, evaluates the reliability, deep learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages

Abstract:This study evaluates the reliability of two deep learning models for skin cancer detection, focusing on their explainability and fairness. Using the HAM10000 dataset of dermatoscopic images, the research assesses two convolutional neural network architectures: a MobileNet-based model and a custom CNN model. Both models are evaluated for their ability to classify skin lesions into seven categories and to distinguish between dangerous and benign lesions. Explainability is assessed using Saliency Maps and Integrated Gradients, with results interpreted by a dermatologist. The study finds that both models generally highlight relevant features for most lesion types, although they struggle with certain classes like seborrheic keratoses and vascular lesions. Fairness is evaluated using the Equalized Odds metric across sex and skin tone groups. While both models demonstrate fairness across sex groups, they show significant disparities in false positive and false negative rates between light and dark skin tones. A Calibrated Equalized Odds postprocessing strategy is applied to mitigate these disparities, resulting in improved fairness, particularly in reducing false negative rate differences. The study concludes that while the models show promise in explainability, further development is needed to ensure fairness across different skin tones. These findings underscore the importance of rigorous evaluation of AI models in medical applications, particularly in diverse population groups.
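
For readers unfamiliar with the Equalized Odds criterion mentioned above, it asks that true positive and false positive rates match across groups. The following is a minimal sketch on made-up toy labels (the data, the two-group encoding, and the helper names `rates` and `equalized_odds_gaps` are all illustrative assumptions, not the study's code or data).

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def equalized_odds_gaps(y_true, y_pred, groups):
    """Absolute TPR and FPR gaps between group 0 and group 1."""
    per_group = {}
    for g in (0, 1):
        idx = [i for i, v in enumerate(groups) if v == g]
        per_group[g] = rates([y_true[i] for i in idx],
                             [y_pred[i] for i in idx])
    (tpr0, fpr0), (tpr1, fpr1) = per_group[0], per_group[1]
    return abs(tpr0 - tpr1), abs(fpr0 - fpr1)


# Hypothetical toy data: the classifier is perfect on group 0
# and makes one false negative and one false positive on group 1.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, groups)
print(tpr_gap, fpr_gap)
```

Since the false negative rate is one minus the true positive rate, the TPR gap equals the false negative rate gap discussed in the abstract.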

[CV-80] Exploiting XAI maps to improve MS lesion segmentation and detection in MRI

Link: https://arxiv.org/abs/2409.03772
Authors: Federico Spagnolo,Nataliia Molchanova,Mario Ocampo Pineda,Lester Melie-Garcia,Meritxell Bach Cuadra,Cristina Granziera,Vincent Andrearczyk,Adrien Depeursinge
Keywords: explain deep learning, deep learning algorithms, classification tasks, developed to explain, explain deep
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:To date, several methods have been developed to explain deep learning algorithms for classification tasks. Recently, an adaptation of two of such methods has been proposed to generate instance-level explainable maps in a semantic segmentation scenario, such as multiple sclerosis (MS) lesion segmentation. In the mentioned work, a 3D U-Net was trained and tested for MS lesion segmentation, yielding an F1 score of 0.7006, and a positive predictive value (PPV) of 0.6265. The distribution of values in explainable maps exposed some differences between maps of true and false positive (TP/FP) examples. Inspired by those results, we explore in this paper the use of characteristics of lesion-specific saliency maps to refine segmentation and detection scores. We generate around 21000 maps from as many TP/FP lesions in a batch of 72 patients (training set) and 4868 from the 37 patients in the test set. 93 radiomic features extracted from the first set of maps were used to train a logistic regression model and classify TP versus FP. On the test set, F1 score and PPV were improved by a large margin when compared to the initial model, reaching 0.7450 and 0.7817, with 95% confidence intervals of [0.7358, 0.7547] and [0.7679, 0.7962], respectively. These results suggest that saliency maps can be used to refine prediction scores, boosting a model’s performances.

Machine Learning

[LG-0] Accelerating Training with Neuron Interaction and Nowcasting Networks

Link: https://arxiv.org/abs/2409.04434
Authors: Boris Knyazev,Abhinav Moudgil,Guillaume Lajoie,Eugene Belilovsky,Simon Lacoste-Julien
Keywords: classic adaptive optimizers, learnable update rule, adaptive optimizers, learnable update, lieu of classic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: code this https URL

Abstract:Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. A simpler recently proposed approach to accelerate training is to use Adam for most of the optimization steps and periodically, only every few steps, nowcast (predict future) parameters. We improve this approach by Neuron interaction and Nowcasting (NiNo) networks. NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters by learning in a supervised way from a set of training trajectories over multiple tasks. We show that in some networks, such as Transformers, neuron connectivity is non-trivial. By accurately modeling neuron connectivity, we allow NiNo to accelerate Adam training by up to 50% in vision and language tasks.

[LG-1] Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Link: https://arxiv.org/abs/2409.04431
Authors: Jason Ramapuram,Federico Danieli,Eeshan Dhekane,Floris Weers,Dan Busbridge,Pierre Ablin,Tatiana Likhomanenko,Jagrit Digani,Zijin Gu,Amitis Shidani,Russ Webb
Keywords: sigmoid attention, Attention, sigmoid, softmax attention, softmax
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.
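
To make the contrast concrete, the sketch below computes attention for a single query with softmax weights and with elementwise sigmoid weights. It is a plain-Python illustration, not FLASHSIGMOID or the authors' implementation. The abstract does not spell out how the sigmoid variant is normalized, so the `bias` argument is a placeholder assumption, and all function names are ours.

```python
import math


def softmax_weights(scores):
    """Softmax over scores: weights are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_weights(scores, bias=0.0):
    """Elementwise sigmoid: each weight lies in (0, 1) independently.

    `bias` is a hypothetical knob standing in for whatever
    normalization is applied; it is not taken from the abstract.
    """
    return [1.0 / (1.0 + math.exp(-(s + bias))) for s in scores]


def attend(query, keys, values, weight_fn):
    """Weighted sum of values using scaled dot-product scores."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = weight_fn(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
print(attend(query, keys, values, softmax_weights))
print(attend(query, keys, values, sigmoid_weights))
```

The key difference is that softmax weights compete (they must sum to one across keys), while sigmoid weights are computed per key and need not sum to one.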

[LG-2] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Link: https://arxiv.org/abs/2409.04429
Authors: Yecheng Wu,Zhuoyang Zhang,Junyu Chen,Haotian Tang,Dacheng Li,Yunhao Fang,Ligeng Zhu,Enze Xie,Hongxu Yin,Li Yi,Song Han,Yao Lu
Keywords: integrates Video, Language understanding, visual language understanding, Video, Unified foundation model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 8 tables

Abstract:VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and the finding that autoregressive image generation can achieve quality similar to that of diffusion models given a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.

[LG-3] Hybrid Spiking Neural Networks for Low-Power Intra-Cortical Brain-Machine Interfaces

Link: https://arxiv.org/abs/2409.04428
Authors: Alexandru Vasilache,Jann Krausse,Klaus Knobloch,Juergen Becker
Keywords: Intra-cortical brain-machine interfaces, perform daily activities, Intra-cortical brain-machine, brain-machine interfaces, daily activities
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: This work has been accepted at the 2024 IEEE Biomedical Circuits and Systems Conference

Abstract:Intra-cortical brain-machine interfaces (iBMIs) have the potential to dramatically improve the lives of people with paraplegia by restoring their ability to perform daily activities. However, current iBMIs suffer from scalability and mobility limitations due to bulky hardware and wiring. Wireless iBMIs offer a solution but are constrained by a limited data rate. To overcome this challenge, we are investigating hybrid spiking neural networks for embedded neural decoding in wireless iBMIs. The networks consist of a temporal convolution-based compression followed by recurrent processing and a final interpolation back to the original sequence length. As recurrent units, we explore gated recurrent units (GRUs), leaky integrate-and-fire (LIF) neurons, and a combination of both - spiking GRUs (sGRUs) and analyze their differences in terms of accuracy, footprint, and activation sparsity. To that end, we train decoders on the “Nonhuman Primate Reaching with Multichannel Sensorimotor Cortex Electrophysiology” dataset and evaluate it using the NeuroBench framework, targeting both tracks of the IEEE BioCAS Grand Challenge on Neural Decoding. Our approach achieves high accuracy in predicting velocities of primate reaching movements from multichannel primary motor cortex recordings while maintaining a low number of synaptic operations, surpassing the current baseline models in the NeuroBench framework. This work highlights the potential of hybrid neural networks to facilitate wireless iBMIs with high decoding precision and a substantial increase in the number of monitored neurons, paving the way toward more advanced neuroprosthetic technologies.
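
As background for the leaky integrate-and-fire (LIF) units and the activation sparsity mentioned above, here is a minimal discrete-time LIF sketch. The decay factor `beta`, the threshold, and the reset-to-zero rule are common textbook choices assumed for illustration; the abstract does not state the exact neuron model used in the paper.

```python
def lif_simulate(inputs, beta=0.9, threshold=1.0):
    """Simulate one discrete-time LIF neuron and return its 0/1 spikes.

    The membrane potential decays by `beta` each step, integrates the
    input current, and is reset to zero after a spike (an assumption).
    """
    v = 0.0
    spikes = []
    for current in inputs:
        v = beta * v + current
        if v >= threshold:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes


def activation_sparsity(spikes):
    """Fraction of time steps without a spike."""
    return 1.0 - sum(spikes) / len(spikes)


spikes = lif_simulate([0.6, 0.6, 0.6], beta=0.5, threshold=1.0)
print(spikes, activation_sparsity(spikes))
```

Because the neuron emits a binary event only when the threshold is crossed, most time steps produce no activity, which is the property that keeps the number of synaptic operations low.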

[LG-4] RLPF: Reinforcement Learning from Prediction Feedback for User Summarization with LLMs

Link: https://arxiv.org/abs/2409.04421
Authors: Jiaxing Wu,Lin Ning,Luyang Liu,Harrison Lee,Neo Wu,Chao Wang,Sushant Prakash,Shawn O’Banion,Bradley Green,Jun Xie
Keywords: Large Language Models, employ Large Language, Language Models, Large Language, systems employ Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:LLM-powered personalization agent systems employ Large Language Models (LLMs) to predict users’ behavior from their past activities. However, their effectiveness often hinges on the ability to effectively leverage extensive, long user historical data due to its inherent noise and length of such data. Existing pretrained LLMs may generate summaries that are concise but lack the necessary context for downstream tasks, hindering their utility in personalization systems. To address these challenges, we introduce Reinforcement Learning from Prediction Feedback (RLPF). RLPF fine-tunes LLMs to generate concise, human-readable user summaries that are optimized for downstream task performance. By maximizing the usefulness of the generated summaries, RLPF effectively distills extensive user history data while preserving essential information for downstream tasks. Our empirical evaluation demonstrates significant improvements in both extrinsic downstream task utility and intrinsic summary quality, surpassing baseline methods by up to 22% on downstream task performance and achieving an up to 84.59% win rate on Factuality, Abstractiveness, and Readability. RLPF also achieves a remarkable 74% reduction in context length while improving performance on 16 out of 19 unseen tasks and/or datasets, showcasing its generalizability. This approach offers a promising solution for enhancing LLM personalization by effectively transforming long, noisy user histories into informative and human-readable representations.

[LG-5] Approximating Metric Magnitude of Point Sets

Link: https://arxiv.org/abs/2409.04411
Authors: Rayna Andreeva,James Ward,Primoz Skraba,Jie Gao,Rik Sarkar
Keywords: desirable geometric properties, geometric properties, Metric magnitude, point clouds, desirable geometric
Subjects: Machine Learning (cs.LG); Metric Geometry (math.MG)
Comments:

Abstract:Metric magnitude is a measure of the “size” of point clouds with many desirable geometric properties. It has been adapted to various mathematical contexts and recent work suggests that it can enhance machine learning and optimization algorithms. But its usability is limited due to the computational cost when the dataset is large or when the computation must be carried out repeatedly (e.g. in model training). In this paper, we study the magnitude computation problem, and show efficient ways of approximating it. We show that it can be cast as a convex optimization problem, but not as a submodular optimization. The paper describes two new algorithms - an iterative approximation algorithm that converges fast and is accurate, and a subset selection method that makes the computation even faster. It has been previously proposed that magnitude of model sequences generated during stochastic gradient descent is correlated to generalization gap. Extension of this result using our more scalable algorithms shows that longer sequences in fact bear higher correlations. We also describe new applications of magnitude in machine learning - as an effective regularizer for neural network training, and as a novel clustering criterion.
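
For context, the standard definition of the magnitude of a finite point set (not restated in the abstract) uses the similarity matrix Z with entries exp(-t * d(x_i, x_j)): one solves Z w = 1 and sums the weights w. The sketch below performs this exact computation with a small Gaussian elimination, which is the costly baseline that motivates approximation; it is not one of the paper's two new algorithms. The scale parameter `t` and the helper names are our assumptions.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def magnitude(points, t=1.0):
    """Exact magnitude of distinct points under Euclidean distance."""
    z = [[math.exp(-t * math.dist(p, q)) for q in points] for p in points]
    weights = solve(z, [1.0] * len(points))
    return sum(weights)


print(magnitude([(0.0,), (1.0,)]))
```

For two points at distance d the result has the closed form 2 / (1 + exp(-d)), which approaches 1 when the points nearly coincide and 2 when they are far apart, matching the intuition of an "effective number of points".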

[LG-6] Exploiting the Data Gap: Utilizing Non-ignorable Missingness to Manipulate Model Learning

Link: https://arxiv.org/abs/2409.04407
Authors: Deniz Koyuncu,Alex Gittens,Bülent Yener,Moti Yung
Keywords: non-ignorable missingness mechanisms, effective remediation depends, missingness mechanism, adversarial missingness mechanism, underlying missingness mechanism
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Missing data is commonly encountered in practice, and when the missingness is non-ignorable, effective remediation depends on knowledge of the missingness mechanism. Learning the underlying missingness mechanism from the data is not possible in general, so adversaries can exploit this fact by maliciously engineering non-ignorable missingness mechanisms. Such Adversarial Missingness (AM) attacks have only recently been motivated and introduced, and then successfully tailored to mislead causal structure learning algorithms into hiding specific cause-and-effect relationships. However, existing AM attacks assume the modeler (victim) uses full-information maximum likelihood methods to handle the missing data, and are of limited applicability when the modeler uses different remediation strategies. In this work we focus on associational learning in the context of AM attacks. We consider (i) complete case analysis, (ii) mean imputation, and (iii) regression-based imputation as alternative strategies used by the modeler. Instead of combinatorially searching for missing entries, we propose a novel probabilistic approximation by deriving the asymptotic forms of these methods used for handling the missing entries. We then formulate the learning of the adversarial missingness mechanism as a bi-level optimization problem. Experiments on generalized linear models show that AM attacks can be used to change the p-values of features from significant to insignificant in real datasets, such as the California-housing dataset, while using relatively moderate amounts of missingness (20%). Additionally, we assess the robustness of our attacks against defense strategies based on data valuation.
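
To illustrate why non-ignorable missingness matters for two of the remediation strategies named above, the toy sketch below hides values based on the values themselves and then applies complete case analysis and mean imputation. The data and the thresholding rule are invented for illustration and are far simpler than the learned adversarial mechanisms in the paper.

```python
from statistics import mean


def complete_case_mean(values):
    """Mean over observed entries only (missing entries are None)."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def mean_impute(values):
    """Replace each missing entry with the mean of the observed ones."""
    fill = complete_case_mean(values)
    return [fill if v is None else v for v in values]


# Hypothetical data: the true values are 1..10, and the missingness
# mechanism hides every value above 7 (it depends on the value itself).
full = list(range(1, 11))
masked = [v if v <= 7 else None for v in full]

print(mean(full))                  # true mean
print(complete_case_mean(masked))  # complete case estimate
print(mean(mean_impute(masked)))   # estimate after mean imputation
```

Both strategies return 4 instead of the true mean 5.5, because neither can see that the hidden entries were systematically the large ones.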

[LG-7] Gaussian-Mixture-Model Q-Functions for Reinforcement Learning by Riemannian Optimization

Link: https://arxiv.org/abs/2409.04374
Authors: Minh Vu,Konstantinos Slavakis
Keywords: Gaussian-mixture models, Q-function losses, reinforcement learning, role for Gaussian-mixture, Q-function approximators
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:This paper establishes a novel role for Gaussian-mixture models (GMMs) as functional approximators of Q-function losses in reinforcement learning (RL). Unlike the existing RL literature, where GMMs play their typical role as estimates of probability density functions, GMMs here approximate Q-function losses. The new Q-function approximators, coined GMM-QFs, are incorporated in Bellman residuals to promote a Riemannian-optimization task as a novel policy-evaluation step in standard policy-iteration schemes. The paper demonstrates how the hyperparameters (means and covariance matrices) of the Gaussian kernels are learned from the data, thus opening the door of RL to the powerful toolbox of Riemannian optimization. Numerical tests show that with no use of training data, the proposed design outperforms state-of-the-art methods, even deep Q-networks which use training data, on benchmark RL tasks.

[LG-8] Evaluating Fairness in Transaction Fraud Models: Fairness Metrics, Bias Audits, and Challenges

Link: https://arxiv.org/abs/2409.04373
Authors: Parameswaran Kamalaruban,Yulu Pi,Stuart Burrell,Eleanor Drage,Piotr Skalski,Jason Wong,David Sutton
Keywords: fraud detection models, Ensuring fairness, fraud detection, fraud, biased decision-making
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Ensuring fairness in transaction fraud detection models is vital due to the potential harms and legal implications of biased decision-making. Despite extensive research on algorithmic fairness, there is a notable gap in the study of bias in fraud detection models, mainly due to the field’s unique challenges. These challenges include the need for fairness metrics that account for fraud data’s imbalanced nature and the tradeoff between fraud protection and service quality. To address this gap, we present a comprehensive fairness evaluation of transaction fraud models using public synthetic datasets, marking the first algorithmic bias audit in this domain. Our findings reveal three critical insights: (1) Certain fairness metrics expose significant bias only after normalization, highlighting the impact of class imbalance. (2) Bias is significant in both service quality-related parity metrics and fraud protection-related parity metrics. (3) The fairness through unawareness approach, which involved removing sensitive attributes such as gender, does not improve bias mitigation within these datasets, likely due to the presence of correlated proxies. We also discuss socio-technical fairness-related challenges in transaction fraud models. These insights underscore the need for a nuanced approach to fairness in fraud detection, balancing protection and service quality, and moving beyond simple bias mitigation strategies. Future work must focus on refining fairness metrics and developing methods tailored to the unique complexities of the transaction fraud domain.

[LG-9] Provable Hyperparameter Tuning for Structured Pfaffian Settings

Link: https://arxiv.org/abs/2409.04367
Authors: Maria-Florina Balcan,Anh Tuan Nguyen,Dravyansh Sharma
Keywords: Data-driven algorithm design, specific application domains, Data-driven algorithm, achieving better performance, algorithm design
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Data-driven algorithm design automatically adapts algorithms to specific application domains, achieving better performance. In the context of parameterized algorithms, this approach involves tuning the algorithm parameters using problem instances drawn from the problem distribution of the target application domain. While empirical evidence supports the effectiveness of data-driven algorithm design, providing theoretical guarantees for several parameterized families remains challenging. This is due to the intricate behaviors of their corresponding utility functions, which typically admit piece-wise and discontinuity structures. In this work, we present refined frameworks for providing learning guarantees for parameterized data-driven algorithm design problems in both distributional and online learning settings. For the distributional learning setting, we introduce the Pfaffian GJ framework, an extension of the classical GJ framework, capable of providing learning guarantees for function classes for which the computation involves Pfaffian functions. Unlike the GJ framework, which is limited to function classes with computation characterized by rational functions, our proposed framework can deal with function classes involving Pfaffian functions, which are much more general and widely applicable. We then show that for many parameterized algorithms of interest, their utility function possesses a refined piece-wise structure, which automatically translates to learning guarantees using our proposed framework. For the online learning setting, we provide a new tool for verifying dispersion property of a sequence of loss functions. This sufficient condition allows no-regret learning for sequences of piece-wise structured loss functions where the piece-wise structure involves Pfaffian transition boundaries.

[LG-10] A naive aggregation algorithm for improving generalization in a class of learning problems

Link: https://arxiv.org/abs/2409.04352
Authors: Getachew K Befekadu
Keywords: expert advice setting, sequential decision-making problem, naive aggregation algorithm, model validation, advice setting
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: Brief paper, with 7 pages, 1 figure

Abstract:In this brief paper, we present a naive aggregation algorithm for a typical learning problem with expert advice setting, in which the task of improving generalization, i.e., model validation, is embedded in the learning process as a sequential decision-making problem. In particular, we consider a class of learning problems of point estimation for modeling high-dimensional nonlinear functions, where a group of experts update their parameter estimates using the discrete-time version of gradient systems, with a small additive noise term, guided by the corresponding subsample datasets obtained from the original dataset. Here, our main objective is to provide conditions under which such an algorithm will sequentially determine a set of mixing distribution strategies used for aggregating the experts’ estimates that ultimately lead to an optimal parameter estimate, i.e., a consensus solution for all experts, which is better than any individual expert’s estimate in terms of improved generalization or learning performance. Finally, as part of this work, we present some numerical results for a typical case of a nonlinear regression problem.

[LG-11] AGR: Age Group fairness Reward for Bias Mitigation in LLMs

Link: https://arxiv.org/abs/2409.04340
Authors: Shuirong Cao, Ruoxi Cheng, Zhiqiang Wang
Keywords: exhibit age biases, resulting in unequal, unequal treatment, treatment of individuals, age
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The first two authors contributed equally to this work. Corresponding author: Zhiqiang Wang. Acknowledgment: we would like to thank the State Key Laboratory of New Computer Software Technologies at Nanjing University for computing resources support

Abstract:LLMs can exhibit age biases, resulting in unequal treatment of individuals across age groups. While much research has addressed racial and gender biases, age bias remains little explored. The scarcity of instruction-tuning and preference datasets for age bias hampers its detection and measurement, and existing fine-tuning methods seldom address age-related fairness. In this paper, we construct age bias preference datasets and instruction-tuning datasets for RLHF. We introduce AGR, an age fairness reward to reduce differences in the response quality of LLMs across different age groups. Extensive experiments demonstrate that this reward significantly improves response accuracy and reduces performance disparities across age groups. Our source code and datasets are available anonymously at https://anonymous.4open.science/r/FairRLHF-D445/readme.md.

[LG-12] A high-accuracy multi-model mixing retrosynthetic method

Link: https://arxiv.org/abs/2409.04335
Authors: Shang Xiang, Lin Yao, Zhen Wang, Qifan Yu, Wentan Liu, Wentao Guo, Guolin Ke
Keywords: computer-aided synthesis planning, achieving significant progress, product prediction model, synthesis planning, recent years
Categories: Machine Learning (cs.LG)

Abstract:The field of computer-aided synthesis planning (CASP) has seen rapid advancements in recent years, achieving significant progress across various algorithmic benchmarks. However, chemists often encounter numerous infeasible reactions when using CASP in practice. This article delves into common errors associated with CASP and introduces a product prediction model aimed at enhancing the accuracy of single-step models. While the product prediction model reduces the number of single-step reactions, it integrates multiple single-step models to maintain the overall reaction count and increase reaction diversity. Based on manual analysis and large-scale testing, the product prediction model, combined with the multi-model ensemble approach, has been proven to offer higher feasibility and greater diversity.

[LG-13] Amortized Bayesian Workflow (Extended Abstract)

Link: https://arxiv.org/abs/2409.04332
Authors: Marvin Schmitt, Chengkun Li, Aki Vehtari, Luigi Acerbi, Paul-Christian Bürkner, Stefan T. Radev
Keywords: Bayesian inference, faces a trade-off, trade-off between computational, computational speed, Bayesian
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Extended Abstract

Abstract:Bayesian inference often faces a trade-off between computational speed and sampling accuracy. We propose an adaptive workflow that integrates rapid amortized inference with gold-standard MCMC techniques to achieve both speed and accuracy when performing inference on many observed datasets. Our approach uses principled diagnostics to guide the choice of inference method for each dataset, moving along the Pareto front from fast amortized sampling to slower but guaranteed-accurate MCMC when necessary. By reusing computations across steps, our workflow creates synergies between amortized and MCMC-based inference. We demonstrate the effectiveness of this integrated approach on a generalized extreme value task with 1000 observed data sets, showing 90x time efficiency gains while maintaining high posterior quality.

[LG-14] Active learning for regression in engineering populations: A risk-informed approach

Link: https://arxiv.org/abs/2409.04328
Authors: Daniel R. Clarkson, Lawrence A. Bull, Chandula T. Wickramarachchi, Elizabeth J. Cross, Timothy J. Rogers, Keith Worden, Nikolaos Dervilis, Aidan J. Hughes
Keywords: involves learning mappings, fundamental prediction task, prediction task common, continuous variables, fundamental prediction
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 12 figures, 3 tables, submitted to Data-Centric Engineering

Abstract:Regression is a fundamental prediction task common in data-centric engineering applications that involves learning mappings between continuous variables. In many engineering applications (e.g., structural health monitoring), feature-label pairs used to learn such mappings are of limited availability, which hinders the effectiveness of traditional supervised machine learning approaches. The current paper proposes a methodology for overcoming the issue of data scarcity by combining active learning with hierarchical Bayesian modelling. Active learning is an approach for preferentially acquiring feature-label pairs in a resource-efficient manner. In particular, the current work adopts a risk-informed approach that leverages contextual information associated with regression-based engineering decision-making tasks (e.g., inspection and maintenance). Hierarchical Bayesian modelling allows multiple related regression tasks to be learned over a population, capturing local and global effects. The information sharing facilitated by this modelling approach means that information acquired for one engineering system can improve predictive performance across the population. The proposed methodology is demonstrated using an experimental case study. Specifically, multiple regressions are performed over a population of machining tools, where the quantity of interest is the surface roughness of the workpieces. An inspection and maintenance decision process is defined using these regression tasks, which is in turn used to construct the active-learning algorithm. The novel methodology proposed is benchmarked against an uninformed approach to label acquisition and independent modelling of the regression tasks. It is shown that the proposed approach has superior performance in terms of expected cost, maintaining predictive performance while reducing the number of inspections required.

[LG-15] Faster Sampling from Log-Concave Densities over Polytopes via Efficient Linear Solvers ICLR2024

Link: https://arxiv.org/abs/2409.04320
Authors: Oren Mangoubi, Nisheeth K. Vishnoi
Keywords: Markov chain step, matrix multiplication constant, step requires computing, chain step requires, Markov chain
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: The conference version of this paper appears in ICLR 2024

Abstract:We consider the problem of sampling from a log-concave distribution $\pi(\theta) \propto e^{-f(\theta)}$ constrained to a polytope $K := \{\theta \in \mathbb{R}^d : A\theta \leq b\}$, where $A \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$. The fastest-known algorithm \cite{mangoubi2022faster} for the setting when $f$ is $O(1)$-Lipschitz or $O(1)$-smooth runs in roughly $O(md \times md^{\omega-1})$ arithmetic operations, where the $md^{\omega-1}$ term arises because each Markov chain step requires computing a matrix inversion and determinant (here $\omega \approx 2.37$ is the matrix multiplication constant). We present a nearly-optimal implementation of this Markov chain with per-step complexity which is roughly the number of non-zero entries of $A$ while the number of Markov chain steps remains the same. The key technical ingredients are 1) to show that the matrices that arise in this Dikin walk change slowly, 2) to deploy efficient linear solvers that can leverage this slow change to speed up matrix inversion by using information computed in previous steps, and 3) to speed up the computation of the determinantal term in the Metropolis filter step via a randomized Taylor series-based estimator.

[LG-16] Enhancing Uncertainty Quantification in Drug Discovery with Censored Regression Labels

Link: https://arxiv.org/abs/2409.04313
Authors: Emma Svensson, Hannah Rosa Friesacher, Susanne Winiwarter, Lewis Mervin, Adam Arany, Ola Engkvist
Keywords: early stages, censored labels, drug discovery, censored, labels
Categories: Machine Learning (cs.LG)

Abstract:In the early stages of drug discovery, decisions regarding which experiments to pursue can be influenced by computational models. These decisions are critical due to the time-consuming and expensive nature of the experiments. Therefore, it is becoming essential to accurately quantify the uncertainty in machine learning predictions, such that resources can be used optimally and trust in the models improves. While computational methods for drug discovery often suffer from limited data and sparse experimental observations, additional information can exist in the form of censored labels that provide thresholds rather than precise values of observations. However, the standard approaches that quantify uncertainty in machine learning cannot fully utilize censored labels. In this work, we adapt ensemble-based, Bayesian, and Gaussian models with tools to learn from censored labels by using the Tobit model from survival analysis. Our results demonstrate that despite the partial information available in censored labels, they are essential to accurately and reliably model the real pharmaceutical setting.
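To make the idea of learning from censored labels concrete, here is a minimal sketch of a Tobit-style negative log-likelihood for a single Gaussian prediction. It is an illustration under stated assumptions, not the authors' implementation: the function names (`tobit_nll`, `normal_cdf`, `normal_logpdf`) and the `censoring` encoding (0, +1, -1) are hypothetical, and the abstract does not specify how the loss is wired into the ensemble, Bayesian, or Gaussian models.

```python
import math


def normal_logpdf(y: float, mu: float, sigma: float) -> float:
    """Log-density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_nll(y: float, mu: float, sigma: float, censoring: int = 0) -> float:
    """Negative log-likelihood of one label under a Gaussian prediction.

    censoring (hypothetical encoding):
       0 -> y is an exact observation
      +1 -> right-censored: the true value is known only to exceed y
      -1 -> left-censored: the true value is known only to lie below y
    """
    if censoring == 0:
        return -normal_logpdf(y, mu, sigma)
    z = (y - mu) / sigma
    # P(true > y) = Phi(-z) and P(true < y) = Phi(z).
    prob = normal_cdf(-z) if censoring > 0 else normal_cdf(z)
    return -math.log(max(prob, 1e-300))  # clamp to avoid log(0)
```

A censored label contributes the probability mass on the correct side of the threshold rather than a density at a point, so a prediction far on the wrong side of the threshold is penalized heavily while one on the right side costs almost nothing.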

[LG-17] A Unified Approach to Inferring Chemical Compounds with the Desired Aqueous Solubility

Link: https://arxiv.org/abs/2409.04301
Authors: Muniba Batool, Naveed Ahmed Azam, Jianshen Zhu, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu
Keywords: key physiochemical property, Aqueous solubility, material design, key physiochemical, physiochemical property
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)

Abstract:Aqueous solubility (AS) is a key physiochemical property that plays a crucial role in drug discovery and material design. We report a novel unified approach to predict and infer chemical compounds with the desired AS based on simple deterministic graph-theoretic descriptors, multiple linear regression (MLR) and mixed integer linear programming (MILP). Descriptors selected by a forward stepwise procedure enabled the simplest regression model, MLR, to achieve good prediction accuracy compared to the existing approaches, with accuracy in the range [0.7191, 0.9377] across 29 diverse datasets. By simulating these descriptors and learning models as MILPs, we inferred mathematically exact and optimal compounds with the desired AS, prescribed structures, and up to 50 non-hydrogen atoms in a reasonable time, ranging from 6 to 1204 seconds. These findings indicate a strong correlation between the simple graph-theoretic descriptors and the AS of compounds, potentially leading to a deeper understanding of their AS without relying on widely used complicated chemical descriptors and complex machine learning models that are computationally expensive, and therefore difficult to use for inference. An implementation of the proposed approach is available at this https URL.

[LG-18] CoxKAN: Kolmogorov-Arnold Networks for Interpretable High-Performance Survival Analysis

Link: https://arxiv.org/abs/2409.04290
Authors: William Knottenbelt, Zeyu Gao, Rebecca Wray, Woody Zhidong Zhang, Jiashuai Liu, Mireia Crispin-Ortuzar
Keywords: specific event occurs, branch of statistics, modeling the time, specific event, event occurs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Survival analysis is a branch of statistics used for modeling the time until a specific event occurs and is widely used in medicine, engineering, finance, and many other fields. When choosing survival models, there is typically a trade-off between performance and interpretability, where the highest performance is achieved by black-box models based on deep learning. This is a major problem in fields such as medicine where practitioners are reluctant to blindly trust black-box models to make important patient decisions. Kolmogorov-Arnold Networks (KANs) were recently proposed as an interpretable and accurate alternative to multi-layer perceptrons (MLPs). We introduce CoxKAN, a Cox proportional hazards Kolmogorov-Arnold Network for interpretable, high-performance survival analysis. We evaluate the proposed CoxKAN on 4 synthetic datasets and 9 real medical datasets. The synthetic experiments demonstrate that CoxKAN accurately recovers interpretable symbolic formulae for the hazard function, and effectively performs automatic feature selection. Evaluation on the 9 real datasets shows that CoxKAN consistently outperforms the Cox proportional hazards model and achieves performance that is superior or comparable to that of tuned MLPs. Furthermore, we find that CoxKAN identifies complex interactions between predictor variables that would be extremely difficult to recognise using existing survival methods, and automatically finds symbolic formulae which uncover the precise effect of important biomarkers on patient risk.

[LG-19] AttentionX: Exploiting Consensus Discrepancy In Attention from A Distributed Optimization Perspective

Link: https://arxiv.org/abs/2409.04275
Authors: Guoqiang Zhang, Richard Heusdens
Keywords: distributed optimization perspective, distributed optimization, distributed optimization algorithm, popular distributed optimization, distributed optimization problems
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we extend the standard Attention in transformer by exploiting the consensus discrepancy from a distributed optimization perspective, referred to as AttentionX. It is noted that the primal-dual method of multipliers (PDMM) \cite{Zhang16PDMM} is designed to iteratively solve a broad class of distributed optimization problems over a peer-to-peer (P2P) network, where neighbouring nodes gradually reach consensus as specified by predefined linear edge-constraints in the optimization process. In particular, at each iteration of PDMM, each node in a network first performs information-gathering from neighbours and then performs local information-fusion. From a high-level point of view, the KQ-softmax-based weighted summation of V-representations in Attention corresponds to information-gathering from neighbours while the feature-processing via the feed-forward network (FFN) in transformer corresponds to local information fusion. PDMM exploits the Lagrangian multipliers to capture the historical consensus discrepancy in the form of residual errors of the linear edge-constraints, which plays a crucial role in the algorithm's convergence. Inspired by PDMM, we propose AttentionX to incorporate the consensus discrepancy in the output update-expression of the standard Attention. The consensus discrepancy in AttentionX refers to the difference between the weighted summation of V-representations and the scaled V-representations themselves. Experiments on ViT and nanoGPT show promising performance.
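The consensus discrepancy described above can be sketched for a single attention head as follows. This is only a rough illustration: the abstract does not give the exact update expression, so the scaling factor `gamma`, the function name `consensus_discrepancy`, and the use of the raw discrepancy as the returned value are all assumptions made for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def consensus_discrepancy(Q: Matrix, K: Matrix, V: Matrix, gamma: float = 1.0) -> Matrix:
    """For each token i, return sum_j a_ij * v_j - gamma * v_i.

    a_ij are standard scaled dot-product attention weights; gamma is a
    hypothetical scaling of the token's own V-representation.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Information gathering: attention-weighted sum of V-representations.
        gathered = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        # Discrepancy between the gathered value and the token's own scaled value.
        out.append([g - gamma * vc for g, vc in zip(gathered, V[i])])
    return out
```

With `gamma = 1.0`, the discrepancy vanishes when all tokens already hold the same V-representation, which mirrors the notion of consensus in the distributed-optimization analogy; with `gamma = 0.0` the function reduces to the standard attention output.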

[LG-20] Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices

Link: https://arxiv.org/abs/2409.04249
Authors: Xueyuan Han, Zinuo Cai, Yichu Zhang, Chongxin Fan, Junhan Liu, Ruhui Ma, Rajkumar Buyya
Keywords: achieved numerous success, recent years, achieved numerous, numerous success, success in recent
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the 42nd IEEE International Conference on Computer Design (ICCD 2024)

Abstract:The application of Transformer-based large models has achieved numerous successes in recent years. However, the exponential growth in the parameters of large models introduces a formidable memory challenge for edge deployment. Prior works addressing this challenge mainly focus on optimizing the model structure and adopting memory swapping methods. However, the former reduces the inference accuracy, and the latter raises the inference latency. This paper introduces PIPELOAD, a novel memory-efficient pipeline execution mechanism. It reduces memory usage by incorporating dynamic memory management and minimizes inference latency by employing parallel model loading. Based on the PIPELOAD mechanism, we present Hermes, a framework optimized for large model inference on edge devices. We evaluate Hermes on Transformer-based models of different sizes. Our experiments illustrate that Hermes achieves up to a 4.24× increase in inference speed and 86.7% lower memory consumption than the state-of-the-art pipeline mechanism for BERT and ViT models, and a 2.58× increase in inference speed and 90.3% lower memory consumption for GPT-style models.

[LG-21] WarpAdam: A new Adam optimizer based on Meta-Learning approach

Link: https://arxiv.org/abs/2409.04244
Authors: Chengxi Pan, Junshang Chen, Jingrui Ye
Keywords: Adam optimizer, Adam, algorithms is crucial, optimizer, Meta Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Optimal selection of optimization algorithms is crucial for training deep learning models. The Adam optimizer has gained significant attention due to its efficiency and wide applicability. However, to enhance the adaptability of optimizers across diverse datasets, we propose an innovative optimization strategy by integrating the ‘warped gradient descent’ concept from Meta Learning into the Adam optimizer. In the conventional Adam optimizer, gradients are utilized to compute estimates of gradient mean and variance, subsequently updating model parameters. Our approach introduces a learnable distortion matrix, denoted as P, which is employed for linearly transforming gradients. This transformation slightly adjusts gradients during each iteration, enabling the optimizer to better adapt to distinct dataset characteristics. By learning an appropriate distortion matrix P, our method aims to adaptively adjust gradient information across different data distributions, thereby enhancing optimization performance. Our research showcases the potential of this novel approach through theoretical insights and empirical evaluations. Experimental results across various tasks and datasets validate the superiority of our optimizer that integrates the ‘warped gradient descent’ concept in terms of adaptability. Furthermore, we explore effective strategies for training the adaptation matrix P and identify scenarios where this method can yield optimal results. In summary, this study introduces an innovative approach that merges the ‘warped gradient descent’ concept from Meta Learning with the Adam optimizer. By introducing a learnable distortion matrix P within the optimizer, we aim to enhance the model’s generalization capability across diverse data distributions, thus opening up new possibilities in the field of deep learning optimization.
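As a rough illustration of the mechanism described above, the sketch below applies a matrix P to the raw gradient and then performs a standard Adam update on the transformed gradient. It assumes P is held fixed; the abstract does not say how P is trained, so that part is omitted. The names `warp_adam_step`, `init_state`, and `matvec` are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(P: Matrix, g: Vector) -> Vector:
    """Plain matrix-vector product P @ g."""
    return [sum(p * x for p, x in zip(row, g)) for row in P]


def init_state(n: int) -> Dict:
    """Adam state: step counter plus first and second moment estimates."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def warp_adam_step(theta: Vector, grad: Vector, P: Matrix, state: Dict,
                   lr: float = 1e-3, beta1: float = 0.9,
                   beta2: float = 0.999, eps: float = 1e-8) -> Vector:
    """One Adam update applied to the linearly transformed gradient P @ grad."""
    warped = matvec(P, grad)  # assumption: P is given, not learned here
    state["t"] += 1
    t = state["t"]
    state["m"] = [beta1 * m + (1 - beta1) * g for m, g in zip(state["m"], warped)]
    state["v"] = [beta2 * v + (1 - beta2) * g * g for v, g in zip(state["v"], warped)]
    # Bias-corrected moment estimates, as in standard Adam.
    m_hat = [m / (1 - beta1 ** t) for m in state["m"]]
    v_hat = [v / (1 - beta2 ** t) for v in state["v"]]
    return [th - lr * mh / (math.sqrt(vh) + eps)
            for th, mh, vh in zip(theta, m_hat, v_hat)]
```

With P set to the identity matrix, the step reduces to ordinary Adam, which gives a convenient baseline for checking any learned P against.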

[LG-22] Unmasking Covert Intrusions: Detection of Fault-Masking Cyberattacks on Differential Protection Systems

Link: https://arxiv.org/abs/2409.04242
Authors: Ahmad Mohammad Saber, Amr Youssef, Davor Svetinovic, Hatem Zeineldin, Ehab F. El-Saadany
Keywords: Current Differential Relays, Line Current Differential, high-speed relays progressively, Current Differential, Differential Relays
Categories: Systems and Control (eess.SY); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted to IEEE Transactions on Systems, Man, and Cybernetics: Systems. © 2024 IEEE

Abstract:Line Current Differential Relays (LCDRs) are high-speed relays progressively used to protect critical transmission lines. However, LCDRs are vulnerable to cyberattacks. Fault-Masking Attacks (FMAs) are stealthy cyberattacks performed by manipulating the remote measurements of the targeted LCDR to disguise faults on the protected line. Hence, they remain undetected by this LCDR. In this paper, we propose a two-module framework to detect FMAs. The first module is a Mismatch Index (MI) developed from the protected transmission line’s equivalent physical model. The MI is triggered only if there is a significant mismatch in the LCDR’s local and remote measurements while the LCDR itself is untriggered, which indicates an FMA. After the MI is triggered, the second module, a neural network-based classifier, promptly confirms that the triggering event is a physical fault that lies on the line protected by the LCDR before declaring the occurrence of an FMA. The proposed framework is tested using the IEEE 39-bus benchmark system. Our simulation results confirm that the proposed framework can accurately detect FMAs on LCDRs and is not affected by normal system disturbances, variations, or measurement noise. Our experimental results using OPAL-RT’s real-time simulator confirm the proposed solution’s real-time performance capability.

[LG-23] Calibration of Network Confidence for Unsupervised Domain Adaptation Using Estimated Accuracy

Link: https://arxiv.org/abs/2409.04241
Authors: Coby Penso, Jacob Goldberger
Keywords: target domain, target domain makes, calibrating network confidence, target, domain
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Abstract:This study addresses the problem of calibrating network confidence while adapting a model that was originally trained on a source domain to a target domain using unlabeled samples from the target domain. The absence of labels from the target domain makes it impossible to directly calibrate the adapted network on the target domain. To tackle this challenge, we introduce a calibration procedure that relies on estimating the network’s accuracy on the target domain. The network accuracy is first computed on the labeled source data and then is modified to represent the actual accuracy of the model on the target domain. The proposed algorithm calibrates the prediction confidence directly in the target domain by minimizing the disparity between the estimated accuracy and the computed confidence. The experimental results show that our method significantly outperforms existing methods, which rely on importance weighting, across several standard datasets.
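One simple way to picture "minimizing the disparity between the estimated accuracy and the computed confidence" is to choose a softmax temperature so that the mean confidence on unlabeled target samples matches the estimated target accuracy. The sketch below does this by bisection. Temperature scaling is an assumption swapped in for illustration; the abstract does not name the calibration family or the optimizer, and `fit_temperature` and `mean_confidence` are hypothetical names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_confidence(batch: List[List[float]], temperature: float) -> float:
    """Average top-class probability after dividing logits by the temperature."""
    confs = [max(softmax([z / temperature for z in logits])) for logits in batch]
    return sum(confs) / len(confs)


def fit_temperature(batch: List[List[float]], estimated_accuracy: float,
                    lo: float = 0.05, hi: float = 20.0, iters: int = 60) -> float:
    """Bisection on the temperature so mean confidence matches the estimate.

    Mean confidence decreases monotonically as the temperature grows, so a
    simple bracketing search suffices when the target lies inside [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_confidence(batch, mid) > estimated_accuracy:
            lo = mid  # still over-confident: soften the distribution further
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

No target labels are needed here: the only supervision is the scalar accuracy estimate, which the paper derives from labeled source data and then adjusts for the target domain.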

[LG-24] Advancing Multi-Organ Disease Care: A Hierarchical Multi-Agent Reinforcement Learning Framework

Link: https://arxiv.org/abs/2409.04224
Authors: Daniel J. Tan, Qianyi Xu, Kay Choong See, Dilruk Perera, Mengling Feng
Keywords: significant challenges due, present significant challenges, diseases present significant, multiple organ systems, present significant
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Multi-organ diseases present significant challenges due to their simultaneous impact on multiple organ systems, necessitating complex and adaptive treatment strategies. Despite recent advancements in AI-powered healthcare decision support systems, existing solutions are limited to individual organ systems. They often ignore the intricate dependencies between organ systems and thereby fail to provide holistic treatment recommendations that are useful in practice. We propose a novel hierarchical multi-agent reinforcement learning (HMARL) framework to address these challenges. This framework uses dedicated agents for each organ system and models their dynamics through explicit inter-agent communication channels, enabling coordinated treatment strategies across organs. Furthermore, we introduce a dual-layer state representation technique to contextualize patient conditions at various hierarchical levels, enhancing the treatment accuracy and relevance. Through extensive qualitative and quantitative evaluations in managing sepsis (a complex multi-organ disease), our approach demonstrates its ability to learn effective treatment policies that significantly improve patient survival rates. This framework marks a substantial advancement in clinical decision support systems, pioneering a comprehensive approach for multi-organ treatment recommendations.

[LG-25] Fast Forwarding Low-Rank Training

Link: https://arxiv.org/abs/2409.04206
Authors: Adir Rahamim, Naomi Saphra, Sara Kangaslahti, Yonatan Belinkov
Keywords: pretrained Language Models, Fast Forward, finetuning pretrained Language, Parameter efficient finetuning, pretrained Language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Parameter efficient finetuning methods like low-rank adaptation (LoRA) aim to reduce the computational costs of finetuning pretrained Language Models (LMs). Enabled by these low-rank settings, we propose an even more efficient optimization strategy: Fast Forward, a simple and effective approach to accelerate large segments of training. In a Fast Forward stage, we repeat the most recent optimizer step until the loss stops improving on a tiny validation set. By alternating between regular optimization steps and Fast Forward stages, Fast Forward provides up to an 87% reduction in FLOPs and up to an 81% reduction in train time over standard SGD with Adam. We validate Fast Forward by finetuning various models on different tasks and demonstrate that it speeds up training without compromising model performance. Additionally, we analyze when and how to apply Fast Forward.
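The Fast Forward stage described above can be sketched on a toy problem: after a few regular optimizer steps, the most recent parameter update is re-applied until the validation loss stops improving. This is a simplified illustration, not the authors' code. Plain gradient descent stands in for Adam, no LoRA is involved, and the names `fast_forward` and `train` and all hyperparameters are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def fast_forward(weights: Vector, last_step: Vector,
                 val_loss: Callable[[Vector], float]) -> Vector:
    """Re-apply `last_step` while the validation loss keeps improving."""
    best = val_loss(weights)
    while True:
        candidate = [w + d for w, d in zip(weights, last_step)]
        loss = val_loss(candidate)
        if loss >= best:  # no improvement: keep the previous weights
            return weights
        weights, best = candidate, loss


def train(weights: Vector, grad: Callable[[Vector], Vector],
          val_loss: Callable[[Vector], float], lr: float = 0.05,
          rounds: int = 5, regular_steps: int = 3) -> Vector:
    """Alternate regular optimizer steps with Fast Forward stages."""
    step = [0.0] * len(weights)
    for _ in range(rounds):
        for _ in range(regular_steps):
            # Regular step (plain gradient descent, an assumption for brevity).
            step = [-lr * g for g in grad(weights)]
            weights = [w + s for w, s in zip(weights, step)]
        # Fast Forward stage: only loss evaluations, no gradient computations.
        weights = fast_forward(weights, step, val_loss)
    return weights
```

The savings come from the fact that each Fast Forward iteration needs only a forward pass on a tiny validation set instead of a full forward and backward pass on a training batch.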

[LG-26] Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models

Link: https://arxiv.org/abs/2409.04194
Authors: Malte Luttermann, Ralf Möller, Mattis Hartwig
Keywords: combine first-order logic, relational, provide a well-established, well-established formalism, formalism to combine
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted to the Proceedings of the 47th German Conference on Artificial Intelligence (KI 2024)

Abstract:Probabilistic relational models provide a well-established formalism to combine first-order logic and probabilistic models, thereby making it possible to represent relationships between objects in a relational domain. At the same time, the field of artificial intelligence requires increasingly large amounts of relational training data for various machine learning tasks. Collecting real-world data, however, is often challenging due to privacy concerns, data protection regulations, high costs, and so on. To mitigate these challenges, the generation of synthetic data is a promising approach. In this paper, we solve the problem of generating synthetic relational data via probabilistic relational models. In particular, we propose a fully-fledged pipeline to go from a relational database to a probabilistic relational model, which can then be used to sample new synthetic relational data points from its underlying probability distribution. As part of our proposed pipeline, we introduce a learning algorithm to construct a probabilistic relational model from a given relational database.

[LG-27] Reassessing the Validity of Spurious Correlations Benchmarks

Link: https://arxiv.org/abs/2409.04188
Authors: Samuel J. Bell, Diane Bouchacourt, Levent Sagun
Keywords: Neural networks, spurious correlations, networks can fail, numerous spurious correlations, spurious correlations benchmarks
Categories: Machine Learning (cs.LG)

Abstract:Neural networks can fail when the data contains spurious correlations. To understand this phenomenon, researchers have proposed numerous spurious correlations benchmarks upon which to evaluate mitigation methods. However, we observe that these benchmarks exhibit substantial disagreement, with the best methods on one benchmark performing poorly on another. We explore this disagreement, and examine benchmark validity by defining three desiderata that a benchmark should satisfy in order to meaningfully evaluate methods. Our results have implications for both benchmarks and mitigations: we find that certain benchmarks are not meaningful measures of method performance, and that several methods are not sufficiently robust for widespread use. We present a simple recipe for practitioners to choose methods using the most similar benchmark to their given problem.

[LG-28] Residual Stream Analysis with Multi-Layer SAEs

Link: https://arxiv.org/abs/2409.04185
Authors: Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
Keywords: Sparse autoencoders, approach to interpreting, interpreting the internal, internal representations, single SAE
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 16 pages, 12 figures

Abstract:Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, standard SAEs are trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer simultaneously. The residual stream is usually understood as preserving information across layers, so we expected to, and did, find individual SAE features that are active at multiple layers. Interestingly, while a single SAE feature is active at different layers for different prompts, for a single prompt, we find that a single feature is far more likely to be active at a single layer. For larger underlying models, we find that the cosine similarities between adjacent layers in the residual stream are higher, so we expect more features to be active at multiple layers. These results show that MLSAEs are a promising method to study information flow in transformers. We release our code to train and analyze MLSAEs at this https URL.

[LG-29] The Prevalence of Neural Collapse in Neural Multivariate Regression

Link: https://arxiv.org/abs/2409.04180
Authors: George Andriopoulos, Zixuan Dong, Li Guo, Zifan Zhao, Keith Ross
Keywords: last-layer feature vectors, Neural Collapse, exhibit Neural Collapse, feature vectors collapse, Neural Regression Collapse
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recently it has been observed that neural networks exhibit Neural Collapse (NC) during the final stage of training for the classification problem. We empirically show that multivariate regression, as employed in imitation learning and other applications, exhibits Neural Regression Collapse (NRC), a new form of neural collapse: (NRC1) The last-layer feature vectors collapse to the subspace spanned by the n principal components of the feature vectors, where n is the dimension of the targets (for univariate regression, n=1); (NRC2) The last-layer feature vectors also collapse to the subspace spanned by the last-layer weight vectors; (NRC3) The Gram matrix for the weight vectors converges to a specific functional form that depends on the covariance matrix of the targets. After empirically establishing the prevalence of (NRC1)-(NRC3) for a variety of datasets and network architectures, we provide an explanation of these phenomena by modeling the regression task in the context of the Unconstrained Feature Model (UFM), in which the last-layer feature vectors are treated as free variables when minimizing the loss function. We show that when the regularization parameters in the UFM are strictly positive, then (NRC1)-(NRC3) also emerge as solutions in the UFM optimization problem. We also show that if the regularization parameters are equal to zero, then there is no collapse. To our knowledge, this is the first empirical and theoretical study of neural collapse in the context of regression. This extension is significant not only because it broadens the applicability of neural collapse to a new category of problems but also because it suggests that the phenomenon of neural collapse could be a universal behavior in deep learning.

[LG-30] Towards Measuring Sell Side Outcomes in Buy Side Marketplace Experiments using In-Experiment Bipartite Graph KDD2024

链接: https://arxiv.org/abs/2409.04174
作者: Vaiva Pilkauskaitė,Jevgenij Gamper,Rasa Giniūnaitė,Agne Reklaitė
关键词-EN: online controlled bipartite, real marketplace setting, causal inference estimators, controlled bipartite graph, bipartite graph
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 5 pages, 3 figures, this work was presented at the KDD 2024 Conference Undergraduate Consortium

点击查看摘要

Abstract:In this study, we evaluate causal inference estimators for online controlled bipartite graph experiments in a real marketplace setting. Our novel contribution is constructing a bipartite graph using in-experiment data, rather than relying on prior knowledge or historical data, the common approach in the literature published to date. We build the bipartite graph from various interactions between buyers and sellers in the marketplace, establishing a novel research direction at the intersection of bipartite experiments and mediation analysis. This approach is crucial for modern marketplaces aiming to evaluate seller-side causal effects in buyer-side experiments, or vice versa. We demonstrate our method using historical buyer-side experiments conducted at Vinted, the largest second-hand marketplace in Europe with over 80M users.
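
As a rough illustration of building the bipartite graph from in-experiment data, the sketch below links buyers to sellers from interaction records and computes, for each seller, the share of connected buyers that were in the treatment group. The record format, the function names, and the unweighted "share of treated buyers" exposure measure are assumptions made for illustration; the abstract does not say which estimators or exposure definitions the authors use.

```python
from collections import defaultdict


def build_bipartite_graph(interactions):
    """Map each seller to the set of buyers that interacted with it.

    `interactions` is an iterable of (buyer_id, seller_id) pairs observed
    during the experiment (hypothetical record format).
    """
    seller_to_buyers = defaultdict(set)
    for buyer_id, seller_id in interactions:
        seller_to_buyers[seller_id].add(buyer_id)
    return dict(seller_to_buyers)


def seller_exposure(seller_to_buyers, treated_buyers):
    """Share of each seller's connected buyers that received treatment."""
    return {
        seller: len(buyers & treated_buyers) / len(buyers)
        for seller, buyers in seller_to_buyers.items()
    }


# Toy data: repeated interactions collapse into a single edge.
interactions = [("b1", "s1"), ("b2", "s1"), ("b3", "s2"), ("b1", "s1")]
graph = build_bipartite_graph(interactions)
exposure = seller_exposure(graph, treated_buyers={"b1", "b3"})
print(exposure)
```

A seller-side outcome could then be related to this exposure value, which is where the choice of causal estimator evaluated in the paper would come in.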

[LG-31] Can OpenSource beat ChatGPT? – A Comparative Study of Large Language Models for Text-to-Code Generation

链接: https://arxiv.org/abs/2409.04164
作者: Luis Mayer,Christian Heumann,Matthias Aßenmacher
关键词-EN: including software engineering, large language models, recent years, including software, software engineering
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Conference Paper accepted at the 9th SwissText Conference (2024)

点击查看摘要

Abstract:In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode to the models with the task of creating solutions in Python. Subsequently, the quality of the generated outputs is assessed using the testing functionalities of LeetCode. The results indicate large differences in performance between the investigated models. ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, encompassing a comparison of the differences concerning correct indentation and form of the generated code as well as an assignment of the incorrectly solved tasks to certain error categories, allows us to obtain a more nuanced picture of the results and potential for improvement. The results also show a clear pattern of increasingly incorrect code when the models are facing a lot of context in the form of longer prompts.

[LG-32] CUQ-GNN: Committee-based Graph Uncertainty Quantification using Posterior Networks KDD2024 ECML

链接: https://arxiv.org/abs/2409.04159
作者: Clemens Damke,Eyke Hüllermeier
关键词-EN: Graph Neural Networks, Graph Posterior Network, Neural Networks, Graph Neural, Quantification Graph Neural
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 17 pages, 4 figures, 1 table. Accepted at ECML PKDD 2024. arXiv admin note: substantial text overlap with arXiv:2406.04041

点击查看摘要

Abstract:In this work, we study the influence of domain-specific characteristics when defining a meaningful notion of predictive uncertainty on graph data. Previously, the so-called Graph Posterior Network (GPN) model has been proposed to quantify uncertainty in node classification tasks. Given a graph, it uses Normalizing Flows (NFs) to estimate class densities for each node independently and converts those densities into Dirichlet pseudo-counts, which are then dispersed through the graph using the personalized PageRank algorithm. The architecture of GPNs is motivated by a set of three axioms on the properties of its uncertainty estimates. We show that those axioms are not always satisfied in practice and therefore propose the family of Committee-based Uncertainty Quantification Graph Neural Networks (CUQ-GNNs), which combine standard Graph Neural Networks with the NF-based uncertainty estimation of Posterior Networks (PostNets). This approach adapts more flexibly to domain-specific demands on the properties of uncertainty estimates. We compare CUQ-GNN against GPN and other uncertainty quantification approaches on common node classification benchmarks and show that it is effective at producing useful uncertainty estimates.
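
The step in which Dirichlet pseudo-counts are dispersed through the graph can be sketched as a personalized PageRank power iteration. This is a minimal illustration, not the GPN or CUQ-GNN implementation: the function name, the row-normalized (random-walk) adjacency, and the default `alpha` are assumptions, and the pseudo-counts are supplied by hand instead of coming from a normalizing flow.

```python
def propagate_pseudo_counts(neighbors, counts, alpha=0.1, iterations=50):
    """Diffuse per-node pseudo-count vectors with personalized PageRank.

    neighbors: dict node -> list of adjacent nodes (each node needs >= 1).
    counts: dict node -> list of class pseudo-counts before propagation.
    alpha: teleport (restart) probability.
    """
    current = {node: list(vec) for node, vec in counts.items()}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbors.items():
            num_classes = len(counts[node])
            # Random-walk step: average the neighbors' current vectors.
            mean = [
                sum(current[other][c] for other in adjacent) / len(adjacent)
                for c in range(num_classes)
            ]
            # Mix the walk step with a restart to the node's own counts.
            updated[node] = [
                (1 - alpha) * mean[c] + alpha * counts[node][c]
                for c in range(num_classes)
            ]
        current = updated
    return current


# Two connected nodes; only node 0 has evidence for class 0.
neighbors = {0: [1], 1: [0]}
counts = {0: [4.0, 0.0], 1: [0.0, 0.0]}
result = propagate_pseudo_counts(neighbors, counts, alpha=0.5, iterations=60)
print(result)
```

In this toy graph node 1 has no evidence of its own and inherits pseudo-counts from node 0, so its uncertainty estimate is shaped by its neighborhood. That coupling is the kind of behavior the axioms discussed in the abstract are concerned with.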

[LG-33] Active-Passive Federated Learning for Vertically Partitioned Multi-view Data

链接: https://arxiv.org/abs/2409.04111
作者: Jiyuan Liu,Xinwang Liu,Siqi Wang,Xingchen Hu,Qing Liao,Xinhang Wan,Yi Zhang,Xin Lv,Kunlun He
关键词-EN: integrate multi-view data, multi-view data vertically, data vertically partitioned, Vertical federated learning, Vertical federated
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vertical federated learning is a natural and elegant approach to integrate multi-view data vertically partitioned across devices (clients) while preserving their privacy. Apart from model training, existing methods require the collaboration of all clients in model inference. However, model inference is likely to be maintained as a service for a long time, while the collaboration, especially when the clients belong to different organizations, is unpredictable in real-world scenarios (for example, cancellation of contracts or network unavailability), which causes inference to fail. To address this issue, we propose, as a first attempt, a flexible Active-Passive Federated learning (APFed) framework. Specifically, the active client is the initiator of a learning task and is responsible for building the complete model, while the passive clients only serve as assistants. Once the model is built, the active client can make inferences independently. In addition, we instantiate the APFed framework as two classification methods that employ the reconstruction loss and the contrastive loss on passive clients, respectively. The two methods are tested in a set of experiments and achieve the desired results, validating their effectiveness.

[LG-34] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100 NLP Researchers

链接: https://arxiv.org/abs/2409.04109
作者: Chenglei Si,Diyi Yang,Tatsunori Hashimoto
关键词-EN: large language models, accelerate scientific discovery, Recent advancements, works proposing research, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: main paper is 20 pages

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.

[LG-35] MixNet: Joining Force of Classical and Modern Approaches Toward the Comprehensive Pipeline in Motor Imagery EEG Classification

链接: https://arxiv.org/abs/2409.04104
作者: Phairot Autthasan,Rattanaphon Chaisaen,Huy Phan,Maarten De Vos,Theerawit Wilaiprasitporn
关键词-EN: impacted motor imagery, significantly impacted motor, Recent advances, based brain-computer interface, motor imagery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
*备注: Supplementary materials and source codes are available on-line at this https URL

点击查看摘要

Abstract:Recent advances in deep learning (DL) have significantly impacted motor imagery (MI)-based brain-computer interface (BCI) systems, enhancing the decoding of electroencephalography (EEG) signals. However, most studies struggle to identify discriminative patterns across subjects during MI tasks, limiting MI classification performance. In this article, we propose MixNet, a novel classification framework designed to overcome this limitation by utilizing spectral-spatial signals from MI data, along with a multitask learning architecture named MIN2Net, for classification. Here, the spectral-spatial signals are generated using the filter-bank common spatial patterns (FBCSPs) method on MI data. Since the multitask learning architecture is used for the classification task, the learning in each task may exhibit different generalization rates and potential overfitting across tasks. To address this issue, we implement adaptive gradient blending, simultaneously regulating multiple loss weights and adjusting the learning pace for each task based on its generalization/overfitting tendencies. Experimental results on six benchmark data sets of different data sizes demonstrate that MixNet consistently outperforms all state-of-the-art algorithms in subject-dependent and -independent settings. Finally, the low-density EEG MI classification results show that MixNet outperforms all state-of-the-art algorithms, offering promising implications for Internet of Things (IoT) applications, such as lightweight and portable EEG wearable devices based on low-density montages.

[LG-36] The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models

链接: https://arxiv.org/abs/2409.04103
作者: Alberto Cattaneo,Stephen Bonner,Thomas Martynec,Carlo Luschi,Ian P Barrett,Daniel Justus
关键词-EN: Knowledge Graph Completion, Knowledge Graph Embedding, Graph Embedding models, Knowledge Graph, Graph Completion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions and a new suite of analysis tools, we invite the community to build upon our work and continue improving the understanding of these crucial applications.

[LG-37] Ultra-imbalanced classification guided by statistical information

链接: https://arxiv.org/abs/2409.04101
作者: Yin Jin,Ningtao Wang,Ruofan Wu,Pengfei Shi,Xing Fu,Weiqiang Wang
关键词-EN: real-world classification tasks, frequently encountered, encountered in real-world, classification tasks, UIC
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imbalanced data are frequently encountered in real-world classification tasks. Previous works on imbalanced learning mostly focused on learning with a minority class of few samples. However, the notion of imbalance also applies to cases where the minority class contains abundant samples, which is usually the case for industrial applications like fraud detection in the area of financial risk management. In this paper, we take a population-level approach to imbalanced learning by proposing a new formulation called ultra-imbalanced classification (UIC). Under UIC, loss functions behave differently even if an infinite number of training samples is available. To understand the intrinsic difficulty of UIC problems, we borrow ideas from information theory and establish a framework to compare different loss functions through the lens of statistical information. A novel learning objective termed Tunable Boosting Loss is developed, which is provably resistant to data imbalance under UIC and whose empirical efficiency is verified by extensive experimental studies on both public and industrial datasets.

[LG-38] UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

链接: https://arxiv.org/abs/2409.04081
作者: Yicheng Fu,Raviteja Anantha,Prabal Vashisht,Jianpeng Cheng,Etai Littwin
关键词-EN: Generating user intent, Generating user, intent, user intent, Generating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters and computing power, together with their high latency, make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, “Intent in the Wild” (IIW) and “Intent in the Tame” (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across two datasets. Notably, UI-JEPA accomplishes the performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency in the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding.

[LG-39] Online Residual Learning from Offline Experts for Pedestrian Tracking

链接: https://arxiv.org/abs/2409.04069
作者: Anastasios Vlachos,Anastasios Tsiamis,Aren Karapetyan,Efe C. Balta,John Lygeros
关键词-EN: predicting unknown targets, problem of predicting, predicting unknown, Online Residual Learning, ORL
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted to CDC 2024

点击查看摘要

Abstract:In this paper, we consider the problem of predicting unknown targets from data. We propose Online Residual Learning (ORL), a method that combines online adaptation with offline-trained predictions. At a lower level, we employ multiple offline predictions generated before or at the beginning of the prediction horizon. We augment every offline prediction by learning its residual error with respect to the true target state online, using the recursive least squares algorithm. At a higher level, we treat the augmented lower-level predictors as experts, adopting the Prediction with Expert Advice framework. We utilize an adaptive softmax weighting scheme to form an aggregate prediction and provide guarantees for ORL in terms of regret. We employ ORL to boost performance in the setting of online pedestrian trajectory prediction. Based on data from the Stanford Drone Dataset, we show that ORL can demonstrate best-of-both-worlds performance.
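
The two levels described in the abstract can be sketched for scalar targets as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: each residual model is a single bias term fitted by recursive least squares, the softmax weights use a fixed rate `eta` (the paper describes an adaptive scheme), and all class and function names are hypothetical.

```python
import math


class ScalarRLS:
    """Recursive least squares for the scalar model residual = theta * feature."""

    def __init__(self, forgetting=1.0, initial_covariance=1e3):
        self.theta = 0.0
        self.p = initial_covariance
        self.forgetting = forgetting

    def predict(self, feature):
        return self.theta * feature

    def update(self, feature, target):
        gain = self.p * feature / (self.forgetting + feature * self.p * feature)
        self.theta += gain * (target - self.theta * feature)
        self.p = (self.p - gain * feature * self.p) / self.forgetting


def softmax_weights(cumulative_losses, eta):
    """Exponential weights; subtracting the minimum avoids underflow."""
    best = min(cumulative_losses)
    scores = [math.exp(-eta * (loss - best)) for loss in cumulative_losses]
    total = sum(scores)
    return [score / total for score in scores]


def run_orl(offline_predictions, targets, eta=1.0):
    """offline_predictions: K lists of length T; targets: list of length T."""
    num_experts = len(offline_predictions)
    learners = [ScalarRLS() for _ in range(num_experts)]
    cumulative_losses = [0.0] * num_experts
    aggregate = []
    for t, target in enumerate(targets):
        # Lower level: offline prediction plus the residual learned so far.
        experts = [
            offline_predictions[i][t] + learners[i].predict(1.0)
            for i in range(num_experts)
        ]
        # Higher level: weight the augmented predictors by past performance.
        weights = softmax_weights(cumulative_losses, eta)
        aggregate.append(sum(w * p for w, p in zip(weights, experts)))
        # The true target is revealed; update losses and residual models.
        for i in range(num_experts):
            cumulative_losses[i] += (experts[i] - target) ** 2
            learners[i].update(1.0, target - offline_predictions[i][t])
    return aggregate


targets = [5.0] * 50
offline = [[4.0] * 50, [10.0] * 50]  # two biased offline predictors
predictions = run_orl(offline, targets)
print(predictions[0], predictions[-1])
```

Before any feedback arrives the aggregate is the plain average of the offline predictions; once the residual models have seen a few targets, both experts correct their bias and the aggregate tracks the target.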

[LG-40] FEM-based Neural Networks for Solving Incompressible Fluid Flows and Related Inverse Problems

链接: https://arxiv.org/abs/2409.04067
作者: Franziska Griese,Fabian Hoppe,Alexander Rüttgers,Philipp Knechtges
关键词-EN: partial differential equations, simulation and optimization, optimization of technical, technical systems, partial differential
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:The numerical simulation and optimization of technical systems described by partial differential equations is expensive, especially in multi-query scenarios in which the underlying equations have to be solved for different parameters. A comparatively new approach in this context is to combine the good approximation properties of neural networks (for parameter dependence) with the classical finite element method (for discretization). However, instead of considering the solution mapping of the PDE from the parameter space into the FEM-discretized solution space as a purely data-driven regression problem, so-called physically informed regression problems have proven to be useful. In these, the equation residual is minimized during the training of the neural network, i.e. the neural network “learns” the physics underlying the problem. In this paper, we extend this approach to saddle-point and non-linear fluid dynamics problems, respectively, namely stationary Stokes and stationary Navier-Stokes equations. In particular, we propose a modification of the existing approach: Instead of minimizing the plain vanilla equation residual during training, we minimize the equation residual modified by a preconditioner. By analogy with the linear case, this also improves the condition in the present non-linear case. Our numerical examples demonstrate that this approach significantly reduces the training effort and greatly increases accuracy and generalizability. Finally, we show the application of the resulting parameterized model to a related inverse problem.
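
The effect of minimizing a preconditioned residual instead of the plain one can be seen on a toy linear system. The sketch below uses a Jacobi (diagonal) preconditioner purely as an assumption for illustration; the abstract does not state which preconditioner the authors use, and their setting is the non-linear, FEM-discretized Stokes and Navier-Stokes case.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def residual(matrix, solution, rhs):
    """Plain equation residual A u - b."""
    return [lhs - b for lhs, b in zip(matvec(matrix, solution), rhs)]


def jacobi_preconditioned_residual(matrix, solution, rhs):
    """Residual scaled by the inverse diagonal of A (Jacobi preconditioner)."""
    plain = residual(matrix, solution, rhs)
    return [r / matrix[i][i] for i, r in enumerate(plain)]


def squared_norm(vector):
    return sum(x * x for x in vector)


# Badly scaled system whose exact solution is u = [1, 1].
system = [[1.0, 0.0], [0.0, 100.0]]
rhs = [1.0, 100.0]
guess = [0.0, 0.0]

plain_loss = squared_norm(residual(system, guess, rhs))
preconditioned_loss = squared_norm(
    jacobi_preconditioned_residual(system, guess, rhs)
)
print(plain_loss, preconditioned_loss)
```

Both unknowns are equally far from the solution, yet the plain loss is dominated by the second equation by a factor of 10^4. After preconditioning the two equations contribute equally, which is the conditioning improvement the abstract refers to for the linear case.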

[LG-41] D4: Text-guided diffusion model-based domain adaptive data augmentation for vineyard shoot detection

链接: https://arxiv.org/abs/2409.04060
作者: Kentaro Hirahara,Chikahito Nakane,Hajime Ebisawa,Tsuyoshi Kuroda,Yohei Iwaki,Tomoyoshi Utsumi,Yuichiro Nomura,Makoto Koike,Hiroshi Mineno
关键词-EN: training data, data augmentation method, plant phenotyping, gaining attention, generative data augmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In an agricultural field, plant phenotyping using object detection models is gaining attention. However, collecting the training data necessary to create generic and high-precision models is extremely challenging due to the difficulty of annotation and the diversity of domains. Furthermore, it is difficult to transfer training data across different crops, and although machine learning models effective for specific environments, conditions, or crops have been developed, they cannot be widely applied in actual fields. In this study, we propose a generative data augmentation method (D4) for vineyard shoot detection. D4 uses a pre-trained text-guided diffusion model based on a large number of original images culled from video data collected by unmanned ground vehicles or other means, and a small number of annotated datasets. The proposed method generates new annotated images with background information adapted to the target domain while retaining annotation information necessary for object detection. In addition, D4 overcomes the lack of training data in agriculture, including the difficulty of annotation and diversity of domains. We confirmed that this generative data augmentation method improved the mean average precision by up to 28.65% for the BBox detection task and the average precision by up to 13.73% for the keypoint detection task for vineyard shoot detection. Our generative data augmentation method D4 is expected to simultaneously solve the cost and domain diversity issues of training data generation in agriculture and improve the generalization performance of detection models.

[LG-42] Heterogeneity-Aware Cooperative Federated Edge Learning with Adaptive Computation and Communication Compression

链接: https://arxiv.org/abs/2409.04022
作者: Zhenxiao Zhang,Zhidong Gao,Yuanxiong Guo,Yanmin Gong
关键词-EN: mobile edge networks, multiple edge servers, edge servers collaboratively, federated edge learning, servers collaboratively coordinate
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:Motivated by the drawbacks of cloud-based federated learning (FL), cooperative federated edge learning (CFEL) has been proposed to improve efficiency for FL over mobile edge networks, where multiple edge servers collaboratively coordinate the distributed model training across a large number of edge devices. However, CFEL faces critical challenges arising from dynamic and heterogeneous device properties, which slow down the convergence and increase resource consumption. This paper proposes a heterogeneity-aware CFEL scheme called Heterogeneity-Aware Cooperative Edge-based Federated Averaging (HCEF) that aims to maximize the model accuracy while minimizing the training time and energy consumption via adaptive computation and communication compression in CFEL. By theoretically analyzing how local update frequency and gradient compression affect the convergence error bound in CFEL, we develop an efficient online control algorithm for HCEF to dynamically determine local update frequencies and compression ratios for heterogeneous devices. Experimental results show that compared with prior schemes, the proposed HCEF scheme can maintain higher model accuracy while reducing training latency and improving energy efficiency simultaneously.
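
To make "compression ratio" concrete, the sketch below applies top-k sparsification, a common gradient compressor, with a different ratio per device. Top-k is an assumption here: the abstract does not name the compression operator used by HCEF, and the device names and ratios are invented for illustration.

```python
def top_k_compress(gradient, ratio):
    """Keep the largest-magnitude entries of `gradient`; zero out the rest.

    ratio: fraction of entries to keep, in (0, 1]. At least one is kept.
    """
    keep = max(1, int(len(gradient) * ratio))
    ranked = sorted(
        range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True
    )
    kept = set(ranked[:keep])
    return [g if i in kept else 0.0 for i, g in enumerate(gradient)]


gradient = [0.1, -3.0, 2.0, 0.5]
# Hypothetical per-device ratios: a constrained device sends fewer entries.
device_ratios = {"fast_device": 0.5, "slow_device": 0.25}
compressed = {
    name: top_k_compress(gradient, ratio)
    for name, ratio in device_ratios.items()
}
print(compressed)
```

An online controller such as the one described in the abstract would choose these ratios, together with the local update frequencies, per device and per round; here they are fixed by hand.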

[LG-43] Over-parameterized regression methods and their application to semi-supervised learning

链接: https://arxiv.org/abs/2409.04001
作者: Katsuyuki Hagiwara
关键词-EN: minimum norm, norm least squares, estimation strategy, over-parameterized case, helpful tool
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Minimum norm least squares is an estimation strategy for the over-parameterized case and, in machine learning, is known as a helpful tool for understanding the nature of deep learning. In this paper, to apply it in the context of non-parametric regression problems, we established several methods based on thresholding of SVD (singular value decomposition) components, which are referred to as SVD regression methods. We considered several methods: singular value based thresholding, hard-thresholding with cross validation, universal thresholding and bridge thresholding. Information on output samples is not utilized in the first method while it is utilized in the other methods. We then applied them to semi-supervised learning, in which unlabeled input samples are incorporated into kernel functions in a regressor. The experimental results for real data showed that, depending on the datasets, the SVD regression methods are superior to a naive ridge regression method. Unfortunately, there was no clear advantage for the methods utilizing information on output samples. Furthermore, depending on the datasets, incorporation of unlabeled input samples into kernels is found to have certain advantages.

[LG-44] Goal-Reaching Policy Learning from Non-Expert Observations via Effective Subgoal Guidance

链接: https://arxiv.org/abs/2409.03996
作者: RenMing Huang,Shaochong Liu,Yunqiang Pei,Peng Wang,Guoqing Wang,Yang Yang,Hengtao Shen
关键词-EN: address the challenging, challenging problem, action-free observation data, data, goal-reaching policy learning
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted to CoRL 2024

点击查看摘要

Abstract:In this work, we address the challenging problem of long-horizon goal-reaching policy learning from non-expert, action-free observation data. Unlike fully labeled expert data, our data is more accessible and avoids the costly process of action labeling. Additionally, compared to online learning, which often involves aimless exploration, our data provides useful guidance for more efficient exploration. To achieve our goal, we propose a novel subgoal guidance learning strategy. The motivation behind this strategy is that long-horizon goals offer limited guidance for efficient exploration and accurate state transition. We develop a diffusion strategy-based high-level policy to generate reasonable subgoals as waypoints, preferring states that more easily lead to the final goal. Additionally, we learn state-goal value functions to encourage efficient subgoal reaching. These two components naturally integrate into the off-policy actor-critic framework, enabling efficient goal attainment through informative exploration. We evaluate our method on complex robotic navigation and manipulation tasks, demonstrating a significant performance advantage over existing methods. Our ablation study further shows that our method is robust to observation data with various corruptions.

[LG-45] An Efficient and Generalizable Symbolic Regression Method for Time Series Analysis

链接: https://arxiv.org/abs/2409.03986
作者: Yi Xie,Tianyu Qiu,Yun Xiong,Xiuqi Huang,Xiaofeng Gao,Chao Chen
关键词-EN: accurate future predictions, offering accurate future, Time series, generally falling short, underlying evolution patterns
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Time series analysis and prediction methods currently excel in quantitative analysis, offering accurate future predictions and diverse statistical indicators, but generally falling short in elucidating the underlying evolution patterns of time series. To gain a more comprehensive understanding and provide insightful explanations, we utilize symbolic regression techniques to derive explicit expressions for the non-linear dynamics in the evolution of time series variables. However, these techniques face challenges in computational efficiency and generalizability across diverse real-world time series data. To overcome these challenges, we propose Neural-Enhanced Monte-Carlo Tree Search (NEMoTS) for time series. NEMoTS leverages the exploration-exploitation balance of Monte-Carlo Tree Search (MCTS), significantly reducing the search space in symbolic regression and improving expression quality. Furthermore, by integrating neural networks with MCTS, NEMoTS not only capitalizes on their superior fitting capabilities to concentrate on more pertinent operations post-search space reduction, but also replaces the complex and time-consuming simulation process, thereby substantially improving computational efficiency and generalizability in time series analysis. NEMoTS offers an efficient and comprehensive approach to time series analysis. Experiments with three real-world datasets demonstrate NEMoTS’s significant superiority in performance, efficiency, reliability, and interpretability, making it well-suited for large-scale real-world time series data.

[LG-46] Algorithmic Collusion Without Threats

链接: https://arxiv.org/abs/2409.03956
作者: Eshwar Ram Arunachaleswaran,Natalie Collina,Sampath Kannan,Aaron Roth,Juba Ziani
关键词-EN: substantial recent concern, collude. Supra-competitive prices, Supra-competitive prices, automatically learned, substantial recent
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
*备注:

点击查看摘要

Abstract:There has been substantial recent concern that pricing algorithms might learn to “collude.” Supra-competitive prices can emerge as a Nash equilibrium of repeated pricing games, in which sellers play strategies which threaten to punish their competitors who refuse to support high prices, and these strategies can be automatically learned. In fact, a standard economic intuition is that supra-competitive prices emerge from either the use of threats, or a failure of one party to optimize their payoff. Is this intuition correct? Would preventing threats in algorithmic decision-making prevent supra-competitive prices when sellers are optimizing for their own revenue? No. We show that supra-competitive prices can emerge even when both players are using algorithms which do not encode threats, and which optimize for their own revenue. We study sequential pricing games in which a first mover deploys an algorithm and then a second mover optimizes within the resulting environment. We show that if the first mover deploys any algorithm with a no-regret guarantee, and then the second mover even approximately optimizes within this now static environment, monopoly-like prices arise. The result holds for any no-regret learning algorithm deployed by the first mover and for any pricing policy of the second mover that obtains them profit at least as high as a random pricing would -- and hence the result applies even when the second mover is optimizing only within a space of non-responsive pricing distributions which are incapable of encoding threats. In fact, there exists a set of strategies, neither of which explicitly encodes threats, that form a Nash equilibrium of the simultaneous pricing game in algorithm space, and lead to near monopoly prices. This suggests that the definition of “algorithmic collusion” may need to be expanded, to include strategies without explicitly encoded threats.

[LG-47] Epistemic Uncertainty and Observation Noise with the Neural Tangent Kernel

链接: https://arxiv.org/abs/2409.03953
作者: Sergio Calvo-Ordoñez,Konstantina Palla,Kamil Ciosek
关键词-EN: Neural Tangent Kernel, wide neural networks, Gaussian Process, Tangent Kernel, Neural Tangent
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages including appendix

点击查看摘要

Abstract:Recent work has shown that training wide neural networks with gradient descent is formally equivalent to computing the mean of the posterior distribution in a Gaussian Process (GP) with the Neural Tangent Kernel (NTK) as the prior covariance and zero aleatoric noise (Jacot et al., 2018). In this paper, we extend this framework in two ways. First, we show how to deal with non-zero aleatoric noise. Second, we derive an estimator for the posterior covariance, giving us a handle on epistemic uncertainty. Our proposed approach integrates seamlessly with standard training pipelines, as it involves training a small number of additional predictors using gradient descent on a mean squared error loss. We demonstrate the proof-of-concept of our method through empirical evaluation on synthetic regression.

[LG-48] The Veracity Problem: Detecting False Information and its Propagation on Online Social Media Networks

Link: https://arxiv.org/abs/2409.03948
Authors: Sarah Condran
Keywords (EN): Detecting false information, Detecting false, negative societal impacts, false information, media is critical
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures

Abstract:Detecting false information on social media is critical in mitigating its negative societal impacts. To reduce the propagation of false information, automated detection provides scalable, unbiased, and cost-effective methods. However, three research areas have been identified which, once solved, would improve detection. First, current AI-based solutions often provide a uni-dimensional analysis on a complex, multi-dimensional issue, with solutions differing based on the features used. Furthermore, these methods do not account for the temporal and dynamic changes observed within the document’s life cycle. Second, there has been little research on the detection of coordinated information campaigns and in understanding the intent of the actors and the campaign. Third, there is a lack of consideration of cross-platform analysis, with existing datasets focusing on a single platform, such as X, and detection models designed for a specific platform. This work aims to develop methods for effective detection of false information and its propagation. To this end, firstly we aim to propose the creation of an ensemble multi-faceted framework that leverages multiple aspects of false information. Secondly, we propose a method to identify actors and their intent when working in coordination to manipulate a narrative. Thirdly, we aim to analyse the impact of cross-platform interactions on the propagation of false information via the creation of a new dataset.
Related DOI: https://doi.org/10.1145/3627673.3680265

[LG-49] Generating High Dimensional User-Specific Wireless Channels using Diffusion Models

Link: https://arxiv.org/abs/2409.03924
Authors: Taekyun Lee, Juseong Park, Hyeji Kim, Jeffrey G. Andrews
Keywords (EN): MAC layer functions, Deep neural network, Deep neural, wireless communication systems, physical and MAC
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Deep neural network (DNN)-based algorithms are emerging as an important tool for many physical and MAC layer functions in future wireless communication systems, including for large multi-antenna channels. However, training such models typically requires a large dataset of high-dimensional channel measurements, which are very difficult and expensive to obtain. This paper introduces a novel method for generating synthetic wireless channel data using diffusion-based models to produce user-specific channels that accurately reflect real-world wireless environments. Our approach employs a conditional denoising diffusion implicit models (cDDIM) framework, effectively capturing the relationship between user location and multi-antenna channel characteristics. We generate synthetic high fidelity channel samples using user positions as conditional inputs, creating larger augmented datasets to overcome measurement scarcity. The utility of this method is demonstrated through its efficacy in training various downstream tasks such as channel compression and beam alignment. Our approach significantly improves over prior methods, such as adding noise or using generative adversarial networks (GANs), especially in scenarios with limited initial measurements.

[LG-50] A Survey on Signed Graph Embedding: Methods and Applications

Link: https://arxiv.org/abs/2409.03916
Authors: Shrabani Ghosh
Keywords (EN): edges carry sign, carry sign information, sign information attached, edges carry, information attached
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:

Abstract:A signed graph (SG) is a graph where edges carry sign information attached to them. The sign of a network can be positive, negative, or neutral. Signed networks are ubiquitous among real-world networks such as social networks, citation networks, and various technical networks. Many network embedding models have been proposed and developed for signed networks of both homogeneous and heterogeneous types. SG embedding learns low-dimensional vector representations for the nodes of a network, which helps with many network analysis tasks such as link prediction, node classification, and community detection. In this survey, we perform a comprehensive study of SG embedding methods and applications. We introduce here the basic theories and methods of SGs and survey the current state of the art of signed graph embedding methods. In addition, we explore the applications of different types of SG embedding methods in real-world scenarios. As an application, we have explored the citation network to analyze authorship networks. We also provide source code and datasets to give future direction. Lastly, we explore the challenges of SG embedding and forecast various future research directions in this field.

[LG-51] Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning

Link: https://arxiv.org/abs/2409.03915
Authors: Huizhen Yu, Yi Wan, Richard S. Sutton
Keywords (EN): semi-Markov decision processes, asynchronous stochastic approximation, RVI Q-learning, paper studies asynchronous, studies asynchronous stochastic
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: The materials in this paper extend the authors’ results from 2023, reported in arXiv:2408.16262 and arXiv:2312.15091. This paper incorporates and subsumes the results of arXiv:2312.15091 and serves as Part II of arXiv:2408.16262

Abstract:This paper studies asynchronous stochastic approximation (SA) algorithms and their application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion. We first extend Borkar and Meyn’s stability proof method to accommodate more general noise conditions, leading to broader convergence guarantees for asynchronous SA algorithms. Leveraging these results, we establish the convergence of an asynchronous SA analogue of Schweitzer’s classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. Furthermore, to fully utilize the SA results in this application, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework, and we address them with novel proof arguments in the stability and convergence analysis of RVI Q-learning.

[LG-52] WaterMAS: Sharpness-Aware Maximization for Neural Network Watermarking

Link: https://arxiv.org/abs/2409.03902
Authors: Carl De Sousa Trias, Mihai Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione
Keywords (EN): deep neural networks, solving complex tasks, white-box neural network, neural network watermarking, utmost importance
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments:

Abstract:Nowadays, deep neural networks are used for solving complex tasks in several critical applications and protecting both their integrity and intellectual property rights (IPR) has become of utmost importance. To this end, we advance WaterMAS, a substitutive, white-box neural network watermarking method that improves the trade-off among robustness, imperceptibility, and computational complexity, while making provisions for increased data payload and security. WaterMAS insertion keeps the watermarked weights unchanged while sharpening their underlying gradient space. The robustness is thus ensured by limiting the attack’s strength: even small alterations of the watermarked weights would impact the model’s performance. The imperceptibility is ensured by inserting the watermark during the training process. The relationship among the WaterMAS data payload, imperceptibility, and robustness properties is discussed. The secret key is represented by the positions of the weights conveying the watermark, randomly chosen through multiple layers of the model. The security is evaluated by investigating the case in which an attacker would intercept the key. The experimental validations consider 5 models and 2 tasks (VGG16, ResNet18, MobileNetV3, SwinT for CIFAR10 image classification, and DeepLabV3 for Cityscapes image segmentation) as well as 4 types of attacks (Gaussian noise addition, pruning, fine-tuning, and quantization). The code will be released open-source upon acceptance of the article.

[LG-53] On the Convergence Rates of Federated Q-Learning across Heterogeneous Environments

Link: https://arxiv.org/abs/2409.03897
Authors: Muxing Wang, Pengkun Yang, Lili Su
Keywords (EN): Large-scale multi-agent systems, Large-scale multi-agent, wide geographic areas, geographic areas, multi-agent systems
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Large-scale multi-agent systems are often deployed across wide geographic areas, where agents interact with heterogeneous environments. There is an emerging interest in understanding the role of heterogeneity in the performance of the federated versions of classic reinforcement learning algorithms. In this paper, we study synchronous federated Q-learning, which aims to learn an optimal Q-function by having $K$ agents average their local Q-estimates every $E$ iterations. We observe an interesting phenomenon on the convergence speeds in terms of $K$ and $E$. Similar to the homogeneous environment settings, there is a linear speed-up concerning $K$ in reducing the errors that arise from sampling randomness. Yet, in sharp contrast to the homogeneous settings, $E>1$ leads to significant performance degradation. Specifically, we provide a fine-grained characterization of the error evolution in the presence of environmental heterogeneity, which decays to zero as the number of iterations $T$ increases. The slow convergence of having $E>1$ turns out to be fundamental rather than an artifact of our analysis. We prove that, for a wide range of stepsizes, the $\ell_\infty$ norm of the error cannot decay faster than $\Theta(E/T)$. In addition, our experiments demonstrate that the convergence exhibits an interesting two-phase phenomenon. For any given stepsize, there is a sharp phase-transition of the convergence: the error decays rapidly in the beginning yet later bounces up and stabilizes. Provided that the phase-transition time can be estimated, choosing different stepsizes for the two phases leads to faster overall convergence.
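
The averaging scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy environments made by `make_env`, and the default step size and discount factor are all assumptions for illustration. Each of the K agents runs synchronous Q-learning on its own environment, and every E iterations all agents replace their tables with the average.

```python
import random

def make_env(bias, n_states):
    """Hypothetical toy environment: the reward depends on the action and a per-agent bias."""
    def sample(rng, s, a):
        reward = bias + (1.0 if a == 0 else 0.0)
        return reward, rng.randrange(n_states)
    return sample

def federated_q_learning(envs, n_states, n_actions, T, E,
                         step_size=0.1, gamma=0.9, seed=0):
    """K = len(envs) agents; local Q-tables are averaged every E iterations."""
    rng = random.Random(seed)
    K = len(envs)
    q = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(K)]
    for t in range(1, T + 1):
        for k, sample in enumerate(envs):
            old = [row[:] for row in q[k]]  # synchronous: update from the previous iterate
            for s in range(n_states):
                for a in range(n_actions):
                    r, s_next = sample(rng, s, a)
                    target = r + gamma * max(old[s_next])
                    q[k][s][a] = (1 - step_size) * old[s][a] + step_size * target
        if t % E == 0:  # communication round: average and broadcast
            avg = [[sum(q[k][s][a] for k in range(K)) / K
                    for a in range(n_actions)] for s in range(n_states)]
            q = [[row[:] for row in avg] for _ in range(K)]
    return q

envs = [make_env(0.0, 3), make_env(0.5, 3)]  # two heterogeneous agents
q_tables = federated_q_learning(envs, n_states=3, n_actions=2, T=20, E=5)
```

With E = 1 the agents synchronize after every update. Larger E saves communication, which is the regime where the abstract reports slower convergence under heterogeneity.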

[LG-54] Understanding Fairness Metrics in Recommender Systems: A Healthcare Perspective

Link: https://arxiv.org/abs/2409.03893
Authors: Veronica Kecki, Alan Said
Keywords (EN): affect human lives, directly affect human, systems directly affect, critical concern, human lives
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Accepted to the 18th ACM Conference on Recommender Systems

Abstract:Fairness in AI-driven decision-making systems has become a critical concern, especially when these systems directly affect human lives. This paper explores the public’s comprehension of fairness in healthcare recommendations. We conducted a survey where participants selected from four fairness metrics – Demographic Parity, Equal Accuracy, Equalized Odds, and Positive Predictive Value – across different healthcare scenarios to assess their understanding of these concepts. Our findings reveal that fairness is a complex and often misunderstood concept, with a generally low level of public understanding regarding fairness metrics in recommender systems. This study highlights the need for enhanced information and education on algorithmic fairness to support informed decision-making in using these systems. Furthermore, the results suggest that a one-size-fits-all approach to fairness may be insufficient, pointing to the importance of context-sensitive designs in developing equitable AI systems.
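
To make the four metrics named in this abstract concrete, here is a small sketch that computes the per-group quantities they compare. It rests on the common textbook definitions, not on the survey instrument used in the paper, and the function name and toy data are hypothetical. Demographic Parity compares selection rates, Equal Accuracy compares accuracies, Equalized Odds compares true-positive and false-positive rates, and Positive Predictive Value compares precision.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group rates behind four common fairness metrics (binary labels 0/1)."""
    def ratio(num, den):
        return num / den if den else float("nan")

    stats = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        stats[g] = {
            "selection_rate": ratio(tp + fp, len(idx)),  # Demographic Parity
            "accuracy": ratio(tp + tn, len(idx)),        # Equal Accuracy
            "tpr": ratio(tp, tp + fn),                   # Equalized Odds (part 1)
            "fpr": ratio(fp, fp + tn),                   # Equalized Odds (part 2)
            "ppv": ratio(tp, tp + fp),                   # Positive Predictive Value
        }
    return stats

# Hypothetical toy data: two groups of four patients each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
stats = group_rates(y_true, y_pred, groups)
```

In this toy data both groups have the same accuracy while their selection rates and true-positive rates differ, which shows how a system can satisfy one metric and violate another.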

[LG-55] Overfitting Behaviour of Gaussian Kernel Ridgeless Regression: Varying Bandwidth or Dimensionality

Link: https://arxiv.org/abs/2409.03891
Authors: Marko Medvedev, Gal Vardi, Nathan Srebro
Keywords (EN): minimum norm interpolating, norm interpolating solutions, kernel ridge regression, input dimension varies, kernel ridgeless regression
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We consider the overfitting behavior of minimum norm interpolating solutions of Gaussian kernel ridge regression (i.e. kernel ridgeless regression), when the bandwidth or input dimension varies with the sample size. For fixed dimensions, we show that even with varying or tuned bandwidth, the ridgeless solution is never consistent and, at least with large enough noise, always worse than the null predictor. For increasing dimension, we give a generic characterization of the overfitting behavior for any scaling of the dimension with sample size. We use this to provide the first example of benign overfitting using the Gaussian kernel with sub-polynomial scaling dimension. All our results are under the Gaussian universality ansatz and the (non-rigorous) risk predictions in terms of the kernel eigenstructure.

[LG-56] The Influence of Faulty Labels in Data Sets on Human Pose Estimation

Link: https://arxiv.org/abs/2409.03887
Authors: Arnold Schwarz, Levente Hernadi, Felix Bießmann, Kristian Hildebrand
Keywords (EN): Human Pose Estimation, Pose Estimation, Human Pose, provide empirical evidence, empirical evidence demonstrating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures, 5 tables

Abstract:In this study we provide empirical evidence demonstrating that the quality of training data impacts model performance in Human Pose Estimation (HPE). Inaccurate labels in widely used data sets, ranging from minor errors to severe mislabeling, can negatively influence learning and distort performance metrics. We perform an in-depth analysis of popular HPE data sets to show the extent and nature of label inaccuracies. Our findings suggest that accounting for the impact of faulty labels will facilitate the development of more robust and accurate HPE models for a variety of real-world applications. We show improved performance with cleansed data.

[LG-57] Cost-Control in Display Advertising: Theory vs Practice

Link: https://arxiv.org/abs/2409.03874
Authors: Anoop R Katti, Rui C. Gonçalves, Rinchin Iakovlev
Keywords (EN): display advertising, dual variables, achieve a marketing, marketing objective, optimal bidding formula
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In display advertising, advertisers want to achieve a marketing objective with constraints on budget and cost-per-outcome. This is usually formulated as an optimization problem that maximizes the total utility under constraints. The optimization is carried out in an online fashion in the dual space - for an incoming Ad auction, a bid is placed using an optimal bidding formula, assuming optimal values for the dual variables; based on the outcome of the previous auctions, the dual variables are updated in an online fashion. While this approach is theoretically sound, in practice, the dual variables are not optimal from the beginning, but rather converge over time. Specifically, for the cost-constraint, the convergence is asymptotic. As a result, we find that cost-control is ineffective. In this work, we analyse the shortcomings of the optimal bidding formula and propose a modification that deviates from the theoretical derivation. We simulate various practical scenarios and study the cost-control behaviors of the two algorithms. Through a large-scale evaluation on the real-world data, we show that the proposed modification reduces the cost violations by 50%, thereby achieving a better cost-control than the theoretical bidding formula.
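
The "bid with the current dual variable, then update it from the auction outcome" loop can be sketched as follows. This is a deliberately simplified illustration and not the paper's formula: it handles only a budget constraint (the cost-per-outcome constraint is omitted), uses a standard dual-ascent pacing rule, and simulates second-price auctions with made-up uniform values and prices. All names and parameters are hypothetical.

```python
import random

def simulate_pacing(n_auctions, budget, eta, seed=0):
    """Online dual update for a budget constraint in simulated second-price auctions."""
    rng = random.Random(seed)
    target = budget / n_auctions  # average spend per auction that exhausts the budget
    lam = 0.0                     # dual variable of the budget constraint
    spend = 0.0
    for _ in range(n_auctions):
        value = rng.random()      # hypothetical value of the impression
        price = rng.random()      # hypothetical highest competing bid
        bid = value / (1.0 + lam)                 # bidding formula given the current dual
        cost = price if bid >= price else 0.0     # second-price payment when we win
        spend += cost
        # Dual ascent: raise lam after overspending, lower it after underspending.
        lam = max(0.0, lam + eta * (cost - target))
    return spend, lam

paced_spend, final_lam = simulate_pacing(1000, budget=50.0, eta=0.05)
unpaced_spend, _ = simulate_pacing(1000, budget=50.0, eta=0.0)
```

Because the dual variable starts at zero and only adapts as auctions arrive, early bids are too aggressive. This is the kind of transient behavior the abstract points to when it says the dual variables are not optimal from the beginning.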

[LG-58] Can We Theoretically Quantify the Impacts of Local Updates on the Generalization Performance of Federated Learning?

Link: https://arxiv.org/abs/2409.03863
Authors: Peizhong Ju, Haibo Yang, Jia Liu, Yingbin Liang, Ness Shroff
Keywords (EN): gained significant popularity, direct data sharing, significant popularity due, requiring direct data, Federated Learning
Categories: Machine Learning (cs.LG)
Comments: Published in MobiHoc 2024

Abstract:Federated Learning (FL) has gained significant popularity due to its effectiveness in training machine learning models across diverse sites without requiring direct data sharing. While various algorithms along with their optimization analyses have shown that FL with local updates is a communication-efficient distributed learning framework, the generalization performance of FL with local updates has received comparatively less attention. This lack of investigation can be attributed to the complex interplay between data heterogeneity and infrequent communication due to the local updates within the FL framework. This motivates us to investigate a fundamental question in FL: Can we quantify the impact of data heterogeneity and local updates on the generalization performance for FL as the learning process evolves? To this end, we conduct a comprehensive theoretical study of FL’s generalization performance using a linear model as the first step, where the data heterogeneity is considered for both the stationary and online/non-stationary cases. By providing closed-form expressions of the model error, we rigorously quantify the impact of the number of the local updates (denoted as $K$) under three settings ($K=1$, $K<\infty$, and $K=\infty$) and show how the generalization performance evolves with the number of rounds $t$. Our investigation also provides a comprehensive understanding of how different configurations (including the number of model parameters $p$ and the number of training samples $n$) contribute to the overall generalization performance, thus shedding new insights (such as benign overfitting) for implementing FL over networks.

[LG-59] Latent Space Energy-based Neural ODEs

Link: https://arxiv.org/abs/2409.03845
Authors: Sheng Cheng, Deqian Kong, Jianwen Xie, Kookjin Lee, Ying Nian Wu, Yezhou Yang
Keywords (EN): deep dynamical models, dynamical models designed, represent continuous-time sequence, continuous-time sequence data, paper introduces
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:This paper introduces a novel family of deep dynamical models designed to represent continuous-time sequence data. This family of models generates each data point in the time series by a neural emission model, which is a non-linear transformation of a latent state vector. The trajectory of the latent states is implicitly described by a neural ordinary differential equation (ODE), with the initial state following an informative prior distribution parameterized by an energy-based model. Furthermore, we can extend this model to disentangle dynamic states from underlying static factors of variation, represented as time-invariant variables in the latent space. We train the model using maximum likelihood estimation with Markov chain Monte Carlo (MCMC) in an end-to-end manner, without requiring additional assisting components such as an inference network. Our experiments on oscillating systems, videos and real-world state sequences (MuJoCo) illustrate that ODEs with the learnable energy-based prior outperform existing counterparts, and can generalize to new dynamic parameterization, enabling long-horizon predictions.

[LG-60] Neural Entropy

Link: https://arxiv.org/abs/2409.03817
Authors: Akhil Premkumar
Keywords (EN): examine the connection, connection between deep, deep learning, information, information theory
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT)
Comments: 37 pages + references, 11 figures

Abstract:We examine the connection between deep learning and information theory through the paradigm of diffusion models. Using well-established principles from non-equilibrium thermodynamics we can characterize the amount of information required to reverse a diffusive process. Neural networks store this information and operate in a manner reminiscent of Maxwell’s demon during the generative stage. We illustrate this cycle using a novel diffusion scheme we call the entropy matching model, wherein the information conveyed to the network during training exactly corresponds to the entropy that must be negated during reversal. We demonstrate that this entropy can be used to analyze the encoding efficiency and storage capacity of the network. This conceptual picture blends elements of stochastic optimal control, thermodynamics, information theory, and optimal transport, and raises the prospect of applying diffusion models as a test bench to understand neural networks.

[LG-61] How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Link: https://arxiv.org/abs/2409.03810
Authors: Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
Keywords (EN): growing interest, interest in studying, Recently, data, code
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Working in progress

Abstract:Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis on the data composition and find existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs. Our models and dataset are released in this https URL

[LG-62] Accelerate Neural Subspace-Based Reduced-Order Solver of Deformable Simulation by Lipschitz Optimization

Link: https://arxiv.org/abs/2409.03807
Authors: Aoran Lyu, Shixian Zhao, Chuhua Xian, Zhihao Cen, Hongmin Cai, Guoxin Fang
Keywords (EN): recently developed, proven effective, effective in diverse, diverse applications, accelerating physical simulations
Categories: Machine Learning (cs.LG); Graphics (cs.GR)
Comments:

Abstract:Reduced-order simulation is an emerging method for accelerating physical simulations with high DOFs, and recently developed neural-network-based methods with nonlinear subspaces have been proven effective in diverse applications as more concise subspaces can be detected. However, the complexity and landscape of simulation objectives within the subspace have not been optimized, which leaves room for enhancement of the convergence speed. This work focuses on this point by proposing a general method for finding optimized subspace mappings, enabling further acceleration of neural reduced-order simulations while capturing comprehensive representations of the configuration manifolds. We achieve this by optimizing the Lipschitz energy of the elasticity term in the simulation objective, and incorporating the cubature approximation into the training process to manage the high memory and time demands associated with optimizing the newly introduced energy. Our method is versatile and applicable to both supervised and unsupervised settings for optimizing the parameterizations of the configuration manifolds. We demonstrate the effectiveness of our approach through general cases in both quasi-static and dynamics simulations. Our method achieves acceleration factors of up to 6.83 while consistently preserving comparable simulation accuracy in various cases, including large twisting, bending, and rotational deformations with collision handling. This novel approach offers significant potential for accelerating physical simulations, and can be a good add-on to existing neural-network-based solutions in modeling complex deformable objects.

[LG-63] Protecting Activity Sensing Data Privacy Using Hierarchical Information Dissociation

Link: https://arxiv.org/abs/2409.03796
Authors: Guangjing Wang, Hanqing Guo, Yuanda Wang, Bocheng Chen, Ce Zhou, Qiben Yan
Keywords (EN): offering personalized services, Smartphones and wearable, sensing data, information, daily lives
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Smartphones and wearable devices have been integrated into our daily lives, offering personalized services. However, many apps become overprivileged as their collected sensing data contains unnecessary sensitive information. For example, mobile sensing data could reveal private attributes (e.g., gender and age) and unintended sensitive features (e.g., hand gestures when entering passwords). To prevent sensitive information leakage, existing methods must obtain private labels and users need to specify privacy policies. However, they only achieve limited control over information disclosure. In this work, we present Hippo to dissociate hierarchical information including private metadata and multi-grained activity information from the sensing data. Hippo achieves fine-grained control over the disclosure of sensitive information without requiring private labels. Specifically, we design a latent guidance-based diffusion model, which generates multi-grained versions of raw sensor data conditioned on hierarchical latent activity features. Hippo enables users to control the disclosure of sensitive information in sensing data, ensuring their privacy while preserving the necessary features to meet the utility requirements of applications. Hippo is the first unified model that achieves two goals: perturbing the sensitive attributes and controlling the disclosure of sensitive information in mobile sensing data. Extensive experiments show that Hippo can anonymize personal attributes and transform activity information at various resolutions across different types of sensing data.

[LG-64] HSF: Defending against Jailbreak Attacks with Hidden State Filtering

Link: https://arxiv.org/abs/2409.03788
Authors: Cheng Qian, Hainan Zhang, Lei Sha, Zhiming Zheng
Keywords (EN): ensure outputs align, avoid harmful content, LLM hidden state, content generation, jailbreak attacks
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:With the growing deployment of LLMs in daily applications like chatbots and content generation, efforts to ensure outputs align with human values and avoid harmful content have intensified. However, increasingly sophisticated jailbreak attacks threaten this alignment, aiming to induce unsafe outputs. Current defense efforts either focus on prompt rewriting or detection, which are limited in effectiveness due to the various design of jailbreak prompts, or on output control and detection, which are computationally expensive as they require LLM inference. Therefore, designing a pre-inference defense method that resists diverse jailbreak prompts is crucial for preventing LLM jailbreak attacks. We observe that jailbreak attacks, safe queries, and harmful queries exhibit different clustering patterns within the LLM’s hidden state representation space. This suggests that by leveraging the LLM’s hidden state representational capabilities, we can analyze the LLM’s forthcoming behavior and proactively intervene for defense. In this paper, we propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF), a lossless architectural defense mechanism that enables the model to preemptively identify and reject adversarial inputs before the inference process begins. We activate its defensive potential through an additional plugin module, effectively framing the defense task as a classification problem. Experimental results on two benchmark datasets, utilizing three different LLMs, show that HSF significantly enhances resilience against six cutting-edge jailbreak attacks. It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries, with negligible inference overhead, and outperforming defense baselines. Our code and data are available at https://anonymous.4open.science/r/Hidden-State-Filtering-8652/

[LG-65] A Greedy Hierarchical Approach to Whole-Network Filter-Pruning in CNNs

Link: https://arxiv.org/abs/2409.03777
Authors: Kiran Purohit, Anurag Reddy Parvathgari, Sourangshu Bhattacharya
Keywords (EN): Deep convolutional neural, convolutional neural networks, achieved impressive performance, Deep convolutional, computer vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in TMLR 2024

Abstract:Deep convolutional neural networks (CNNs) have achieved impressive performance in many computer vision tasks. However, their large model sizes require heavy computational resources, making pruning redundant filters from existing pre-trained CNNs an essential task in developing efficient models for resource-constrained devices. Whole-network filter pruning algorithms prune varying fractions of filters from each layer, hence providing greater flexibility. Current whole-network pruning methods are either computationally expensive due to the need to calculate the loss for each pruned filter using a training dataset, or use various heuristic / learned criteria for determining the pruning fractions for each layer. This paper proposes a two-level hierarchical approach for whole-network filter pruning which is efficient and uses the classification loss as the final criterion. The lower-level algorithm (called filter-pruning) uses a sparse-approximation formulation based on linear approximation of filter weights. We explore two algorithms: orthogonal matching pursuit-based greedy selection and a greedy backward pruning approach. The backward pruning algorithm uses a novel closed-form error criterion for efficiently selecting the optimal filter at each stage, thus making the whole algorithm much faster. The higher-level algorithm (called layer-selection) greedily selects the best-pruned layer (pruning using the filter-selection algorithm) using a global pruning criterion. We propose algorithms for two different global-pruning criteria: (1) layer-wise relative error (HBGS), and (2) final classification error (HBGTS). Our suite of algorithms outperforms state-of-the-art pruning methods on ResNet18, ResNet32, ResNet56, VGG16, and ResNext101. Our method reduces the RAM requirement for ResNext101 from 7.6 GB to 1.5 GB and achieves a 94% reduction in FLOPS without losing accuracy on CIFAR-10.

[LG-66] Federated Learning Approach to Mitigate Water Wastage

Link: https://arxiv.org/abs/2409.03776
Authors: Sina Hajer Ahmadi, Amruta Pranadika Mahashabde
Keywords: North America accounts, billion gallons daily, North America, water wasted due, America accounts
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Residential outdoor water use in North America accounts for nearly 9 billion gallons daily, with approximately 50% of this water wasted due to over-watering, particularly in lawns and gardens. This inefficiency highlights the need for smart, data-driven irrigation systems. Traditional approaches to reducing water wastage have focused on centralized data collection and processing, but such methods can raise privacy concerns and may not account for the diverse environmental conditions across different regions. In this paper, we propose a federated learning-based approach to optimize water usage in residential and agricultural settings. By integrating moisture sensors and actuators with a distributed network of edge devices, our system allows each user to locally train a model on their specific environmental data while sharing only model updates with a central server. This preserves user privacy and enables the creation of a global model that can adapt to varying conditions. Our implementation leverages low-cost hardware, including an Arduino Uno microcontroller and soil moisture sensors, to demonstrate how federated learning can be applied to reduce water wastage while maintaining efficient crop production. The proposed system not only addresses the need for water conservation but also provides a scalable, privacy-preserving solution adaptable to diverse environments.
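The "train locally, share only model updates" scheme described above is commonly realized with federated averaging. The sketch below assumes a simple linear model per household and a sample-size-weighted average on the server; the function names, the linear model, and the learning-rate settings are illustrative assumptions, not details from the paper.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=20):
    """Run a few gradient steps on one client's private data (least-squares loss)."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """Average the clients' locally trained weights, weighted by sample count."""
    updates = [local_update(w_global, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(updates, axis=0, weights=sizes)
```

Only the weight vectors leave each client; the raw sensor readings `X` and `y` stay on the device.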

[LG-67] Representation Learning of Complex Assemblies An Effort to Improve Corporate Scope 3 Emissions Calculation

Link: https://arxiv.org/abs/2409.03769
Authors: Ajay Chatterjee, Srikanth Ranganathan
Keywords: pressing global concern, climate impact, citizens alike, pressing global, Climate change
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Climate change is a pressing global concern for governments, corporations, and citizens alike. This concern underscores the necessity for these entities to accurately assess the climate impact of manufacturing goods and providing services. Tools like process life cycle analysis (pLCA) are used to evaluate the climate impact of production, use, and disposal, from raw material mining through end-of-life. pLCA further enables practitioners to look deeply into material choices or manufacturing processes for individual parts, sub-assemblies, assemblies, and the final product. Reliable and detailed data on the life cycle stages and processes of the product or service under study are not always available or accessible, resulting in inaccurate assessment of climate impact. To overcome the data limitation and enhance the effectiveness of pLCA to generate an improved environmental impact profile, we are adopting an innovative strategy to identify alternative parts, products, and components that share similarities in terms of their form, function, and performance to serve as qualified substitutes. Focusing on enterprise electronics hardware, we propose a semi-supervised learning-based framework to identify substitute parts that leverages product bill of material (BOM) data and a small amount of component-level qualified substitute data (positive samples) to generate machine knowledge graph (MKG) and learn effective embeddings of the components that constitute electronic hardware. Our methodology is grounded in attributed graph embeddings and introduces a strategy to generate biased negative samples to significantly enhance the training process. We demonstrate improved performance and generalization over existing published models.

[LG-68] EMCNet : Graph-Nets for Electron Micrographs Classification KDD2022

Link: https://arxiv.org/abs/2409.03767
Authors: Sakhinana Sagar Srinivas, Rajat Kumar Sarkar, Venkataramana Runkana
Keywords: materials processing industries, Characterization of materials, processing industries, materials processing, important and challenging
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures, accepted at an ACM SIGKDD 2022 Workshop on Machine Learning for Materials

Abstract:Characterization of materials via electron micrographs is an important and challenging task in several materials processing industries. Classification of electron micrographs is complex due to the high intra-class dissimilarity, high inter-class similarity, and multi-spatial scales of patterns. However, existing methods are ineffective in learning complex image patterns. We propose an effective end-to-end electron micrograph representation learning-based framework for nanomaterial identification to overcome the challenges. We demonstrate that our framework outperforms the popular baselines on the open-source datasets in nanomaterials-based identification tasks. The ablation studies are reported in great detail to support the efficacy of our approach.

[LG-69] Rethinking Deep Learning: Propagating Information in Neural Networks without Backpropagation and Statistical Optimization

Link: https://arxiv.org/abs/2409.03760
Authors: Kei Itoh
Keywords: resolving social issues, advancing human civilization, Developing strong, statistical weight optimization, weight optimization techniques
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:Developing strong AI signifies the arrival of technological singularity, contributing greatly to advancing human civilization and resolving social issues. Neural networks (NNs) and deep learning, which utilize NNs, are expected to lead to strong AI due to their biological neural system-mimicking structures. However, the statistical weight optimization techniques commonly used, such as error backpropagation and loss functions, may hinder the mimicry of neural systems. This study discusses the information propagation capabilities and potential practical applications of NNs as neural system-mimicking structures by solving the handwritten character recognition problem in the Modified National Institute of Standards and Technology (MNIST) database without using statistical weight optimization techniques like error backpropagation. In this study, the NN architecture comprises fully connected layers using step functions as activation functions, with 0-15 hidden layers, and no weight updates. The accuracy is calculated by comparing the average output vectors of the training data for each label with the output vectors of the test data, based on vector similarity. The results showed that the maximum accuracy achieved is around 80%. This indicates that NNs can propagate information correctly without using statistical weight optimization. Additionally, the accuracy decreased with an increasing number of hidden layers. This is attributed to the decrease in the variance of the output vectors as the number of hidden layers increases, suggesting that the output data becomes smooth. This study’s NNs and accuracy calculation methods are simple and have room for various improvements. Moreover, creating feedforward NNs that repeatedly cycle through ‘input - processing - output - environmental response - input - …’ could pave the way for practical software applications.
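The evaluation procedure in this abstract (fixed weights, step activations, and classification by similarity to per-label average output vectors) can be sketched as below. This is an illustrative reading only: the use of cosine similarity, Gaussian random weights, and a synthetic two-class dataset in place of MNIST are assumptions, and `make_toy_data` is a hypothetical helper.

```python
import numpy as np

def forward(X, weights):
    """Propagate inputs through fixed random layers with step activations."""
    h = X
    for W in weights:
        h = (h @ W > 0).astype(float)  # step function; W is never updated
    return h

def class_means(H, y, n_classes):
    """Average output vector of the training data for each label."""
    return np.stack([H[y == c].mean(axis=0) for c in range(n_classes)])

def predict(H, means):
    """Assign each output vector to the label whose mean is most similar (cosine)."""
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
    Mn = means / (np.linalg.norm(means, axis=1, keepdims=True) + 1e-12)
    return np.argmax(Hn @ Mn.T, axis=1)

def make_toy_data(rng, mu, n_per_class):
    """Hypothetical stand-in for MNIST: two Gaussian clusters at +mu and -mu."""
    y = np.repeat([0, 1], n_per_class)
    X = (1 - 2 * y)[:, None] * mu + rng.normal(size=(2 * n_per_class, mu.size))
    return X, y
```

No loss function or gradient appears anywhere: the only "training" is averaging the outputs per label.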

[LG-70] Quantum Kernel Methods under Scrutiny: A Benchmarking Study

Link: https://arxiv.org/abs/2409.04406
Authors: Jan Schnabel, Marco Roth
Keywords: gained increasing attention, probing promising applications, delivering intriguing research, intriguing research insights, quantum kernel methods
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 19 pages main text including 12 figures, appendix 25 pages with 31 figures

Abstract:Since the entry of kernel theory in the field of quantum machine learning, quantum kernel methods (QKMs) have gained increasing attention with regard to both probing promising applications and delivering intriguing research insights. Two common approaches for computing the underlying Gram matrix have emerged: fidelity quantum kernels (FQKs) and projected quantum kernels (PQKs). Benchmarking these methods is crucial to gain robust insights and to understand their practical utility. In this work, we present a comprehensive large-scale study examining QKMs based on FQKs and PQKs across a manifold of design choices. Our investigation encompasses both classification and regression tasks for five dataset families and 64 datasets, systematically comparing the use of FQKs and PQKs in quantum support vector machines and kernel ridge regression. This resulted in over 20,000 models that were trained and optimized using a state-of-the-art hyperparameter search to ensure robust and comprehensive insights. We delve into the importance of hyperparameters on model performance scores and support our findings through rigorous correlation analyses. In this, we also closely inspect two data encoding strategies. Moreover, we provide an in-depth analysis addressing the design freedom of PQKs and explore the underlying principles responsible for learning. Our goal is not to identify the best-performing model for a specific task but to uncover the mechanisms that lead to effective QKMs and reveal universal patterns.

[LG-71] Leveraging Machine Learning for Official Statistics: A Statistical Manifesto

Link: https://arxiv.org/abs/2409.04365
Authors: Marco Puts, David Salgado, Piet Daas
Keywords: machine learning, opportunities and challenges, Total Machine Learning, production to apply, presents both opportunities
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 29 pages, 4 figures, 1 table. To appear in the proceedings of the conference on Foundations and Advances of Machine Learning in Official Statistics, which was held in Wiesbaden, from 3rd to 5th April, 2024

Abstract:It is important for official statistics production to apply ML with statistical rigor, as it presents both opportunities and challenges. Although machine learning has enjoyed rapid technological advances in recent years, its application does not possess the methodological robustness necessary to produce high quality statistical results. In order to account for all sources of error in machine learning models, the Total Machine Learning Error (TMLE) is presented as a framework analogous to the Total Survey Error Model used in survey methodology. As a means of ensuring that ML models are both internally valid as well as externally valid, the TMLE model addresses issues such as representativeness and measurement errors. There are several case studies presented, illustrating the importance of applying more rigor to the application of machine learning in official statistics.

[LG-72] CISCA and CytoDArk0: a Cell Instance Segmentation and Classification method for histo(patho)logical image Analyses and a new open Nissl-stained dataset for brain cytoarchitecture studies

Link: https://arxiv.org/abs/2409.04175
Authors: Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Giulia Vadori, Livio Finos, Enrico Grisan
Keywords: complex task, biological investigations, Delineating and classifying, pivotal endeavor, medical and biological
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Delineating and classifying individual cells in microscopy tissue images is a complex task, yet it is a pivotal endeavor in various medical and biological investigations. We propose a new deep learning framework (CISCA) for automatic cell instance segmentation and classification in histological slices to support detailed morphological and structural analysis or straightforward cell counting in digital pathology workflows and brain cytoarchitecture studies. At the core of CISCA lies a network architecture featuring a lightweight U-Net with three heads in the decoder. The first head classifies pixels into boundaries between neighboring cells, cell bodies, and background, while the second head regresses four distance maps along four directions. The network outputs from the first and second heads are integrated through a tailored post-processing step, which ultimately yields the segmentation of individual cells. A third head enables simultaneous classification of cells into relevant classes, if required. We showcase the effectiveness of our method using four datasets, including CoNIC, PanNuke, and MoNuSeg, which are publicly available H&E datasets. Additionally, we introduce CytoDArk0, a novel dataset consisting of Nissl-stained images of the cortex, cerebellum, and hippocampus from mammals belonging to the orders Cetartiodactyla and Primates. We evaluate CISCA in comparison to other state-of-the-art methods, demonstrating CISCA’s robustness and accuracy in segmenting and classifying cells across diverse tissue types, magnifications, and staining techniques.

[LG-73] An efficient hp-Variational PINNs framework for incompressible Navier-Stokes equations

Link: https://arxiv.org/abs/2409.04143
Authors: Thivin Anandh, Divij Ghose, Ankit Tyagi, Abhineet Gupta, Suranjan Sarkar, Sashikumaar Ganesan
Keywords: Physics-informed neural networks, Variational Physics-Informed Neural, neural networks, Physics-informed neural, partial differential equations
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 18 pages, 13 tables and 20 figures

Abstract:Physics-informed neural networks (PINNs) are able to solve partial differential equations (PDEs) by incorporating the residuals of the PDEs into their loss functions. Variational Physics-Informed Neural Networks (VPINNs) and hp-VPINNs use the variational form of the PDE residuals in their loss function. Although hp-VPINNs have shown promise over traditional PINNs, they suffer from higher training times and lack a framework capable of handling complex geometries, which limits their application to more complex PDEs. As such, hp-VPINNs have not been applied in solving the Navier-Stokes equations, amongst other problems in CFD, thus far. FastVPINNs was introduced to address these challenges by incorporating tensor-based loss computations, significantly improving the training efficiency. Moreover, by using the bilinear transformation, the FastVPINNs framework was able to solve PDEs on complex geometries. In the present work, we extend the FastVPINNs framework to vector-valued problems, with a particular focus on solving the incompressible Navier-Stokes equations for two-dimensional forward and inverse problems, including problems such as the lid-driven cavity flow, the Kovasznay flow, and flow past a backward-facing step for Reynolds numbers up to 200. Our results demonstrate a 2x improvement in training time while maintaining the same order of accuracy compared to PINNs algorithms documented in the literature. We further showcase the framework’s efficiency in solving inverse problems for the incompressible Navier-Stokes equations by accurately identifying the Reynolds number of the underlying flow. Additionally, the framework’s ability to handle complex geometries highlights its potential for broader applications in computational fluid dynamics. This implementation opens new avenues for research on hp-VPINNs, potentially extending their applicability to more complex problems.

[LG-74] Half-VAE: An Encoder-Free VAE to Bypass Explicit Inverse Mapping

Link: https://arxiv.org/abs/2409.04140
Authors: Yuan-Hao Wei, Yan-Jie Sun, Chen Zhang
Keywords: closely related concepts, Bayesian inference, observed data, inverse mapping process, fundamentally involving
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Inference and inverse problems are closely related concepts, both fundamentally involving the deduction of unknown causes or parameters from observed data. Bayesian inference, a powerful class of methods, is often employed to solve a variety of problems, including those related to causal inference. Variational inference, a subset of Bayesian inference, is primarily used to efficiently approximate complex posterior distributions. Variational Autoencoders (VAEs), which combine variational inference with deep learning, have become widely applied across various domains. This study explores the potential of VAEs for solving inverse problems, such as Independent Component Analysis (ICA), without relying on an explicit inverse mapping process. Unlike other VAE-based ICA methods, this approach discards the encoder in the VAE architecture, directly setting the latent variables as trainable parameters. In other words, the latent variables are no longer outputs of the encoder but are instead optimized directly through the objective function to converge to appropriate values. We find that, with a suitable prior setup, the latent variables, represented by trainable parameters, can exhibit mutually independent properties as the parameters converge, all without the need for an encoding process. This approach, referred to as the Half-VAE, bypasses the inverse mapping process by eliminating the encoder. This study demonstrates the feasibility of using the Half-VAE to solve ICA without the need for an explicit inverse mapping process.

[LG-75] Study of Brain Network in Alzheimers Disease Using Wavelet-Based Graph Theory Method

Link: https://arxiv.org/abs/2409.04072
Authors: Ali Khazaee, Abdolreza Mohammadi, Ruairi Oreally
Keywords: neurodegenerative disorder marked, making early detection, early detection vital, timely intervention, neurodegenerative disorder
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:Alzheimer’s disease (AD) is a neurodegenerative disorder marked by memory loss and cognitive decline, making early detection vital for timely intervention. However, early diagnosis is challenging due to the heterogeneous presentation of symptoms. Resting-state fMRI (rs-fMRI) captures spontaneous brain activity and functional connectivity, which are known to be disrupted in AD and mild cognitive impairment (MCI). Traditional methods, such as Pearson’s correlation, have been used to calculate association matrices, but these approaches often overlook the dynamic and non-stationary nature of brain activity. In this study, we introduce a novel method that integrates discrete wavelet transform (DWT) and graph theory to model the dynamic behavior of brain networks. By decomposing rs-fMRI signals using DWT, our approach captures the time-frequency representation of brain activity, allowing for a more nuanced analysis of the underlying network dynamics. Graph theory provides a robust mathematical framework to analyze these complex networks, while machine learning is employed to automate the discrimination of different stages of AD based on learned patterns from different frequency bands. We applied our method to a dataset of rs-fMRI images from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, demonstrating its potential as an early diagnostic tool for AD and for monitoring disease progression. Our statistical analysis identifies specific brain regions and connections that are affected in AD and MCI, at different frequency bands, offering deeper insights into the disease’s impact on brain function.
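The pipeline described above (decompose each region's rs-fMRI time series with a DWT, then build one connectivity graph per frequency band) can be sketched as follows. The Haar wavelet, the Pearson correlation on detail coefficients, and the 0.3 threshold are assumptions chosen to keep the example dependency-free; the abstract does not specify them.

```python
import numpy as np

def haar_dwt(signals):
    """One Haar DWT level along the last axis (length must be even)."""
    even, odd = signals[..., 0::2], signals[..., 1::2]
    approx = (even + odd) / np.sqrt(2)
    detail = (even - odd) / np.sqrt(2)
    return approx, detail

def band_connectivity(signals, levels=3):
    """Return one region-by-region correlation matrix per detail band."""
    mats = []
    approx = signals  # shape: (regions, time points)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        mats.append(np.corrcoef(detail))
    return mats

def node_degrees(corr, threshold=0.3):
    """Threshold a correlation matrix into a graph and count each node's edges."""
    adj = np.abs(corr) > threshold
    np.fill_diagonal(adj, False)
    return adj.sum(axis=1)
```

Graph measures such as the degrees returned by `node_degrees`, computed separately per band, would then serve as features for a downstream classifier.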

[LG-76] Entry-Specific Matrix Estimation under Arbitrary Sampling Patterns through the Lens of Network Flows

Link: https://arxiv.org/abs/2409.03980
Authors: Yudong Chen, Xumei Xi, Christina Lee Yu
Keywords: low-rank matrix based, Matrix completion tackles, tackles the task, task of predicting, predicting missing
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Matrix completion tackles the task of predicting missing values in a low-rank matrix based on a sparse set of observed entries. It is often assumed that the observation pattern is generated uniformly at random or has a very specific structure tuned to a given algorithm. There is still a gap in our understanding when it comes to arbitrary sampling patterns. Given an arbitrary sampling pattern, we introduce a matrix completion algorithm based on network flows in the bipartite graph induced by the observation pattern. For additive matrices, the particular flow we use is the electrical flow, and we establish error upper bounds customized to each entry as a function of the observation set, along with matching minimax lower bounds. Our results show that the minimax squared error for recovery of a particular entry in the matrix is proportional to the effective resistance of the corresponding edge in the graph. Furthermore, we show that our estimator is equivalent to the least squares estimator. We apply our estimator to the two-way fixed effects model and show that it enables us to accurately infer individual causal effects and the unit-specific and time-specific confounders. For rank-1 matrices, we use edge-disjoint paths to form an estimator that achieves minimax optimal estimation when the sampling is sufficiently dense. Our discovery introduces a new family of estimators parametrized by network flows, which provide a fine-grained and intuitive understanding of the impact of the given sampling pattern on the relative difficulty of estimation at an entry-specific level. This graph-based approach allows us to quantify the inherent complexity of matrix completion for individual entries, rather than relying solely on global measures of performance.
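For the additive case, where each entry is modeled as a row effect plus a column effect, the abstract states that the flow-based estimator coincides with least squares and that per-entry error scales with effective resistance in the observation graph. A small sketch of both quantities, assuming unit-weight edges and hypothetical function names:

```python
import numpy as np

def fit_additive(n_rows, n_cols, observations):
    """Least-squares fit of M[i, j] = a[i] + b[j] from (i, j, value) triples."""
    A = np.zeros((len(observations), n_rows + n_cols))
    y = np.zeros(len(observations))
    for k, (i, j, v) in enumerate(observations):
        A[k, i] = 1.0
        A[k, n_rows + j] = 1.0
        y[k] = v
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimum-norm solution
    a, b = theta[:n_rows], theta[n_rows:]
    return a[:, None] + b[None, :]

def effective_resistance(n_rows, n_cols, edges, i, j):
    """Effective resistance between row node i and column node j (unit resistors)."""
    n = n_rows + n_cols
    L = np.zeros((n, n))  # Laplacian of the bipartite observation graph
    for r, c in edges:
        u, v = r, n_rows + c
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    e = np.zeros(n)
    e[i], e[n_rows + j] = 1.0, -1.0
    return float(e @ np.linalg.pinv(L) @ e)
```

Entries joined to the observed data only through long paths have high effective resistance, which gives an intuition for why they are harder to estimate than entries with many short connecting paths.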

[LG-77] Bi-modality Images Transfer with a Discrete Process Matching Method

Link: https://arxiv.org/abs/2409.03977
Authors: Zhe Xiong, Qiaoqiao Ding, Xiaoqun Zhang
Keywords: medical image synthesis, image synthesis gains, rapid development, image synthesis, medical image
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Recently, medical image synthesis has gained increasing popularity along with the rapid development of generative models. Medical image synthesis aims to generate an unacquired image modality, often from other observed data modalities. Synthesized images can be used for clinical diagnostic assistance, data augmentation for model training and validation, or image quality improvement. Meanwhile, flow-based models are among the successful generative models owing to their ability to generate realistic, high-quality synthetic images. However, most flow-based models need to compute the evolution steps of a flow ordinary differential equation (ODE) during the transfer process, so their performance is significantly limited by heavy computation time due to the large number of time iterations. In this paper, we propose a novel flow-based model, namely Discrete Process Matching (DPM), to accomplish bi-modality image transfer tasks. Unlike other flow-matching-based models, we propose to utilize both forward and backward ODE flows and to enhance consistency on the intermediate images at a few discrete time steps, resulting in a transfer process with far fewer iteration steps while maintaining high-quality generation for both modalities. Our experiments on three datasets of MRI T1/T2 and CT/MRI demonstrate that DPM outperforms other state-of-the-art flow-based methods for bi-modality image synthesis, achieving higher image quality at lower computational cost.

[LG-78] Average Causal Effect Estimation in DAGs with Hidden Variables: Extensions of Back-Door and Front-Door Criteria

Link: https://arxiv.org/abs/2409.03962
Authors: Anna Guo, Razieh Nabi
Keywords: directed acyclic graphs, g-formula remain limited, acyclic graphs, remain limited, hidden variables
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The identification theory for causal effects in directed acyclic graphs (DAGs) with hidden variables is well-developed, but methods for estimating and inferring functionals beyond the g-formula remain limited. Previous studies have proposed semiparametric estimators for identifiable functionals in a broad class of DAGs with hidden variables. While demonstrating double robustness in some models, existing estimators face challenges, particularly with density estimation and numerical integration for continuous variables, and their estimates may fall outside the parameter space of the target estimand. Their asymptotic properties are also underexplored, especially when using flexible statistical and machine learning models for nuisance estimation. This study addresses these challenges by introducing novel one-step corrected plug-in and targeted minimum loss-based estimators of causal effects for a class of DAGs that extend classical back-door and front-door criteria (known as the treatment primal fixability criterion in prior literature). These estimators leverage machine learning to minimize modeling assumptions while ensuring key statistical properties such as asymptotic linearity, double robustness, efficiency, and staying within the bounds of the target parameter space. We establish conditions for nuisance functional estimates in terms of L2-norms to achieve root-n consistent causal effect estimates. To facilitate practical application, we have developed the flexCausal package in R.

[LG-79] Active Sampling of Interpolation Points to Identify Dominant Subspaces for Model Reduction

Link: https://arxiv.org/abs/2409.03892
Authors: Celine Reddig, Pawan Goyal, Igor Pontes Duff, Peter Benner
Keywords: engineering design cycles, accelerate engineering design, low-dimensional surrogate models, active research field, construct low-dimensional surrogate
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Comments: 20 pages, 9 figures

Abstract:Model reduction is an active research field to construct low-dimensional surrogate models of high fidelity to accelerate engineering design cycles. In this work, we investigate model reduction for linear structured systems using dominant reachable and observable subspaces. When the training set - containing all possible interpolation points - is large, these subspaces can be determined by solving many large-scale linear systems. However, for high-fidelity models, this easily becomes computationally intractable. To circumvent this issue, we propose an active sampling strategy that samples only a few points from the given training set, which allows us to estimate those subspaces accurately. To this end, we formulate the identification of the subspaces as the solution of generalized Sylvester equations, guiding us to select the most relevant samples from the training set to achieve our goals. Consequently, we construct solutions of the matrix equations in low-rank forms, which encode subspace information. We extensively discuss computational aspects and efficient usage of the low-rank factors in the process of obtaining reduced-order models. We illustrate the proposed active sampling scheme to obtain reduced-order models via dominant reachable and observable subspaces and compare it with the method in which all the points from the training set are taken into account. It is shown that the active sampling strategy can provide a 17x speed-up without any noticeable loss of accuracy.

[LG-80] Resultant: Incremental Effectiveness on Likelihood for Unsupervised Out-of-Distribution Detection

Link: https://arxiv.org/abs/2409.03801
Authors: Yewen Li, Chaojie Wang, Xiaobo Xia, Xu He, Ruyi An, Dong Li, Tongliang Liu, Bo An, Xinrun Wang
Keywords: identify OOD data, OOD data samples, detector trained solely, OOD data, data samples
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Unsupervised out-of-distribution (U-OOD) detection is to identify OOD data samples with a detector trained solely on unlabeled in-distribution (ID) data. The likelihood function estimated by a deep generative model (DGM) could be a natural detector, but its performance is limited in some popular “hard” benchmarks, such as FashionMNIST (ID) vs. MNIST (OOD). Recent studies have developed various detectors based on DGMs to move beyond likelihood. However, despite their success on “hard” benchmarks, most of them struggle to consistently surpass or match the performance of likelihood on some “non-hard” cases, such as SVHN (ID) vs. CIFAR10 (OOD) where likelihood could be a nearly perfect detector. Therefore, we appeal for more attention to incremental effectiveness on likelihood, i.e., whether a method could always surpass or at least match the performance of likelihood in U-OOD detection. We first investigate the likelihood of variational DGMs and find its detection performance could be improved in two directions: i) alleviating latent distribution mismatch, and ii) calibrating the dataset entropy-mutual integration. Then, we apply two techniques for each direction, specifically post-hoc prior and dataset entropy-mutual calibration. The final method, named Resultant, combines these two directions for better incremental effectiveness compared to either technique alone. Experimental results demonstrate that the Resultant could be a new state-of-the-art U-OOD detector while maintaining incremental effectiveness on likelihood in a wide range of tasks.

[LG-81] Evaluating Machine Learning-based Skin Cancer Diagnosis

Link: https://arxiv.org/abs/2409.03794
Authors: Tanish Jain
Keywords: skin cancer detection, deep learning models, cancer detection, evaluates the reliability, deep learning
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages

Abstract:This study evaluates the reliability of two deep learning models for skin cancer detection, focusing on their explainability and fairness. Using the HAM10000 dataset of dermatoscopic images, the research assesses two convolutional neural network architectures: a MobileNet-based model and a custom CNN model. Both models are evaluated for their ability to classify skin lesions into seven categories and to distinguish between dangerous and benign lesions. Explainability is assessed using Saliency Maps and Integrated Gradients, with results interpreted by a dermatologist. The study finds that both models generally highlight relevant features for most lesion types, although they struggle with certain classes like seborrheic keratoses and vascular lesions. Fairness is evaluated using the Equalized Odds metric across sex and skin tone groups. While both models demonstrate fairness across sex groups, they show significant disparities in false positive and false negative rates between light and dark skin tones. A Calibrated Equalized Odds postprocessing strategy is applied to mitigate these disparities, resulting in improved fairness, particularly in reducing false negative rate differences. The study concludes that while the models show promise in explainability, further development is needed to ensure fairness across different skin tones. These findings underscore the importance of rigorous evaluation of AI models in medical applications, particularly in diverse population groups.
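Equalized Odds, the fairness criterion used above, asks that false positive and false negative rates match across groups. A minimal sketch of how the two gaps could be measured for a binary "dangerous vs. benign" classifier and a binary group attribute; the function names and the 0/1 encodings are assumptions for illustration.

```python
import numpy as np

def rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return fp / (fp + tn), fn / (fn + tp)

def equalized_odds_gaps(y_true, y_pred, group):
    """Absolute FPR and FNR differences between group 0 and group 1."""
    fpr_a, fnr_a = rates(y_true[group == 0], y_pred[group == 0])
    fpr_b, fnr_b = rates(y_true[group == 1], y_pred[group == 1])
    return abs(fpr_a - fpr_b), abs(fnr_a - fnr_b)
```

Both gaps are zero for a classifier that satisfies Equalized Odds exactly; a large FNR gap corresponds to the kind of disparity the study reports between skin tone groups.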

[LG-82] CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction

Link: https://arxiv.org/abs/2409.03773
Authors: Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li, Zixuan Wang, Jiangning Song, Guangyu Wang, Ting Chen
Keywords: protein-RNA binding affinity, binding affinity, binding affinity prediction, protein-RNA binding, binding
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)

Abstract:Accurately measuring protein-RNA binding affinity is crucial in many biological processes and drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features, unable to capture the binding mechanisms comprehensively. The recent emerging pre-trained language models trained on massive unsupervised sequences of protein and RNA have shown strong representation ability for various in-domain downstream tasks, including binding site prediction. However, applying different-domain language models collaboratively for complex-level tasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate for the first time that cross-biological modal language models can collaborate to improve binding affinity prediction. We propose a Co-Former to combine the cross-modal sequence and structure information and a bi-scope pre-training strategy for improving Co-Former’s interaction understanding. Meanwhile, we build the largest protein-RNA binding affinity dataset PRA310 for performance evaluation. We also test our model on a public dataset for mutation effect prediction. CoPRA reaches state-of-the-art performance on all the datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict the protein-RNA binding affinity; (2) understand the binding affinity change caused by mutations; and (3) benefit from scaling data and model size.

[LG-83] Exploiting XAI maps to improve MS lesion segmentation and detection in MRI

Link: https://arxiv.org/abs/2409.03772
Authors: Federico Spagnolo, Nataliia Molchanova, Mario Ocampo Pineda, Lester Melie-Garcia, Meritxell Bach Cuadra, Cristina Granziera, Vincent Andrearczyk, Adrien Depeursinge
Keywords: explain deep learning, deep learning algorithms, classification tasks, developed to explain, explain deep
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:To date, several methods have been developed to explain deep learning algorithms for classification tasks. Recently, an adaptation of two of such methods has been proposed to generate instance-level explainable maps in a semantic segmentation scenario, such as multiple sclerosis (MS) lesion segmentation. In the mentioned work, a 3D U-Net was trained and tested for MS lesion segmentation, yielding an F1 score of 0.7006, and a positive predictive value (PPV) of 0.6265. The distribution of values in explainable maps exposed some differences between maps of true and false positive (TP/FP) examples. Inspired by those results, we explore in this paper the use of characteristics of lesion-specific saliency maps to refine segmentation and detection scores. We generate around 21000 maps from as many TP/FP lesions in a batch of 72 patients (training set) and 4868 from the 37 patients in the test set. 93 radiomic features extracted from the first set of maps were used to train a logistic regression model and classify TP versus FP. On the test set, F1 score and PPV were improved by a large margin when compared to the initial model, reaching 0.7450 and 0.7817, with 95% confidence intervals of [0.7358, 0.7547] and [0.7679, 0.7962], respectively. These results suggest that saliency maps can be used to refine prediction scores, boosting a model’s performances.
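
The refinement idea in this abstract (score each predicted lesion as TP or FP, discard the likely FPs, then recompute F1 and PPV) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 0.5 threshold, and the toy probabilities are all assumptions, and the probabilities stand in for the output of the logistic regression trained on radiomic features.

```python
def detection_scores(tp, fp, fn):
    """Return (F1, PPV) from lesion-level counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return f1, ppv


def refine(candidates, missed, threshold=0.5):
    """Drop predicted lesions whose TP-probability is below `threshold`.

    candidates: list of (is_true_lesion, p_true) for each predicted lesion,
                where p_true is a hypothetical classifier output.
    missed:     ground-truth lesions the segmenter never predicted.
    """
    tp = sum(1 for is_true, p in candidates if is_true and p >= threshold)
    fp = sum(1 for is_true, p in candidates if not is_true and p >= threshold)
    # True lesions that the filter discards become false negatives.
    dropped_true = sum(1 for is_true, p in candidates if is_true and p < threshold)
    return detection_scores(tp, fp, missed + dropped_true)


# Toy data (invented for illustration only).
candidates = [(True, 0.9), (True, 0.8), (True, 0.6),
              (False, 0.7), (False, 0.2), (False, 0.1)]
baseline = refine(candidates, missed=1, threshold=0.0)  # keep every prediction
refined = refine(candidates, missed=1, threshold=0.5)   # filter likely FPs
print("baseline (F1, PPV):", baseline)
print("refined  (F1, PPV):", refined)
```

On this toy data the filter removes two of the three false positives without losing a true lesion, so both scores rise. With a stricter threshold the filter would also discard true lesions, which is why F1 has to be checked alongside PPV.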

[LG-84] Combining supervised and unsupervised learning methods to predict financial market movements

Link: https://arxiv.org/abs/2409.03762
Authors: Gabriel Rodrigues Palma, Mariusz Skoczeń, Phil Maguire
Keywords: Gaussian Mixture Models, decisions traders make, exploited for profit, traders make, make to buy
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments: 22 pages

Abstract:The decisions traders make to buy or sell an asset depend on various analyses, with expertise required to identify patterns that can be exploited for profit. In this paper we identify novel features extracted from emergent and well-established financial markets using linear models and Gaussian Mixture Models (GMM) with the aim of finding profitable opportunities. We used approximately six months of data consisting of minute candles from the Bitcoin, Pepecoin, and Nasdaq markets to derive and compare the proposed novel features with commonly used ones. These features were extracted based on the previous 59 minutes for each market and used to identify predictions for the hour ahead. We explored the performance of various machine learning strategies, such as Random Forests (RF) and K-Nearest Neighbours (KNN) to classify market movements. A naive random approach to selecting trading decisions was used as a benchmark, with outcomes assumed to be equally likely. We used a temporal cross-validation approach using test sets of 40%, 30% and 20% of total hours to evaluate the learning algorithms’ performances. Our results showed that filtering the time series facilitates algorithms’ generalisation. The GMM filtering approach revealed that the KNN and RF algorithms produced higher average returns than the random algorithm.

Information Retrieval

[IR-0] A Survey on Knowledge Organization Systems of Research Fields: Resources and Challenges

Link: https://arxiv.org/abs/2409.04432
Authors: Angelo Salatino, Tanay Aggarwal, Andrea Mannocci, Francesco Osborne, Enrico Motta
Keywords: Knowledge Organization Systems, Organization Systems, Knowledge Organization, play a fundamental, role in categorising
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Knowledge Organization Systems (KOSs), such as term lists, thesauri, taxonomies, and ontologies, play a fundamental role in categorising, managing, and retrieving information. In the academic domain, KOSs are often adopted for representing research areas and their relationships, primarily aiming to classify research articles, academic courses, patents, books, scientific venues, domain experts, grants, software, experiment materials, and several other relevant products and agents. These structured representations of research areas, widely embraced by many academic fields, have proven effective in empowering AI-based systems to i) enhance retrievability of relevant documents, ii) enable advanced analytic solutions to quantify the impact of academic research, and iii) analyse and forecast research dynamics. This paper aims to present a comprehensive survey of the current KOS for academic disciplines. We analysed and compared 45 KOSs according to five main dimensions: scope, structure, curation, usage, and links to other KOSs. Our results reveal a very heterogeneous scenario in terms of scope, scale, quality, and usage, highlighting the need for more integrated solutions for representing research knowledge across academic fields. We conclude by discussing the main challenges and the most promising future directions.

[IR-1] How Fair is Your Diffusion Recommender Model?

Link: https://arxiv.org/abs/2409.04339
Authors: Daniele Malitesta, Giacomo Medda, Erasmo Purificato, Ludovico Boratto, Fragkiskos D. Malliaros, Mirko Marras, Ernesto William De Luca
Keywords: generative adversarial networks, outperform traditional generative, Diffusion-based recommender systems, generative recommendation approaches, traditional generative recommendation
Categories: Information Retrieval (cs.IR)

Abstract:Diffusion-based recommender systems have recently proven to outperform traditional generative recommendation approaches, such as variational autoencoders and generative adversarial networks. Nevertheless, the machine learning literature has raised several concerns regarding the possibility that diffusion models, while learning the distribution of data samples, may inadvertently carry information bias and lead to unfair outcomes. In light of this aspect, and considering the relevance that fairness has held in recommendations over the last few decades, we conduct one of the first fairness investigations in the literature on DiffRec, a pioneer approach in diffusion-based recommendation. First, we propose an experimental setting involving DiffRec (and its variant L-DiffRec) along with nine state-of-the-art recommendation models, two popular recommendation datasets from the fairness-aware literature, and six metrics accounting for accuracy and consumer/provider fairness. Then, we perform a twofold analysis, one assessing models’ performance under accuracy and recommendation fairness separately, and the other identifying if and to what extent such metrics can strike a performance trade-off. Experimental results from both studies confirm the initial unfairness warnings but pave the way for how to address them in future research directions.

[IR-2] Enhancing Sequential Music Recommendation with Personalized Popularity Awareness RECSYS’24

Link: https://arxiv.org/abs/2409.04329
Authors: Davide Abbattista, Vito Walter Anelli, Tommaso Di Noia, Craig Macdonald, Aleksandr Vladimirovich Petrov
Keywords: systems have shown, shown promise, promise in capturing, capturing the dynamic, dynamic nature
Categories: Information Retrieval (cs.IR)
Comments: Accepted by RecSys’24 as an LBR paper

Abstract:In the realm of music recommendation, sequential recommender systems have shown promise in capturing the dynamic nature of music consumption. Nevertheless, traditional Transformer-based models, such as SASRec and BERT4Rec, while effective, encounter challenges due to the unique characteristics of music listening habits. In fact, existing models struggle to create a coherent listening experience due to rapidly evolving preferences. Moreover, music consumption is characterized by a prevalence of repeated listening, i.e., users frequently return to their favourite tracks, an important signal that could be framed as individual or personalized popularity. This paper addresses these challenges by introducing a novel approach that incorporates personalized popularity information into sequential recommendation. By combining user-item popularity scores with model-generated scores, our method effectively balances the exploration of new music with the satisfaction of user preferences. Experimental results demonstrate that a Personalized Most Popular recommender, a method solely based on user-specific popularity, outperforms existing state-of-the-art models. Furthermore, augmenting Transformer-based models with personalized popularity awareness yields superior performance, showing improvements ranging from 25.2% to 69.8%. The code for this paper is available at this https URL. Related DOI: https://doi.org/10.1145/3640457.3691719
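
To make "combining user-item popularity scores with model-generated scores" concrete, here is a minimal sketch. The abstract does not give the combination rule, so the convex mix with weight `alpha`, the function names, and the toy scores below are assumptions for illustration rather than the paper's method.

```python
from collections import Counter


def personalized_popularity(history):
    """Share of one user's past plays that went to each track."""
    counts = Counter(history)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def blend_scores(model_scores, history, alpha=0.5):
    """Convex mix of model scores and personalized popularity.

    alpha = 0 uses the sequential model only; alpha = 1 reduces to a
    "personalized most popular" recommender. The mixing rule is assumed.
    """
    pop = personalized_popularity(history) if history else {}
    items = set(model_scores) | set(pop)
    return {
        item: (1 - alpha) * model_scores.get(item, 0.0) + alpha * pop.get(item, 0.0)
        for item in items
    }


def top_k(scores, k):
    """Highest-scoring items first; ties broken by item id for determinism."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Toy example: the user replays track "a" often; the model prefers new track "c".
history = ["a", "a", "b", "a"]
model_scores = {"a": 0.1, "b": 0.2, "c": 0.6}
print(top_k(blend_scores(model_scores, history, alpha=0.5), k=2))
```

With `alpha=0.5` the frequently replayed track moves to the top while the model's new suggestion stays in the list, which is the exploration versus repetition balance the abstract describes.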

[IR-3] WarpAdam: A new Adam optimizer based on Meta-Learning approach

Link: https://arxiv.org/abs/2409.04244
Authors: Chengxi Pan, Junshang Chen, Jingrui Ye
Keywords: Adam optimizer, Adam, algorithms is crucial, optimizer, Meta Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Optimal selection of optimization algorithms is crucial for training deep learning models. The Adam optimizer has gained significant attention due to its efficiency and wide applicability. However, to enhance the adaptability of optimizers across diverse datasets, we propose an innovative optimization strategy by integrating the ‘warped gradient descend’ concept from Meta Learning into the Adam optimizer. In the conventional Adam optimizer, gradients are utilized to compute estimates of gradient mean and variance, subsequently updating model parameters. Our approach introduces a learnable distortion matrix, denoted as P, which is employed for linearly transforming gradients. This transformation slightly adjusts gradients during each iteration, enabling the optimizer to better adapt to distinct dataset characteristics. By learning an appropriate distortion matrix P, our method aims to adaptively adjust gradient information across different data distributions, thereby enhancing optimization performance. Our research showcases the potential of this novel approach through theoretical insights and empirical evaluations. Experimental results across various tasks and datasets validate the superiority of our optimizer that integrates the ‘warped gradient descend’ concept in terms of adaptability. Furthermore, we explore effective strategies for training the adaptation matrix P and identify scenarios where this method can yield optimal results. In summary, this study introduces an innovative approach that merges the ‘warped gradient descend’ concept from Meta Learning with the Adam optimizer. By introducing a learnable distortion matrix P within the optimizer, we aim to enhance the model’s generalization capability across diverse data distributions, thus opening up new possibilities in the field of deep learning optimization.
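
The core mechanism described here, a matrix P that linearly transforms the gradient before Adam's moment estimates are computed, can be sketched as below. This is an assumption-laden illustration: P is held fixed (the paper learns it, and the abstract does not say how), the function names are hypothetical, and the hyperparameter defaults are the usual Adam defaults rather than values from the paper.

```python
import math


def matvec(P, g):
    """Multiply matrix P (list of rows) by vector g."""
    return [sum(p_ij * g_j for p_ij, g_j in zip(row, g)) for row in P]


def init_state(n):
    """Fresh Adam state for an n-dimensional parameter vector."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def warped_adam_step(theta, grad, P, state,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update applied to the warped gradient P @ grad.

    P is treated as a given constant here; learning P is out of scope
    for this sketch.
    """
    g = matvec(P, grad)  # linear "distortion" of the raw gradient
    state["t"] += 1
    t = state["t"]
    # Standard Adam first and second moment estimates, on the warped gradient.
    state["m"] = [beta1 * m + (1 - beta1) * gi for m, gi in zip(state["m"], g)]
    state["v"] = [beta2 * v + (1 - beta2) * gi * gi for v, gi in zip(state["v"], g)]
    m_hat = [m / (1 - beta1 ** t) for m in state["m"]]
    v_hat = [v / (1 - beta2 ** t) for v in state["v"]]
    return [th - lr * mh / (math.sqrt(vh) + eps)
            for th, mh, vh in zip(theta, m_hat, v_hat)]


theta = [1.0, -2.0]
grad = [0.5, -0.25]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]  # an arbitrary non-trivial P for illustration
print(warped_adam_step(theta, grad, identity, init_state(2), lr=0.1))
print(warped_adam_step(theta, grad, swap, init_state(2), lr=0.1))
```

With the identity matrix the step is plain Adam. Any other P changes which gradient component drives each coordinate, which is the degree of freedom the learned matrix would exploit.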

[IR-4] Refining Wikidata Taxonomy using Large Language Models

Link: https://arxiv.org/abs/2409.04056
Authors: Yiwen Peng (IP Paris), Thomas Bonald (IP Paris), Mehwish Alam (IP Paris)
Keywords: Large Language Models, collaborative nature, taxonomic paths, presence of cycles, recurrent issues
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: ACM International Conference on Information and Knowledge Management, Oct 2024, Boise, Idaho, United States

Abstract:Due to its collaborative nature, Wikidata is known to have a complex taxonomy, with recurrent issues like the ambiguity between instances and classes, the inaccuracy of some taxonomic paths, the presence of cycles, and the high level of redundancy across classes. Manual efforts to clean up this taxonomy are time-consuming and prone to errors or subjective decisions. We present WiKC, a new version of Wikidata taxonomy cleaned automatically using a combination of Large Language Models (LLMs) and graph mining techniques. Operations on the taxonomy, such as cutting links or merging classes, are performed with the help of zero-shot prompting on an open-source LLM. The quality of the refined taxonomy is evaluated from both intrinsic and extrinsic perspectives, on a task of entity typing for the latter, showing the practical interest of WiKC.

[IR-5] RETAIN: Interactive Tool for Regression Testing Guided LLM Migration

Link: https://arxiv.org/abs/2409.03928
Authors: Tanay Dixit, Daniel Lee, Sally Fang, Sai Sree Harsha, Anirudh Sureshan, Akash Maharaj, Yunyao Li
Keywords: Large Language Models, Large Language, LLM Migrations, Language Models, increasingly integrated
Categories: Information Retrieval (cs.IR)
Comments: Preprint

Abstract:Large Language Models (LLMs) are increasingly integrated into diverse applications. The rapid evolution of LLMs presents opportunities for developers to enhance applications continuously. However, this constant adaptation can also lead to performance regressions during model migrations. While several interactive tools have been proposed to streamline the complexity of prompt engineering, few address the specific requirements of regression testing for LLM Migrations. To bridge this gap, we introduce RETAIN (REgression Testing guided LLM migrAtIoN), a tool designed explicitly for regression testing in LLM Migrations. RETAIN comprises two key components: an interactive interface tailored to regression testing needs during LLM migrations, and an error discovery module that facilitates understanding of differences in model behaviors. The error discovery module generates textual descriptions of various errors or differences between model outputs, providing actionable insights for prompt refinement. Our automatic evaluation and empirical user studies demonstrate that RETAIN, when compared to manual evaluation, enabled participants to identify twice as many errors, facilitated experimentation with 75% more prompts, and achieves 12% higher metric scores in a given time frame.

[IR-6] Understanding Fairness Metrics in Recommender Systems: A Healthcare Perspective

Link: https://arxiv.org/abs/2409.03893
Authors: Veronica Kecki, Alan Said
Keywords: affect human lives, directly affect human, systems directly affect, critical concern, human lives
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Accepted to the 18th ACM Conference on Recommender Systems

Abstract:Fairness in AI-driven decision-making systems has become a critical concern, especially when these systems directly affect human lives. This paper explores the public’s comprehension of fairness in healthcare recommendations. We conducted a survey where participants selected from four fairness metrics – Demographic Parity, Equal Accuracy, Equalized Odds, and Positive Predictive Value – across different healthcare scenarios to assess their understanding of these concepts. Our findings reveal that fairness is a complex and often misunderstood concept, with a generally low level of public understanding regarding fairness metrics in recommender systems. This study highlights the need for enhanced information and education on algorithmic fairness to support informed decision-making in using these systems. Furthermore, the results suggest that a one-size-fits-all approach to fairness may be insufficient, pointing to the importance of context-sensitive designs in developing equitable AI systems.
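
For readers unfamiliar with the four metrics named in this abstract, the sketch below computes the per-group quantities each one compares, using their standard textbook definitions. The function names and the toy records are assumptions and are not taken from the paper's survey.

```python
def _ratio(num, den):
    """Safe division: returns None when the denominator is zero."""
    return num / den if den else None


def group_rates(records):
    """Per-group rates behind four common fairness metrics.

    records: iterable of (group, y_true, y_pred) with binary labels.
    - Demographic Parity compares selection_rate across groups.
    - Equal Accuracy compares accuracy.
    - Equalized Odds compares both tpr and fpr.
    - Positive Predictive Value parity compares ppv.
    """
    counts = {}
    for group, y_true, y_pred in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred else "n")
        c[key] += 1
    rates = {}
    for group, c in counts.items():
        n = sum(c.values())
        rates[group] = {
            "selection_rate": _ratio(c["tp"] + c["fp"], n),
            "accuracy": _ratio(c["tp"] + c["tn"], n),
            "tpr": _ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": _ratio(c["fp"], c["fp"] + c["tn"]),
            "ppv": _ratio(c["tp"], c["tp"] + c["fp"]),
        }
    return rates


# Toy records (invented): (group, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 0),
]
for group, r in group_rates(records).items():
    print(group, r)
```

In this toy data both groups have the same accuracy, yet their selection rates, error rates, and PPV all differ, so a system can satisfy one metric while violating the others. That tension is one reason a single notion of fairness may not fit every scenario.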

[IR-7] It’s Not You, It’s Me: The Impact of Choice Models and Ranking Strategies on Gender Imbalance in Music Recommendation RECSYS2024

Link: https://arxiv.org/abs/2409.03781
Authors: Andres Ferraro, Michael D. Ekstrand, Christine Bauer
Keywords: mitigation approaches, approaches are needed, needed to ensure, recommender systems, music recommender systems
Categories: Information Retrieval (cs.IR)
Comments: 6 pages, 3 figures, conference short paper, to be published at RecSys 2024

Abstract:As recommender systems are prone to various biases, mitigation approaches are needed to ensure that recommendations are fair to various stakeholders. One particular concern in music recommendation is artist gender fairness. Recent work has shown that the gender imbalance in the sector translates to the output of music recommender systems, creating a feedback loop that can reinforce gender biases over time. In this work, we examine that feedback loop to study whether algorithmic strategies or user behavior are a greater contributor to ongoing improvement (or loss) in fairness as models are repeatedly re-trained on new user feedback data. We simulate user interaction and re-training to investigate the effects of ranking strategies and user choice models on gender fairness metrics. We find re-ranking strategies have a greater effect than user choice models on recommendation fairness over time.

[IR-8] VERA: Validation and Evaluation of Retrieval-Augmented Systems KDD2024

Link: https://arxiv.org/abs/2409.03759
Authors: Tianyu Ding, Adi Banerjee, Laurent Mombaerts, Yunhong Li, Tarik Borogovac, Juan Pablo De la Cruz Weinstein
Keywords: necessitates stringent protocols, RAG systems accuracy, ensure RAG systems, applications necessitates stringent, Retrieval-Augmented Generation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted in Workshop on Evaluation and Trustworthiness of Generative AI Models, KDD 2024

Abstract:The increasing use of Retrieval-Augmented Generation (RAG) systems in various applications necessitates stringent protocols to ensure RAG systems’ accuracy, safety, and alignment with user intentions. In this paper, we introduce VERA (Validation and Evaluation of Retrieval-Augmented Systems), a framework designed to enhance the transparency and reliability of outputs from large language models (LLMs) that utilize retrieved information. VERA improves the way we evaluate RAG systems in two important ways: (1) it introduces a cross-encoder based mechanism that encompasses a set of multidimensional metrics into a single comprehensive ranking score, addressing the challenge of prioritizing individual metrics, and (2) it employs Bootstrap statistics on LLM-based metrics across the document repository to establish confidence bounds, ensuring the repository’s topical coverage and improving the overall reliability of retrieval systems. Through several use cases, we demonstrate how VERA can strengthen decision-making processes and trust in AI applications. Our findings not only contribute to the theoretical understanding of LLM-based RAG evaluation metric but also promote the practical implementation of responsible AI systems, marking a significant advancement in the development of reliable and transparent generative AI technologies.
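
The second point, bootstrap statistics over per-document metric scores to obtain confidence bounds, can be illustrated with a generic percentile bootstrap. This is a sketch of the general technique only: the function name, the resample count, and the toy scores are assumptions, and VERA's actual procedure may differ.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    scores: per-document metric values (hypothetical LLM-judged scores).
    Returns (lower, upper) bounds on the mean.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return lower, upper


# Toy faithfulness-style scores for eight documents (invented values).
scores = [0.25, 0.5, 0.5, 0.75, 0.75, 0.75, 1.0, 1.0]
low, high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A narrow interval suggests the metric is stable across the repository, while a wide one signals that more documents or a less noisy metric are needed before trusting the aggregate score.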

Attachment Download

Click to download the full list of today’s papers