本篇博文主要展示 2024-09-17 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上10:30左右定时自动更新。

友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天10:30左右邮件定时自动发送。

目录

概览 (2024-09-17)

今日共更新799篇论文,其中:

  • 自然语言处理103篇(Computation and Language (cs.CL))
  • 人工智能201篇(Artificial Intelligence (cs.AI))
  • 计算机视觉185篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习209篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
[NLP-0] RetrievalAttention:通过向量检索加速长上下文LLM推理

链接: https://arxiv.org/abs/2409.10516
作者: Di Liu,Meng Chen,Baotong Lu,Huiqiang Jiang,Zhenhua Han,Qianxi Zhang,Qi Chen,Chengruidong Zhang,Bailu Ding,Kai Zhang,Chen Chen,Fan Yang,Yuqing Yang,Lili Qiu
关键词-EN: Transformer-based large Language, large Language Models, Transformer-based large, large Language, Language Models
关键词-ZH: 基于转换器的大型语言、大型语言模型、基于转换器的大型、大型语言、语言模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:Transformer-based large Language Models (LLMs) become increasingly important in various domains. However, the quadratic time complexity of attention operation poses a significant challenge for scaling to longer contexts due to the extremely high inference latency and GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to accelerate attention computation. To leverage the dynamic sparse property of attention, RetrievalAttention builds approximate nearest neighbor search (ANNS) indexes upon KV vectors in CPU memory and retrieves the most relevant ones via vector search during generation. Due to the out-of-distribution (OOD) between query vectors and key vectors, off-the-shelf ANNS indexes still need to scan O(N) (usually 30% of all keys) data for accurate retrieval, which fails to exploit the high sparsity. RetrievalAttention first identifies the OOD challenge of ANNS-based attention, and addresses it via an attention-aware vector search algorithm that can adapt to queries and only access 1–3% of data, thus achieving a sub-linear time complexity. RetrievalAttention greatly reduces the inference cost of long-context LLM with much lower GPU memory requirements while maintaining the model accuracy. Especially, RetrievalAttention only needs 16GB GPU memory for serving 128K tokens in LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds on a single NVIDIA RTX4090 (24GB).
摘要:基于Transformer的大型语言模型(LLM)在各个领域中发挥着越来越重要的作用。然而,注意力操作的二次时间复杂度给扩展到更长上下文带来了巨大挑战,因为缓存键值(KV)向量会造成极高的推理延迟和GPU内存消耗。本文提出了RetrievalAttention,一种无需训练即可加速注意力计算的方法。为了利用注意力的动态稀疏性,RetrievalAttention在CPU内存中的KV向量上建立近似最近邻搜索(ANNS)索引,并在生成过程中通过向量检索找出最相关的KV向量。由于查询向量与键向量之间存在分布外(OOD)问题,现成的ANNS索引仍需扫描O(N)量级的数据(通常为全部键的30%)才能实现准确检索,无法充分利用这种高度稀疏性。RetrievalAttention首先指出了基于ANNS的注意力所面临的OOD挑战,并通过一种注意力感知的向量搜索算法加以解决,该算法能够自适应查询且只需访问1%-3%的数据,从而实现亚线性时间复杂度。RetrievalAttention在保持模型精度的同时大幅降低了长上下文LLM的推理开销,并显著降低了GPU内存需求。特别地,RetrievalAttention仅需16GB GPU内存即可为8B参数的LLM提供128K令牌的服务,并能在单张NVIDIA RTX4090(24GB)上以0.188秒生成一个令牌。
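
以下给出一个极简的示意代码,帮助理解“只对检索到的最相关KV向量计算注意力”这一思路:这里用暴力top-k点积检索代替论文中的ANNS索引与OOD感知搜索,数据与参数均为假设,并非论文实现。

```python
import numpy as np

def topk_retrieval_attention(q, K, V, k=128):
    """Approximate attention for one query by attending only to the k cached
    keys with the highest similarity (a brute-force stand-in for an ANNS lookup)."""
    scores = K @ q / np.sqrt(K.shape[1])          # similarity to every cached key
    idx = np.argpartition(scores, -k)[-k:]        # indices of the k most relevant keys
    w = np.exp(scores[idx] - scores[idx].max())   # softmax over the retrieved subset only
    w /= w.sum()
    return w @ V[idx]                             # weighted sum of the retrieved values

# toy usage: 10k cached tokens, 64-dim head; only ~1% of the cache is touched
rng = np.random.default_rng(0)
K, V = rng.normal(size=(10_000, 64)), rng.normal(size=(10_000, 64))
out = topk_retrieval_attention(rng.normal(size=64), K, V, k=128)
```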

[NLP-1] DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction
[NLP-1] DILA:用于高维多标签医疗编码预测中机制可解释性的字典标签注意力

链接: https://arxiv.org/abs/2409.10504
作者: John Wu,David Wu,Jimeng Sun
关键词-EN: Predicting high-dimensional, requires both accuracy, high-dimensional or extreme, Predicting, DIctionary Label Attention
关键词-ZH: 预测多维,需要准确性、多维或极端,预测,指导标签注意力
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Predicting high-dimensional or extreme multilabels, such as in medical coding, requires both accuracy and interpretability. Existing works often rely on local interpretability methods, failing to provide comprehensive explanations of the overall mechanism behind each label prediction within a multilabel set. We propose a mechanistic interpretability module called DIctionary Label Attention (DILA) that disentangles uninterpretable dense embeddings into a sparse embedding space, where each nonzero element (a dictionary feature) represents a globally learned medical concept. Through human evaluations, we show that our sparse embeddings are more human understandable than their dense counterparts by at least 50 percent. Our automated dictionary feature identification pipeline, leveraging large language models (LLMs), uncovers thousands of learned medical concepts by examining and summarizing the highest activating tokens for each dictionary feature. We represent the relationships between dictionary features and medical codes through a sparse interpretable matrix, enhancing the mechanistic and global understanding of the model’s predictions while maintaining competitive performance and scalability without extensive human annotation.
摘要:预测高维或极端多标签(例如医疗编码)既需要准确性,也需要可解释性。现有工作往往依赖局部可解释性方法,无法全面解释多标签集合中每个标签预测背后的整体机制。我们提出了一种称为字典标签注意力(DILA)的机制可解释性模块,它将不可解释的稠密嵌入解耦到稀疏嵌入空间中,其中每个非零元素(即一个字典特征)代表一个全局学习到的医学概念。通过人工评估,我们表明稀疏嵌入的人类可理解性比稠密嵌入至少高出50%。我们的自动化字典特征识别流水线利用大型语言模型(LLM),通过检查并总结每个字典特征激活最强的token,发现了数千个学习到的医学概念。我们通过一个稀疏的可解释矩阵来表示字典特征与医疗编码之间的关系,在无需大量人工标注的情况下保持有竞争力的性能和可扩展性,同时增强了对模型预测的机制性和全局性理解。

[NLP-2] Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles
[NLP-2] 因果语言建模可以在逻辑谜题上激发搜索与推理能力

链接: https://arxiv.org/abs/2409.10502
作者: Kulin Shah,Nishanth Dikkala,Xin Wang,Rina Panigrahy
关键词-EN: Large Language Models, Causal language modeling, yielded remarkable capabilities, Large Language, Causal language
关键词-ZH: 大型语言模型,因果语言建模,产生了非凡的能力,大型语言,因果语言
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 26 pages

点击查看摘要

Abstract:Causal language modeling using the Transformer architecture has yielded remarkable capabilities in Large Language Models (LLMs) over the last few years. However, the extent to which fundamental search and reasoning capabilities emerged within LLMs remains a topic of ongoing debate. In this work, we study if causal language modeling can learn a complex task such as solving Sudoku puzzles. To solve a Sudoku, the model is first required to search over all empty cells of the puzzle to decide on a cell to fill and then apply an appropriate strategy to fill the decided cell. Sometimes, the application of a strategy only results in thinning down the possible values in a cell rather than concluding the exact value of the cell. In such cases, multiple strategies are applied one after the other to fill a single cell. We observe that Transformer models trained on this synthetic task can indeed learn to solve Sudokus (our model solves 94.21% of the puzzles fully correctly) when trained on a logical sequence of steps taken by a solver. We find that training Transformers with the logical sequence of steps is necessary and without such training, they fail to learn Sudoku. We also extend our analysis to Zebra puzzles (known as Einstein puzzles) and show that the model solves 92.04 % of the puzzles fully correctly. In addition, we study the internal representations of the trained Transformer and find that through linear probing, we can decode information about the set of possible values in any given cell from them, pointing to the presence of a strong reasoning engine implicit in the Transformer weights.
摘要:在过去几年中,使用Transformer架构的因果语言建模使大型语言模型(LLM)展现出卓越的能力。然而,LLM内部究竟在多大程度上涌现出了基本的搜索与推理能力,仍是一个持续争论的话题。在这项工作中,我们研究因果语言建模能否学会求解数独谜题这样的复杂任务。要解出一个数独,模型首先需要在谜题的所有空格中搜索,决定要填写哪个格子,然后应用合适的策略来填写该格子。有时,应用一种策略只能缩小某个格子的候选值范围,而无法确定其准确值;在这种情况下,需要依次应用多种策略才能填好一个格子。我们观察到,当使用求解器所采取步骤的逻辑序列进行训练时,在该合成任务上训练的Transformer模型确实能够学会解数独(我们的模型完全正确地解出了94.21%的谜题)。我们发现,使用逻辑步骤序列进行训练是必要的;若没有这样的训练,模型便无法学会数独。我们还将分析扩展到斑马谜题(即爱因斯坦谜题),结果表明模型完全正确地解出了92.04%的谜题。此外,我们研究了训练好的Transformer的内部表示,发现通过线性探测可以从中解码出任意给定格子的候选值集合信息,这表明Transformer权重中隐含着强大的推理引擎。

[NLP-3] Incorporating Classifier-Free Guidance in Diffusion Model-Based Recommendation
[NLP-3] 在基于扩散模型的推荐中引入无分类器指导

链接: https://arxiv.org/abs/2409.10494
作者: Noah Buchanan,Susan Gauch,Quan Mai
关键词-EN: Generative Adversarial Networks, recommender system, diffusion-based recommender system, recommender, classifier-free guidance
关键词-ZH: 生成对抗网络、推荐系统、基于扩散的推荐系统、推荐器、无分类器指导
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:This paper presents a diffusion-based recommender system that incorporates classifier-free guidance. Most current recommender systems provide recommendations using conventional methods such as collaborative or content-based filtering. Diffusion is a new approach to generative AI that improves on previous generative AI approaches such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). We incorporate diffusion in a recommender system that mirrors the sequence users take when browsing and rating items. Although a few current recommender systems incorporate diffusion, they do not incorporate classifier-free guidance, a new innovation in diffusion models as a whole. In this paper, we present a diffusion recommender system that augments the underlying recommender system model for improved performance and also incorporates classifier-free guidance. Our findings show improvements over state-of-the-art recommender systems for most metrics for several recommendation tasks on a variety of datasets. In particular, our approach demonstrates the potential to provide better recommendations when data is sparse.
摘要:本文提出了一种结合无分类器指导的基于扩散的推荐系统。大多数当前的推荐系统使用诸如协作或基于内容的过滤等传统方法来提供推荐。扩散是一种新的生成性人工智能方法,它改进了以往的生成性人工智能方法,如变分自动编码器(VAEs)和生成性对抗网络(GANS)。我们在一个推荐系统中加入了扩散,该系统反映了用户在浏览和评级项目时采取的顺序。尽管目前的一些推荐系统包含了扩散,但它们没有纳入无分类器指导,这是扩散模型中的一项新创新。在本文中,我们提出了一个扩散推荐系统,它增强了底层推荐系统的模型以提高性能,并结合了无分类器的指导。我们的发现表明,在各种数据集上的几个推荐任务的大多数指标上,我们的推荐系统都比最先进的推荐系统有所改进。特别是,我们的方法展示了在数据稀疏的情况下提供更好建议的潜力。
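
下面是无分类器指导(classifier-free guidance)核心混合公式的一个极简示意:对条件与无条件预测做加权外推。其中的 model、user_cond 等均为假设的占位符,并非论文的具体实现。

```python
import torch

def cfg_step(model, x_t, t, user_cond, guidance_scale=2.0):
    """Classifier-free guidance: blend the conditional and unconditional
    predictions (standard CFG rule; `model` and `user_cond` are placeholders)."""
    pred_cond = model(x_t, t, cond=user_cond)   # conditioned on the user's interaction history
    pred_uncond = model(x_t, t, cond=None)      # condition dropped
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```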

[NLP-4] Schrodinger's Memory: Large Language Models
[NLP-4] 薛定谔的记忆:大型语言模型

链接: https://arxiv.org/abs/2409.10482
作者: Wei Wang,Qing Li
关键词-EN: LLMs’ functionality, foundation of LLMs’, past research, research has lacked, lacked an in-depth
关键词-ZH: LLM的功能,LLM的基础,过去的研究,研究缺乏,缺乏深入
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Memory is the foundation of LLMs’ functionality, yet past research has lacked an in-depth exploration of their memory capabilities and underlying theory. In this paper, we apply UAT theory to explain the memory mechanism of LLMs and propose a new approach for evaluating LLM performance by comparing the memory capacities of different models. Through extensive experiments, we validate our theory and the memory abilities of LLMs. Finally, we compare the capabilities of the human brain and LLMs, highlighting both their similarities and differences in terms of working mechanisms.
摘要:记忆是LLM功能的基础,但过去的研究缺乏对其记忆能力和基础理论的深入探索。本文应用UAT理论来解释LLM的记忆机制,并提出一种通过比较不同模型的记忆容量来评估LLM性能的新方法。通过大量的实验,我们验证了我们的理论和LLM的记忆能力。最后,我们比较了人类大脑和LLM的能力,强调了它们在工作机制方面的相似之处和差异。

[NLP-5] A Knowledge-Enhanced Disease Diagnosis Method Based on Prompt Learning and BERT Integration
[NLP-5] 基于提示学习和BERT集成的知识增强型疾病诊断方法

链接: https://arxiv.org/abs/2409.10403
作者: Zhang Zheng
关键词-EN: prompt learning framework, diagnosis method based, knowledge-enhanced disease diagnosis, learning framework, paper proposes
关键词-ZH: 提示学习框架,基于诊断方法,知识增强疾病诊断,学习框架,论文提出
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Knowledge Enhancement,Disease Diagnosis,Prompt Learning,BERT,Knowledge Graph

点击查看摘要

Abstract:This paper proposes a knowledge-enhanced disease diagnosis method based on a prompt learning framework. The method retrieves structured knowledge from external knowledge graphs related to clinical cases, encodes it, and injects it into the prompt templates to enhance the language model’s understanding and reasoning capabilities for the task.We conducted experiments on three public datasets: CHIP-CTC, IMCS-V2-NER, and KUAKE-QTR. The results show that the proposed method significantly outperforms existing models across multiple evaluation metrics, with an F1 score improvement of 2.4% on the CHIP-CTC dataset, 3.1% on the IMCS-V2-NER dataset,and 4.2% on the KUAKE-QTR dataset. Additionally,ablation studies confirmed the critical role of the knowledge injection module,as the removal of this module resulted in a significant drop in F1 score. The experimental results demonstrate that the proposed method not only effectively improves the accuracy of disease diagnosis but also enhances the interpretability of the predictions, providing more reliable support and evidence for clinical diagnosis.
摘要:本文提出了一种基于提示学习框架的知识增强型疾病诊断方法。该方法从与临床病例相关的外部知识图谱中检索结构化知识,对其进行编码并注入提示模板,以增强语言模型对该任务的理解和推理能力。我们在三个公开数据集上进行了实验:CHIP-CTC、IMCS-V2-NER和KUAKE-QTR。结果表明,该方法在多个评价指标上显著优于现有模型,F1得分在CHIP-CTC数据集上提高了2.4%,在IMCS-V2-NER数据集上提高了3.1%,在KUAKE-QTR数据集上提高了4.2%。此外,消融研究证实了知识注入模块的关键作用,移除该模块会导致F1得分显著下降。实验结果表明,该方法不仅有效提高了疾病诊断的准确性,还增强了预测的可解释性,为临床诊断提供了更可靠的支持和依据。
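
下面是“将知识图谱三元组注入提示模板”这一思路的简化示意;模板措辞、字段名和示例三元组均为假设,并非论文使用的模板。

```python
def build_prompt(case_text, kg_facts):
    """Fill a prompt template with structured facts retrieved from an external
    knowledge graph (template wording and fields are illustrative only)."""
    knowledge = ";".join(f"{h}-{r}-{t}" for h, r, t in kg_facts)
    return (
        f"已知医学知识:{knowledge}\n"
        f"临床描述:{case_text}\n"
        f"该病例最可能的类别是:[MASK]"
    )

print(build_prompt("患者发热三天,伴咳嗽。", [("发热", "常见于", "上呼吸道感染")]))
```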

[NLP-6] Instigating Cooperation among LLM Agents Using Adaptive Information Modulation
[NLP-6] 使用自适应信息调制激发LLM代理之间的合作

链接: https://arxiv.org/abs/2409.10372
作者: Qiliang Chen,Alireza(Sepehr)Ilami,Nunzio Lore,Babak Heydari
关键词-EN: combining LLM agents, framework combining LLM, evolving strategic interactions, combining LLM, human strategic behavior
关键词-ZH: 结合LLM代理、结合LLM的框架、不断发展的战略互动、结合LLM、人类战略行为
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:This paper introduces a novel framework combining LLM agents as proxies for human strategic behavior with reinforcement learning (RL) to engage these agents in evolving strategic interactions within team environments. Our approach extends traditional agent-based simulations by using strategic LLM agents (SLA) and introducing dynamic and adaptive governance through a pro-social promoting RL agent (PPA) that modulates information access across agents in a network, optimizing social welfare and promoting pro-social behavior. Through validation in iterative games, including the prisoner dilemma, we demonstrate that SLA agents exhibit nuanced strategic adaptations. The PPA agent effectively learns to adjust information transparency, resulting in enhanced cooperation rates. This framework offers significant insights into AI-mediated social dynamics, contributing to the deployment of AI in real-world team settings.
摘要:本文介绍了一种新颖的框架,将LLM代理作为人类战略行为的代理与强化学习(RL)相结合,以使这些代理参与团队环境内不断发展的战略互动。我们的方法通过使用战略LLM代理(SLA)并通过亲社会促进RL代理(PPA)引入动态和适应性治理来扩展传统的基于代理的模拟,该代理调节网络中代理之间的信息访问,优化社会福利并促进亲社会行为。通过迭代游戏(包括囚徒困境)中的验证,我们证明SLA代理表现出细致入微的战略适应。PPA代理有效地学会调整信息透明度,从而提高合作率。该框架为人工智能介导的社会动态提供了重要见解,有助于在现实世界的团队环境中部署人工智能。

[NLP-7] 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?
[NLP-7] 2D还是非2D:手势表示的维度如何影响3D协同语音手势生成?

链接: https://arxiv.org/abs/2409.10357
作者: Téo Guichoux,Laure Soulier,Nicolas Obin,Catherine Pelachaud
关键词-EN: Embodied Conversational Agents, fundamental for communication, synchronous co-speech gestures, Co-speech gestures, Conversational Agents
关键词-ZH: 协作对话代理,通信的基础,同步共语音手势,共语音手势,对话代理
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. “In-the-wild” datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with speech. Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, it is important to note that the 3D poses estimated from the 2D extracted poses are, in essence, approximations of the ground-truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions - a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D. We perform an objective evaluation using widely used metrics in the gesture generation field as well as a user study to qualitatively evaluate the different approaches.
摘要:协同语音手势是人类交流的基础。近来深度学习技术的发展促进了为具身会话代理(Embodied Conversational Agents)生成逼真且与语音同步的协同手势。“野外(in-the-wild)”数据集借助人体姿态检测技术从YouTube等平台聚合视频内容,提供与语音对齐的2D骨骼序列,从而给出了一种可行的解决方案。与此同时,姿态提升(lifting)模型的发展使得这些2D序列可以被转换为3D手势数据库。然而需要注意的是,由2D提取姿态估计得到的3D姿态本质上只是对真值(ground-truth)的近似,而真值本身仍处于2D域中。这一差别引出了一个问题:手势表示的维度会如何影响生成动作的质量?据我们所知,这一问题在很大程度上仍未被探索。我们的研究考察了使用2D或3D关节坐标作为训练数据对语音到手势深度生成模型性能的影响。我们使用提升模型将生成的2D姿态序列转换为3D,并评估直接在3D中生成的手势与先在2D中生成再转换为3D的手势相比表现如何。我们使用手势生成领域广泛采用的指标进行了客观评估,并通过用户研究对不同方法进行了定性评估。

[NLP-8] Detecting Sexism in German Online Newspaper Comments with Open-Source Text Embeddings (Team GDA GermEval2024 Shared Task 1: GerMS-Detect Subtasks 1 and 2 Closed Track)
[NLP-8] 使用开源文本嵌入检测德国在线报纸评论中的性别歧视(GDA GermEval 2024团队共享任务1:GerMS-Detect子任务1和2封闭轨道)

链接: https://arxiv.org/abs/2409.10341
作者: Florian Bremm,Patrick Gustav Blaneck,Tobias Bornheim,Niklas Grieger,Stephan Bialonski
关键词-EN: complicating moderation efforts, online media comments, German-language online comments, manifests subtly, complicating moderation
关键词-ZH: 使审核工作变得复杂、在线媒体评论、德语在线评论都微妙地表现出来,使审核变得复杂
类目: Computation and Language (cs.CL)
备注: 6 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Sexism in online media comments is a pervasive challenge that often manifests subtly, complicating moderation efforts as interpretations of what constitutes sexism can vary among individuals. We study monolingual and multilingual open-source text embeddings to reliably detect sexism and misogyny in German-language online comments from an Austrian newspaper. We observed classifiers trained on text embeddings to mimic closely the individual judgements of human annotators. Our method showed robust performance in the GermEval 2024 GerMS-Detect Subtask 1 challenge, achieving an average macro F1 score of 0.597 (4th place, as reported on Codabench). It also accurately predicted the distribution of human annotations in GerMS-Detect Subtask 2, with an average Jensen-Shannon distance of 0.301 (2nd place). The computational efficiency of our approach suggests potential for scalable applications across various languages and linguistic contexts.
摘要:网络媒体评论中的性别歧视是一个普遍存在的挑战,它往往以微妙的方式表现出来,而且由于不同个体对何为性别歧视的理解不尽相同,这使内容审核工作更加复杂。我们研究单语和多语言的开源文本嵌入,以可靠地检测一家奥地利报纸德语在线评论中的性别歧视与厌女内容。我们观察到,基于文本嵌入训练的分类器能够较好地模仿人类标注者的个体判断。我们的方法在GermEval 2024 GerMS-Detect子任务1挑战中表现稳健,平均宏F1得分为0.597(据Codabench报告排名第4)。它还准确预测了GerMS-Detect子任务2中人类标注的分布,平均Jensen-Shannon距离为0.301(排名第2)。我们方法的计算效率表明,它有潜力扩展应用到多种语言和语言环境中。
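
下面是“冻结的开源文本嵌入 + 线性分类器”这一通用流程的最小示意;嵌入模型名称与示例数据均为假设,并非该团队实际使用的配置。

```python
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

# Encode comments with a frozen open-source embedding model, then fit a linear
# classifier on the embeddings (model name and toy data are placeholders).
encoder = SentenceTransformer("intfloat/multilingual-e5-large")
train_texts = ["Beispielkommentar A ...", "Beispielkommentar B ..."]  # placeholder comments
train_labels = [1, 0]                                                 # 1 = sexist, 0 = not
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(encoder.encode(train_texts), train_labels)
print(clf.predict(encoder.encode(["Neuer Kommentar ..."])))
```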

[NLP-9] The 20 questions game to distinguish large language models
[NLP-9] 20个问题游戏来区分大型语言模型

链接: https://arxiv.org/abs/2409.10338
作者: Gurvan Richardeau,Erwan Le Merrer,Camilla Penzo,Gilles Tredan
关键词-EN: large language models, black-box context, large language, questions game, questions
关键词-ZH: 大型语言模型、黑匣子上下文、大型语言、问题游戏、问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In a parallel with the 20 questions game, we present a method to determine whether two large language models (LLMs), placed in a black-box context, are the same or not. The goal is to use a small set of (benign) binary questions, typically under 20. We formalize the problem and first establish a baseline using a random selection of questions from known benchmark datasets, achieving an accuracy of nearly 100% within 20 questions. After showing optimal bounds for this problem, we introduce two effective questioning heuristics able to discriminate 22 LLMs by using half as many questions for the same task. These methods offer significant advantages in terms of stealth and are thus of interest to auditors or copyright owners facing suspicions of model leaks.
摘要:类比于“20个问题”游戏,我们提出了一种方法,用于判断处于黑盒环境下的两个大型语言模型(LLM)是否为同一模型。目标是只使用少量(无害的)二元问题,通常不超过20个。我们对该问题进行了形式化,并首先以从已知基准数据集中随机选取问题的方式建立基线,在20个问题内达到接近100%的准确率。在给出该问题的最优界之后,我们引入了两种有效的提问启发式方法,在同一任务上只需一半数量的问题即可区分22个LLM。这些方法在隐蔽性方面具有显著优势,因此对怀疑模型被泄露的审计人员或版权所有者很有价值。
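
下面用几行代码示意“用固定的一组是/否问题为黑盒模型生成答案指纹”的基本思路;ask 为假设的模型调用封装,示例中的“模型”只是确定性的占位函数。

```python
def fingerprint(ask, questions):
    """Answer pattern of a black-box model over a fixed set of yes/no questions;
    `ask` is a hypothetical callable wrapping one model endpoint."""
    return tuple(bool(ask(q)) for q in questions)

def same_model(ask_a, ask_b, questions):
    # Two endpoints are judged identical iff their answer patterns match exactly.
    return fingerprint(ask_a, questions) == fingerprint(ask_b, questions)

# toy usage with deterministic stand-in "models"
questions = ["Is 7 a prime number?", "Is the Earth flat?", "Is 0 an even number?"]
model_a = lambda q: "prime" in q or "even" in q
model_b = lambda q: "prime" in q
print(same_model(model_a, model_b, questions))   # False: they disagree on one question
```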

[NLP-10] MGSA: Multi-granularity Graph Structure Attention for Knowledge Graph-to-Text Generation
[NLP-10] MGSA:面向知识图到文本生成的多粒度图结构注意力

链接: https://arxiv.org/abs/2409.10294
作者: Shanshan Wang,Chun Zhang,Ning Zhang
关键词-EN: convert structured knowledge, structured knowledge graphs, Generation task aims, human-readable natural language, knowledge graph structure
关键词-ZH: 转换结构化知识、结构化知识图、生成任务目标、人类可读自然语言、知识图结构
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Knowledge Graph-to-Text Generation task aims to convert structured knowledge graphs into coherent and human-readable natural language text. Recent efforts in this field have focused on enhancing pre-trained language models (PLMs) by incorporating graph structure information to capture the intricate structure details of knowledge graphs. However, most of these approaches tend to capture only single-granularity structure information, concentrating either on the relationships between entities within the original graph or on the relationships between words within the same entity or across different entities. This narrow focus results in a significant limitation: models that concentrate solely on entity-level structure fail to capture the nuanced semantic relationships between words, while those that focus only on word-level structure overlook the broader relationships between original entire entities. To overcome these limitations, this paper introduces the Multi-granularity Graph Structure Attention (MGSA), which is based on PLMs. The encoder of the model architecture features an entity-level structure encoding module, a word-level structure encoding module, and an aggregation module that synthesizes information from both structure. This multi-granularity structure encoding approach allows the model to simultaneously capture both entity-level and word-level structure information, providing a more comprehensive understanding of the knowledge graph’s structure information, thereby significantly improving the quality of the generated text. We conducted extensive evaluations of the MGSA model using two widely recognized KG-to-Text Generation benchmark datasets, WebNLG and EventNarrative, where it consistently outperformed models that rely solely on single-granularity structure information, demonstrating the effectiveness of our approach.
摘要:知识图到文本的生成任务旨在将结构化的知识图转换为连贯的、人类可读的自然语言文本。最近在这一领域的努力集中在通过结合图结构信息来捕捉知识图的复杂结构细节来增强预训练语言模型(PLM)。然而,这些方法中的大多数倾向于只捕获单粒度结构信息,集中在原始图形中的实体之间的关系上,或者集中在同一实体内或不同实体之间的单词之间的关系上。这种狭隘的关注导致了一个显著的限制:只关注实体级结构的模型无法捕捉词之间细微的语义关系,而只关注词级结构的模型忽略了原始整个实体之间的更广泛关系。为了克服这些局限性,本文引入了基于PLM的多粒度图结构注意力(MGSA)。该模型体系结构的编码器具有实体级结构编码模块、词级结构编码模块和从这两种结构合成信息的聚合模块。这种多粒度结构编码方法允许模型同时捕获实体级和词级结构信息,提供了对知识图结构信息的更全面的理解,从而显著提高了生成文本的质量。我们使用两个公认的KG-to-Text生成基准数据集WebNLG和EventNarrative对MGSA模型进行了广泛的评估,在这些基准数据集中,它的性能始终优于仅依赖单粒度结构信息的模型,从而证明了我们方法的有效性。

[NLP-11] ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework
[NLP-11] ReflectDiffu:通过RL-扩散框架在情绪-意图传染与模仿之间进行反思,以生成共情回复

链接: https://arxiv.org/abs/2409.10289
作者: Jiahao Yuan,Zixiang Di,Zhiqing Cui,Guisong Yang,Usman Naseem
关键词-EN: foster meaningful interactions, Empathetic response generation, meaningful interactions, response generation necessitates, necessitates the integration
关键词-ZH: 促进有意义的互动,同理心的反应生成,有意义的互动,反应生成是必要的,需要整合
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiffu, a lightweight and comprehensive framework for empathetic response generation. This framework incorporates emotion contagion to augment emotional expressiveness and employs an emotion-reasoning mask to pinpoint critical emotional elements. Additionally, it integrates intent mimicry within reinforcement learning for refinement during diffusion. By harnessing an intent twice reflect the mechanism of Exploring-Sampling-Correcting, ReflectDiffu adeptly translates emotional decision-making into precise intent actions, thereby addressing empathetic response misalignments stemming from emotional misrecognition. Through reflection, the framework maps emotional states to intents, markedly enhancing both response empathy and flexibility. Comprehensive experiments reveal that ReflectDiffu outperforms existing models regarding relevance, controllability, and informativeness, achieving state-of-the-art results in both automatic and human evaluations.
摘要:共情回复的生成需要整合情绪与意图的动态变化,以促成有意义的互动。现有研究要么忽视了情绪与意图之间复杂的相互作用,导致共情的可控性欠佳,要么依赖大型语言模型(LLM),带来巨大的计算开销。本文提出ReflectDiffu,一个轻量且完备的共情回复生成框架。该框架结合情绪传染来增强情绪表达力,并使用情绪推理掩码来精确定位关键情绪元素。此外,它在强化学习中融入意图模仿,以便在扩散过程中进行精炼。通过利用体现“探索-采样-纠正”机制的意图两次反思,ReflectDiffu巧妙地将情绪决策转化为精确的意图行动,从而纠正因情绪误识别导致的共情回复偏差。通过这种反思,框架将情绪状态映射到意图,显著增强了回复的共情程度和灵活性。综合实验表明,ReflectDiffu在相关性、可控性和信息量方面均优于现有模型,在自动评估和人工评估中都取得了最先进的结果。

[NLP-12] From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs NEURIPS2024
[NLP-12] 从文本到Emoji:PEFT驱动的人格操纵如何释放LLM中的Emoji潜力

链接: https://arxiv.org/abs/2409.10245
作者: Navya Jain,Zekun Wu,Cristian Munoz,Airlie Hilliard,Adriano Koshiyama,Emre Kazim,Philip Treleaven
关键词-EN: Model Editor Networks, In-Context Knowledge Editing, continues to grow, area of research, demand for human-like
关键词-ZH: 模型编辑器网络、上下文知识编辑、持续发展、研究领域、对类人的需求
类目: Computation and Language (cs.CL)
备注: Submitted to NeurIPS 2024 Workshop on Behavioral Machine Learning

点击查看摘要

Abstract:As the demand for human-like interactions with LLMs continues to grow, so does the interest in manipulating their personality traits, which has emerged as a key area of research. Methods like prompt-based In-Context Knowledge Editing (IKE) and gradient-based Model Editor Networks (MEND) have been explored but show irregularity and variability. IKE depends on the prompt, leading to variability and sensitivity, while MEND yields inconsistent and gibberish outputs. To address this, we employed Opinion QA Based Parameter-Efficient Fine-Tuning (PEFT), specifically Quantized Low-Rank Adaptation (QLORA), to manipulate the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. After PEFT, models such as Mistral-7B-Instruct and Llama-2-7B-chat began generating emojis, despite their absence in the PEFT data. For instance, Llama-2-7B-chat generated emojis in 99.5% of extraversion-related test instances, while Mistral-8B-Instruct did so in 92.5% of openness-related test instances. Explainability analysis indicated that the LLMs used emojis intentionally to express these traits. This paper provides a number of novel contributions. First, introducing an Opinion QA dataset for PEFT-driven personality manipulation; second, developing metric models to benchmark LLM personality traits; third, demonstrating PEFT’s superiority over IKE in personality manipulation; and finally, analyzing and validating emoji usage through explainability methods such as mechanistic interpretability and in-context learning explainability methods.
摘要:随着人们对与LLM进行类人交互的需求不断增长,操控其人格特质的兴趣也随之增加,并已成为一个重要的研究领域。基于提示的上下文知识编辑(IKE)和基于梯度的模型编辑网络(MEND)等方法已被探索,但表现出不规则性和不稳定性:IKE依赖于提示,导致结果波动且敏感,而MEND则产生不一致甚至乱码的输出。为了解决这一问题,我们采用基于观点问答(Opinion QA)的参数高效微调(PEFT),具体为量化低秩适应(QLoRA),来操控大五人格特质:开放性、尽责性、外向性、宜人性和神经质。经过PEFT之后,Mistral-7B-Instruct和Llama-2-7B-chat等模型开始生成表情符号,尽管PEFT数据中并不包含表情符号。例如,Llama-2-7B-chat在99.5%的外向性相关测试实例中生成了表情符号,而Mistral-8B-Instruct在92.5%的开放性相关测试实例中生成了表情符号。可解释性分析表明,LLM是有意使用表情符号来表达这些特质的。本文有以下几项新贡献:第一,提出了一个用于PEFT驱动人格操控的观点问答数据集;第二,开发了用于衡量LLM人格特质的度量模型;第三,证明了PEFT在人格操控上优于IKE;最后,通过机制可解释性和上下文学习可解释性等方法分析并验证了表情符号的使用。
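
下面是QLoRA式参数高效微调的一个最小示意:以4-bit量化加载基座模型并挂载低秩适配器。模型名称与超参数均为示意性假设,并非论文报告的设置。

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit and attach LoRA adapters; only the adapters are trained.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",      # assumed checkpoint, not necessarily the paper's
    quantization_config=bnb, device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()             # a small fraction of the full parameter count
```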

[NLP-13] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
[NLP-13] 拟合与剪枝:面向多模态大型语言模型的快速免训练视觉令牌剪枝

链接: https://arxiv.org/abs/2409.10197
作者: Weihao Ye,Qiong Wu,Wenhao Lin,Yiyi Zhou
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, exhibits obvious redundancy, progress in Multimodal
关键词-ZH: 大型语言模型,多模式大型语言,语言模型,表现出明显的冗余,多模式的进步
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent progress in Multimodal Large Language Models(MLLMs) often use large image tokens to compensate the visual shortcoming of MLLMs, which not only exhibits obvious redundancy but also greatly exacerbates the already high computation. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens still remains a challenge. In this paper, we propose a novel and training-free approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune considers token pruning as a statistical problem of MLLM and its objective is to find out an optimal pruning scheme that can minimize the divergence of the attention distributions before and after pruning. In practice, FitPrune can be quickly accomplished based on the attention statistics from a small batch of inference data, avoiding the expensive trials of MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that our FitPrune can not only reduce the computational complexity to a large extent, while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at this https URL.
摘要:多模态大型语言模型(MLLM)的最新进展往往使用大量图像令牌来弥补其视觉上的不足,这不仅带来明显的冗余,还进一步加剧了本已很高的计算量。令牌剪枝是加速MLLM的有效手段,但何时以及如何丢弃令牌仍是一个挑战。本文提出了一种新颖且无需训练的MLLM视觉令牌剪枝方法FitPrune,它能根据预先设定的预算快速为MLLM生成完整的剪枝方案。具体而言,FitPrune将令牌剪枝视为MLLM的一个统计问题,其目标是找到一种最优剪枝方案,使剪枝前后注意力分布的散度最小。在实践中,FitPrune只需基于一小批推理数据的注意力统计量即可快速完成,避免了对MLLM进行代价高昂的反复试验。根据剪枝方案,MLLM可以在推理时直接去除不同样本中的冗余视觉令牌。为了验证FitPrune,我们将其应用于一组最新的MLLM,包括LLaVA-1.5、LLaVA-HR和LLaVA-NEXT,并在一系列基准上进行了大量实验。实验结果表明,FitPrune在保持高性能的同时能大幅降低计算复杂度,例如在LLaVA-NEXT上减少54.9%的FLOPs,而准确率仅下降0.5%。值得注意的是,剪枝方案可在约5分钟内得到。我们的代码可在此 https URL 获取。
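
下面用一个简化示意说明“基于注意力统计决定保留哪些视觉令牌”的思路:按每个视觉令牌获得的注意力总量排序并保留前若干个。这只是对FitPrune目标(最小化剪枝前后注意力分布差异)的粗略替代,张量形状与接口均为假设。

```python
import torch

def prune_visual_tokens(attn, visual_idx, keep_ratio=0.5):
    """Rank visual tokens by the attention mass they receive on a calibration
    batch and keep the top fraction (a crude stand-in for FitPrune's objective).

    attn: (heads, seq, seq) attention weights from one layer.
    visual_idx: positions of the image tokens within the sequence.
    """
    received = attn.sum(dim=(0, 1))                 # attention received by each position
    k = max(1, int(keep_ratio * len(visual_idx)))
    keep = torch.topk(received[visual_idx], k).indices
    return visual_idx[keep]                         # token positions to retain at inference

# toy usage: 8 heads, 64-token sequence, positions 8..39 are image tokens
attn = torch.rand(8, 64, 64).softmax(dim=-1)
kept = prune_visual_tokens(attn, torch.arange(8, 40), keep_ratio=0.25)
```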

[NLP-14] LLMs for clinical risk prediction
[NLP-14] 用于临床风险预测的LLM

链接: https://arxiv.org/abs/2409.10191
作者: Mohamed Rezk,Patricia Cabanillas Silva,Fried-Michael Dahlweid
关键词-EN: clinalytix Medical, study compares, compares the efficacy, delirium development, Medical AI demonstrated
关键词-ZH: clinalytix Medical,研究比较,比较疗效,谵妄发展,医疗人工智能证明
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study compares the efficacy of GPT-4 and clinalytix Medical AI in predicting the clinical risk of delirium development. Findings indicate that GPT-4 exhibited significant deficiencies in identifying positive cases and struggled to provide reliable probability estimates for delirium risk, while clinalytix Medical AI demonstrated superior accuracy. A thorough analysis of the large language model’s (LLM) outputs elucidated potential causes for these discrepancies, consistent with limitations reported in extant literature. These results underscore the challenges LLMs face in accurately diagnosing conditions and interpreting complex clinical data. While LLMs hold substantial potential in healthcare, they are currently unsuitable for independent clinical decision-making. Instead, they should be employed in assistive roles, complementing clinical expertise. Continued human oversight remains essential to ensure optimal outcomes for both patients and healthcare providers.
摘要:本研究比较了GPT-4和clinalytix Medical AI在预测谵妄发生临床风险方面的效果。研究结果表明,GPT-4在识别阳性病例方面存在明显缺陷,并且难以为谵妄风险提供可靠的概率估计,而clinalytix Medical AI则表现出更高的准确性。对大型语言模型(LLM)输出的深入分析阐明了这些差异的潜在原因,这与现有文献中报告的局限性一致。这些结果凸显了LLM在准确诊断疾病和解释复杂临床数据方面面临的挑战。虽然LLM在医疗保健领域拥有巨大潜力,但目前它们不适合独立进行临床决策。相反,它们应该担任辅助角色,补充临床专业知识。持续的人工监督对于确保患者和医疗服务提供者获得最佳结果仍然至关重要。

[NLP-15] Augmenting Automatic Speech Recognition Models with Disfluency Detection
[NLP-15] 通过不流利检测增强自动语音识别模型

链接: https://arxiv.org/abs/2409.10177
作者: Robin Amann,Zhaolin Li,Barbara Bruno,Jan Niehues
关键词-EN: disfluency commonly occurs, Automatic Speech Recognition, Speech disfluency commonly, disfluency commonly, commonly occurs
关键词-ZH: 不流利通常发生,自动语音识别,言语不流利通常,不流利通常,通常发生
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by SLT2024

点击查看摘要

Abstract:Speech disfluency commonly occurs in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to accurately recognize these disfluencies because they are typically trained on fluent transcripts. Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech. Additionally, previous work often requires model fine-tuning and addresses limited types of disfluencies. In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing speech disfluencies. Next, this work proposes a modified Connectionist Temporal Classification (CTC)-based forced alignment algorithm from Kürzinger et al. (2020) to predict word-level timestamps while effectively capturing disfluent speech. Additionally, we develop a model to classify alignment gaps between timestamps as either containing disfluent speech or silence. This model achieves an accuracy of 81.62% and an F1-score of 80.07%. We test the augmentation pipeline of alignment gap detection and classification on a disfluent dataset. Our results show that we captured 74.13% of the words that were initially missed by the transcription, demonstrating the potential of this pipeline for downstream tasks.
摘要:言语不流利常见于会话和自发性言语中。然而,标准的自动语音识别(ASR)模型很难准确识别这些不流利现象,因为它们通常是在流利的转写文本上训练的。目前的研究主要集中在检测转写文本中的不流利,而忽略了它们在语音中的确切位置和持续时长。此外,以往的工作通常需要对模型进行微调,且只能处理有限类型的不流利。在这项工作中,我们提出了一种仅需推理的方法,可为任何ASR模型增加检测开放集不流利的能力。我们首先证明了ASR模型在转写不流利语音方面存在困难。接着,本文在Kürzinger等人(2020)的基础上提出了一种改进的基于连接时序分类(CTC)的强制对齐算法,在有效捕获不流利语音的同时预测词级时间戳。此外,我们构建了一个模型,将时间戳之间的对齐间隙分类为包含不流利语音或静音,该模型的准确率为81.62%,F1得分为80.07%。我们在一个不流利数据集上测试了对齐间隙检测与分类的增强流水线。结果显示,我们捕获了74.13%原本被转写遗漏的单词,表明该流水线在下游任务中的潜力。

[NLP-16] jina-embeddings-v3: Multilingual Embeddings With Task LoRA
[NLP-16] jina-embeddings-v3:多语言嵌入任务LoRA

链接: https://arxiv.org/abs/2409.10173
作者: Saba Sturua,Isabelle Mohr,Mohammad Kalim Akram,Michael Günther,Bo Wang,Markus Krimmel,Feng Wang,Georgios Mastrapas,Andreas Koukounas,Nan Wang,Han Xiao
关键词-EN: supporting context lengths, million parameters, supporting context, Matryoshka Representation Learning, long-context retrieval tasks
关键词-ZH: 支持上下文长度、百万个参数、支持上下文、Matryoshka表示学习、长上下文检索任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 20 pages, pp11-13 references, pp14-20 appendix and experiment tables

点击查看摘要

Abstract:We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters, achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
摘要:我们引入了jina-embeddings-v3,这是一种具有5.7亿个参数的新颖文本嵌入模型,在多语言数据和长上下文检索任务上实现了最先进的性能,支持高达8192个令牌的上下文长度。该模型包括一组特定于任务的低等级自适应(LoRA)适配器,用于生成用于查询文档检索、集群、分类和文本匹配的高质量嵌入。此外,Matryoshka Representative Learning已集成到训练过程中,允许灵活地截断嵌入维度,而不会影响性能。对MTEB基准的评估表明,jina-embeddings-v3在英语任务上的表现优于OpenAI和Kohere的最新专有嵌入,同时在所有多语言任务中实现了与多语言e5-大型指令相比更卓越的性能。

[NLP-17] Quantile Regression for Distributional Reward Models in RLHF
[NLP-17] RLHF中分布型奖励模型的分位数回归

链接: https://arxiv.org/abs/2409.10164
作者: Nicolai Dorka
关键词-EN: aligning large language, large language models, RLHF, aligning large, large language
关键词-ZH: 对齐大型语言、大型语言模型、RL HF、对齐大型语言
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at this https URL.
摘要:基于人类反馈的强化学习(RLHF)已成为通过奖励模型使大型语言模型(LLM)对齐人类偏好的关键方法。然而,传统奖励模型通常只给出点估计,过度简化了人类价值观和偏好的多样性与复杂性。在本文中,我们提出分位数奖励模型(QRM),这是一种新的奖励建模方法,它学习奖励的分布而非单一标量值。我们的方法使用分位数回归来估计偏好的完整分布(可能是多峰的),从而提供更强大、更细致的偏好表示。这种分布式方法能够更好地刻画人类价值观的多样性,处理标签噪声,并通过将相互冲突的偏好建模为分布中的不同模式来加以容纳。我们的实验结果表明,QRM在RewardBench上优于可比的传统点估计模型。此外,我们还证明分布估计所提供的额外信息可用于下游应用,例如风险感知的强化学习,从而得到产生更少极端负面回复的LLM策略。我们的代码和模型已在此 https URL 发布。
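
分位数回归通常用“弹球损失”(pinball loss)训练每个输出头去预测奖励分布的某个分位数。下面是该损失的一个最小示意实现,张量形状与变量名均为假设,并非论文代码。

```python
import torch

def pinball_loss(pred_quantiles, target, taus):
    """Quantile-regression (pinball) loss.

    pred_quantiles: (batch, n_quantiles) predicted reward quantiles
    target:         (batch,) observed scalar reward signal
    taus:           (n_quantiles,) quantile levels in (0, 1)
    """
    diff = target.unsqueeze(1) - pred_quantiles              # positive when prediction is too low
    return torch.mean(torch.maximum(taus * diff, (taus - 1) * diff))

taus = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9])
pred = torch.randn(4, 5, requires_grad=True)                  # 4 examples, 5 quantile heads
loss = pinball_loss(pred, torch.randn(4), taus)
loss.backward()
```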

[NLP-18] LLMs4OL 2024 Overview: The 1st Large Language Models for Ontology Learning Challenge ISWC2024
[NLP-18] LLMs4OL 2024概览:第一届面向本体学习的大型语言模型挑战赛

链接: https://arxiv.org/abs/2409.10146
作者: Hamed Babaei Giglou,Jennifer D’Souza,Sören Auer
关键词-EN: Large Language Models, Ontology Learning Challenge, Large Language, Language Models, Ontology Learning
关键词-ZH: 大型语言模型、实体学习挑战、大型语言、语言模型、实体学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 1 figure, Will appear in “The 1st LLMs4OL Challenge @ ISWC 2024” proceedings

点击查看摘要

Abstract:This paper outlines the LLMs4OL 2024, the first edition of the Large Language Models for Ontology Learning Challenge. LLMs4OL is a community development initiative collocated with the 23rd International Semantic Web Conference (ISWC) to explore the potential of Large Language Models (LLMs) in Ontology Learning (OL), a vital process for enhancing the web with structured knowledge to improve interoperability. By leveraging LLMs, the challenge aims to advance understanding and innovation in OL, aligning with the goals of the Semantic Web to create a more intelligent and user-friendly web. In this paper, we give an overview of the 2024 edition of the LLMs4OL challenge and summarize the contributions.
摘要:本文概述了LLMs4OL 2024,即第一届面向本体学习的大型语言模型挑战赛。LLMs4OL是与第23届国际语义网大会(ISWC)同期举办的一项社区发展计划,旨在探索大型语言模型(LLM)在本体学习(OL)中的潜力;本体学习是利用结构化知识增强网络以提升互操作性的重要过程。通过利用LLM,该挑战赛旨在推进对本体学习的理解和创新,与语义网构建更智能、更友好网络的目标保持一致。本文概述了2024年版LLMs4OL挑战赛并总结了各项贡献。

[NLP-19] StruEdit: Structured Outputs Enable the Fast and Accurate Knowledge Editing for Large Language Models
[NLP-19] StruEdit:结构化输出支持大型语言模型快速准确的知识编辑

链接: https://arxiv.org/abs/2409.10132
作者: Baolong Bi,Shenghua Liu,Yiwei Wang,Lingrui Mei,Hongcheng Gao,Junfeng Fang,Xueqi Cheng
关键词-EN: large language models, question answering, modern tool, tool of choice, choice for question
关键词-ZH: 大型语言模型、问答、现代工具、选择工具、问题选择
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As the modern tool of choice for question answering, large language models (LLMs) are expected to deliver answers with up-to-date knowledge. To achieve such ideal question-answering systems, locating and then editing outdated knowledge in the natural language outputs is a general target of popular knowledge editing methods. However, this target is challenging, as both identifying which tokens to edit in the reasoning steps and ensuring the coherence of the revised reasoning chain are difficult tasks. We argue that these challenges stem from the unstructured nature of natural language outputs. To address the above challenges, we propose \textbfStru ctural \textbfEdit ing ( \textbfStruEdit ), an improved baseline for knowledge editing. We first prompt LLMs to produce structured outputs consisting of reasoning triplets. Then, StruEdit removes any potentially outdated knowledge and efficiently refills the structured outputs with up-to-date information in a single step. Experimental results show that StruEdit consistently delivers the highest accuracy with lowest latency compared with other knowledge editing methods.
摘要:作为现代问答的首选工具,大型语言模型(LLM)被期望提供基于最新知识的答案。为了实现这种理想的问答系统,在自然语言输出中定位并编辑过时知识是主流知识编辑方法的共同目标。然而,这一目标具有挑战性:既要确定推理步骤中需要编辑哪些token,又要保证修改后推理链的连贯性,都是困难的任务。我们认为这些挑战源于自然语言输出的非结构化特性。为应对上述挑战,我们提出了结构化编辑(StruEdit),一种改进的知识编辑基线。我们首先提示LLM生成由推理三元组构成的结构化输出;随后,StruEdit移除任何可能过时的知识,并在单一步骤中用最新信息高效地重新填充结构化输出。实验结果表明,与其他知识编辑方法相比,StruEdit始终以最低的延迟提供最高的准确率。

[NLP-20] Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT
[NLP-20] 基于扬声器分离HuBERT的自监督音节发现

链接: https://arxiv.org/abs/2409.10103
作者: Ryota Komatsu,Takahiro Shinozaki
关键词-EN: extracting meaningful features, untranscribed audio, essential for extracting, extracting meaningful, CLS token
关键词-ZH: 提取有意义的特征、未转录的音频、提取必不可少的、提取有意义的、LIS标记
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by IEEE SLT 2024

点击查看摘要

Abstract:Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio. Recent advances highlight the potential of deriving discrete symbols from the features correlated with linguistic units, which enables text-less training across diverse tasks. In particular, sentence-level Self-Distillation of the pretrained HuBERT (SD-HuBERT) induces syllabic structures within latent speech frame representations extracted from an intermediate Transformer layer. In SD-HuBERT, sentence-level representation is accumulated from speech frame features through self-attention layers using a special CLS token. However, we observe that the information aggregated in the CLS token correlates more with speaker identity than with linguistic content. To address this, we propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information. Our method introduces speaker perturbation as data augmentation and adopts a frame-level training objective to prevent the CLS token from aggregating paralinguistic information. Experimental results show that our approach surpasses the current state-of-the-art method in most syllable segmentation and syllabic unit quality metrics on Librispeech, underscoring its effectiveness in promoting syllabic organization within speech-only models.
摘要:自监督语音表示学习已成为从未转写音频中提取有意义特征的关键手段。最近的进展表明,可以从与语言单元相关的特征中推导出离散符号,从而使多种任务的无文本训练成为可能。特别地,对预训练HuBERT进行句子级自蒸馏(SD-HuBERT)能在从中间Transformer层提取的潜在语音帧表示中诱导出音节结构。在SD-HuBERT中,句子级表示是通过自注意力层、借助特殊的CLS token从语音帧特征中汇聚而来的。然而,我们观察到CLS token中聚合的信息与说话人身份的相关性高于与语言内容的相关性。为解决这一问题,我们提出了一种仅使用语音的自监督微调方法,将音节单元与说话人信息解耦。我们的方法引入说话人扰动作为数据增强,并采用帧级训练目标,防止CLS token聚合副语言信息。实验结果表明,在Librispeech上,我们的方法在大多数音节切分和音节单元质量指标上超过了当前最先进的方法,凸显了其在纯语音模型中促进音节组织的有效性。

[NLP-21] Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
[NLP-21] 检索增强生成系统中的可信性:综述

链接: https://arxiv.org/abs/2409.10102
作者: Yujia Zhou,Yan Liu,Xiaoxi Li,Jiajie Jin,Hongjin Qian,Zheng Liu,Chaozhuo Li,Zhicheng Dou,Tsung-Yi Ho,Philip S. Yu
关键词-EN: Large Language Models, Large Language, RAG systems, Retrieval-Augmented Generation, development of Large
关键词-ZH: 大型语言模型、大型语言、RAG系统、检索增强生成、大型开发
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). While much of the current research in this field focuses on performance optimization, particularly in terms of accuracy and efficiency, the trustworthiness of RAG systems remains an area still under exploration. From a positive perspective, RAG systems are promising to enhance LLMs by providing them with useful and up-to-date knowledge from vast external databases, thereby mitigating the long-standing problem of hallucination. While from a negative perspective, RAG systems are at the risk of generating undesirable contents if the retrieved information is either inappropriate or poorly utilized. To address these concerns, we propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we thoroughly review the existing literature on each dimension. Additionally, we create the evaluation benchmark regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Finally, we identify the potential challenges for future research based on our investigation results. Through this work, we aim to lay a structured foundation for future investigations and provide practical insights for enhancing the trustworthiness of RAG systems in real-world applications.
摘要:检索-扩充生成(RAG)已经迅速发展成为大型语言模型(LLMS)开发中的一个关键范例。虽然目前该领域的大部分研究都集中在性能优化上,特别是在精度和效率方面,但RAG系统的可信性仍然是一个有待探索的领域。从积极的角度来看,RAG系统有望通过从庞大的外部数据库中向LLM提供有用和最新的知识来增强LLM,从而缓解长期存在的幻觉问题。然而,从负面的角度来看,如果检索到的信息不适当或利用不当,RAG系统就有产生不良内容的风险。为了解决这些问题,我们提出了一个统一的框架,从六个关键维度评估RAG系统的可信性:真实性、健壮性、公平性、透明度、问责制和隐私。在这个框架内,我们彻底审查了每个方面的现有文献。此外,我们创建了关于六个维度的评估基准,并对各种专有和开源模型进行了全面评估。最后,基于我们的调查结果,我们确定了未来研究的潜在挑战。通过这项工作,我们的目标是为未来的研究奠定一个结构化的基础,并为提高RAG系统在现实世界应用中的可信性提供实践见解。

[NLP-22] LLM-DER:A Named Entity Recognition Method Based on Large Language Models for Chinese Coal Chemical Domain
[NLP-22] LLM-DER:一种面向中国煤化工领域的基于大语言模型的命名实体识别方法

链接: https://arxiv.org/abs/2409.10077
作者: Le Xiao,Yunfei Xu,Jing Zhao
关键词-EN: Named Entity Recognition, Domain-specific Named Entity, domain-specific entity recognition, Entity Recognition, domain knowledge graphs
关键词-ZH: 命名实体识别、特定领域命名实体、特定领域实体识别、实体识别、领域知识图
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Domain-specific Named Entity Recognition (NER), whose goal is to recognize domain-specific entities and their categories, provides an important support for constructing domain knowledge graphs. Currently, deep learning-based methods are widely used and effective in NER tasks, but they rely on large-scale labeled data; as a result, the scarcity of labeled data in a specific domain limits their application. Therefore, many studies have introduced few-shot methods and achieved some results. However, the entity structures in specific domains are often complex, and current few-shot methods are difficult to adapt to NER tasks with complex features. Taking the Chinese coal chemical industry domain as an example, there exists a complex structure of multiple entities sharing a single entity, as well as multiple relationships for the same pair of entities, which affects the NER task under the few-shot setting. In this paper, we propose a Large Language Models (LLMs)-based entity recognition framework, LLM-DER, for the domain-specific entity recognition problem in Chinese, which enriches the entity information by generating a list of relationships containing entity types through LLMs, and designs a plausibility and consistency evaluation method to remove misrecognized entities, effectively solving the complex-structure entity recognition problem in a specific domain. The experimental results on the Resume dataset and the self-constructed coal chemical dataset Coal show that LLM-DER performs outstandingly in domain-specific entity recognition, not only outperforming the existing GPT-3.5-turbo baseline but also exceeding the fully-supervised baseline, verifying its effectiveness in entity recognition.
摘要:特定领域命名实体识别(NER)旨在识别特定领域的实体及其类别,为构建领域知识图谱提供了重要支撑。目前,基于深度学习的方法在NER任务中应用广泛且效果良好,但依赖大规模标注数据,因此特定领域标注数据的稀缺会限制其应用。为此,许多研究开始引入小样本(few-shot)方法,并取得了一些成果。然而,特定领域中的实体结构往往很复杂,现有小样本方法难以适应具有复杂特征的NER任务。以中国煤化工领域为例,既存在多个实体共享同一实体的复杂结构,也存在同一对实体之间的多重关系,这影响了小样本条件下的NER任务。针对中文特定领域实体识别问题,本文提出了一种基于大语言模型(LLM)的实体识别框架LLM-DER:通过LLM生成包含实体类型的关系列表来丰富实体信息,并设计了一种合理性与一致性评估方法来去除误识别的实体,从而有效解决特定领域中的复杂结构实体识别问题。本文在Resume数据集和自建煤化工数据集Coal上的实验结果表明,LLM-DER在特定领域实体识别中表现突出,不仅优于现有的GPT-3.5-turbo基线,还超过了全监督基线,验证了其在实体识别中的有效性。

[NLP-23] Increasing faithfulness in human-human dialog summarization with Spoken Language Understanding tasks
[NLP-23] 通过口语理解任务提高人与人对话总结的忠实性

链接: https://arxiv.org/abs/2409.10070
作者: Eunice Akani,Benoit Favre,Frederic Bechet,Romain Gemignani
关键词-EN: aims to provide, provide a concise, concise and coherent, conversations between multiple, multiple speakers
关键词-ZH: 旨在提供多个说话者之间的简洁、简洁和连贯的对话
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dialogue summarization aims to provide a concise and coherent summary of conversations between multiple speakers. While recent advancements in language models have enhanced this process, summarizing dialogues accurately and faithfully remains challenging due to the need to understand speaker interactions and capture relevant information. Indeed, abstractive models used for dialog summarization may generate summaries that contain inconsistencies. We suggest using the semantic information proposed for performing Spoken Language Understanding (SLU) in human-machine dialogue systems for goal-oriented human-human dialogues to obtain a more semantically faithful summary regarding the task. This study introduces three key contributions: First, we propose an exploration of how incorporating task-related information can enhance the summarization process, leading to more semantically accurate summaries. Then, we introduce a new evaluation criterion based on task semantics. Finally, we propose a new dataset version with increased annotated data standardized for research on task-oriented dialogue summarization. The study evaluates these methods using the DECODA corpus, a collection of French spoken dialogues from a call center. Results show that integrating models with task-related information improves summary accuracy, even with varying word error rates.
摘要:对话摘要旨在为多个说话人之间的对话提供简明连贯的总结。尽管语言模型的最新进展改进了这一过程,但由于需要理解说话人之间的交互并捕获相关信息,准确而忠实地概括对话仍然具有挑战性;事实上,用于对话摘要的生成式(abstractive)模型可能产生包含不一致内容的摘要。我们建议在面向目标的人-人对话中,利用人机对话系统中为口语理解(SLU)提出的语义信息,以获得在任务层面语义上更忠实的摘要。本研究有三项主要贡献:首先,我们探索了如何通过融入任务相关信息来改进摘要过程,从而得到语义上更准确的摘要;其次,我们提出了一种基于任务语义的新评价标准;最后,我们发布了一个新的数据集版本,增加了标准化的标注数据,用于面向任务的对话摘要研究。本研究使用DECODA语料库(一组来自呼叫中心的法语口语对话)对这些方法进行了评估。结果表明,即使在不同的词错误率下,将任务相关信息融入模型也能提高摘要的准确性。

[NLP-24] MindGuard: Towards Accessible and Sitgma-free Mental Health First Aid via Edge LLM
[NLP-24] MindGuard:通过Edge LLM实现无障碍且无Sitgma的心理健康急救

链接: https://arxiv.org/abs/2409.10064
作者: Sijie Ji,Xinzhe Zheng,Jiawei Sun,Renqi Chen,Wei Gao,Mani Srivastava
关键词-EN: prevalent diseases worldwide, diseases worldwide, prevalent diseases, Mental health disorders, Mental health
关键词-ZH: 全球流行疾病、全球疾病、流行疾病、精神健康障碍、精神健康
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Mental health disorders are among the most prevalent diseases worldwide, affecting nearly one in four people. Despite their widespread impact, the intervention rate remains below 25%, largely due to the significant cooperation required from patients for both diagnosis and intervention. The core issue behind this low treatment rate is stigma, which discourages over half of those affected from seeking help. This paper presents MindGuard, an accessible, stigma-free, and professional mobile mental healthcare system designed to provide mental health first aid. The heart of MindGuard is an innovative edge LLM, equipped with professional mental health knowledge, that seamlessly integrates objective mobile sensor data with subjective Ecological Momentary Assessment records to deliver personalized screening and intervention conversations. We conduct a broad evaluation of MindGuard using open datasets spanning four years and real-world deployment across various mobile devices involving 20 subjects for two weeks. Remarkably, MindGuard achieves results comparable to GPT-4 and outperforms its counterpart with more than 10 times the model size. We believe that MindGuard paves the way for mobile LLM applications, potentially revolutionizing mental healthcare practices by substituting self-reporting and intervention conversations with passive, integrated monitoring within daily life, thus ensuring accessible and stigma-free mental health support.
摘要:精神健康障碍是世界上最常见的疾病之一,近四分之一的人受到影响。尽管影响广泛,但干预率仍低于25%,这在很大程度上是因为诊断和干预都需要患者的大力合作。这种低治疗率背后的核心问题是耻辱,这阻碍了超过一半的受影响的人寻求帮助。本文介绍了MindGuard,这是一个可访问的、无污名的、专业的移动心理保健系统,旨在提供心理健康急救。MindGuard的核心是创新的EDGE LLM,配备了专业的心理健康知识,可将客观移动传感器数据与主观生态瞬时评估记录无缝集成,以提供个性化的筛查和干预对话。我们使用跨越四年的开放数据集和涉及20名受试者的各种移动设备的真实世界部署,在两周内对MindGuard进行了广泛的评估。值得注意的是,MindGuard取得了与GPT-4相当的结果,并且比同行的模型大小高出10倍以上。我们相信,MindGuard为移动LLM应用程序铺平了道路,通过将自我报告和干预对话替换为日常生活中被动的集成监控,潜在地为精神健康实践带来革命性的变化,从而确保可获得和无污名的精神健康支持。

[NLP-25] Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
[NLP-25] Householder伪旋转:一种基于方向-幅度视角的LLM激活编辑新方法

链接: https://arxiv.org/abs/2409.10053
作者: Van-Cuong Pham,Thien Huu Nguyen
关键词-EN: large language models, involves directly editting, achieve desired properties, language models, desired properties
关键词-ZH: 大型语言模型,涉及直接编辑,实现所需的属性、语言模型、所需的属性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Activation Editing, which involves directly editting the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs’ activations as points in space and modify them by adding steering vectors. However, this approach is limited in its ability to achieve greater performance improvement while maintaining the necessary consistency of activation magnitudes. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics the rotation transformation, thus preserving activation norms and resulting in an improved performance on various safety benchmarks.
摘要:激活编辑通过直接编辑大型语言模型(LLM)的内部表示来改变其行为并获得期望的属性,已成为一个前景广阔的研究方向。现有工作主要将LLM的激活视为空间中的点,并通过加上转向向量(steering vector)来修改它们。然而,这种方法在追求更大性能提升的同时,难以保持激活幅度的必要一致性。为克服这些问题,我们提出了一种从方向和幅度两个角度看待激活的新编辑方法。该方法名为Householder伪旋转(HPR),它模仿旋转变换,从而保持激活的范数,并在多个安全基准上取得了更好的性能。
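
下面用一个简化示意说明Householder反射为何能“只改方向、不改幅度”:反射矩阵 H = I - 2vvᵀ 是正交矩阵,因此保持向量范数不变。目标方向与变量名均为假设,并非论文中HPR的完整流程。

```python
import torch

def householder_edit(activation, target_direction):
    """Reflect an activation toward `target_direction` with a Householder
    transform H = I - 2 v v^T; H is orthogonal, so the norm is preserved."""
    a = activation / activation.norm()
    t = target_direction / target_direction.norm()
    v = a - t
    if v.norm() < 1e-8:                      # already aligned with the target direction
        return activation.clone()
    v = v / v.norm()
    return activation - 2.0 * (activation @ v) * v

h = torch.randn(4096)                         # a hidden-state activation
edited = householder_edit(h, torch.randn(4096))
print(torch.isclose(h.norm(), edited.norm(), atol=1e-3))   # norm unchanged, direction edited
```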

[NLP-26] Benchmarking Large Language Model Uncertainty for Prompt Optimization
[NLP-26] 面向提示优化的大型语言模型不确定性基准测试

链接: https://arxiv.org/abs/2409.10044
作者: Pei-Fu Guo,Yun-Da Tsai,Shou-De Lin
关键词-EN: Large Language Models, Large Language, effective uncertainty estimation, lack effective uncertainty, algorithms for Large
关键词-ZH: 大型语言模型,大型语言,有效的不确定性估计,缺乏有效的不确定性,大型算法
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models like GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, rather than Correctness Uncertainty, highlighting the need for improved metrics that are optimization-objective-aware to better guide prompt optimization. Our code and dataset are available at this https URL.
摘要:大型语言模型(LLM)的提示优化算法在多步推理方面表现出色,但仍然缺乏有效的不确定性估计。本文引入了一个基准数据集来评估不确定性指标,重点关注答案不确定性、正确性不确定性、偶然(Aleatoric)不确定性和认知(Epistemic)不确定性。通过对GPT-3.5-Turbo和Meta-Llama-3.1-8B-Instruct等模型的分析,我们表明当前的指标更符合反映输出置信度和多样性的答案不确定性,而非正确性不确定性,这凸显了需要对优化目标有感知的改进指标,以更好地指导提示优化。我们的代码和数据集可在此 https URL 中获取。
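下面用一个纯 Python 小例子示意"答案不确定性"与"正确性不确定性"的区别(非论文基准的实现;采样答案与参考答案均为虚构):前者只看多次采样答案的分散程度,后者需要对照参考答案统计。

```python
import math
from collections import Counter

def answer_uncertainty(samples):
    """对同一提示多次采样得到的答案,用经验分布的熵近似"答案不确定性"。"""
    counts = Counter(samples)
    total = len(samples)
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# 假设对同一道题采样了 8 次
samples = ["42", "42", "42", "41", "42", "42", "43", "42"]
print(f"答案不确定性(熵): {answer_uncertainty(samples):.3f}")

# 正确性不确定性则需要参考答案:统计采样答案命中正确答案的频率
gold = "42"
accuracy = sum(s == gold for s in samples) / len(samples)
print(f"采样正确率: {accuracy:.2f}")   # 熵低不一定正确率高,二者并不等价
```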

[NLP-27] On the Diagram of Thought
[NLP-27] 关于思维图表

链接: https://arxiv.org/abs/2409.10038
作者: Yifan Zhang,Yang Yuan,Andrew Chi-Chih Yao
关键词-EN: directed acyclic graph, acyclic graph, cohesive DAG structure, Diagram of Thought, directed acyclic
关键词-ZH: 有向无环图、无环图、内聚DAG结构、思维图、有向无环
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Diagram of Thought (DoT), a framework that models iterative reasoning in large language models (LLMs) as the construction of a directed acyclic graph (DAG) within a single model. Unlike traditional approaches that represent reasoning as linear chains or trees, DoT organizes propositions, critiques, refinements, and verifications into a cohesive DAG structure, allowing the model to explore complex reasoning pathways while maintaining logical consistency. Each node in the diagram corresponds to a proposition that has been proposed, critiqued, refined, or verified, enabling the LLM to iteratively improve its reasoning through natural language feedback. By leveraging auto-regressive next-token prediction with role-specific tokens, DoT facilitates seamless transitions between proposing ideas and critically evaluating them, providing richer feedback than binary signals. Furthermore, we formalize the DoT framework using Topos Theory, providing a mathematical foundation that ensures logical consistency and soundness in the reasoning process. This approach enhances both the training and inference processes within a single LLM, eliminating the need for multiple models or external control mechanisms. DoT offers a conceptual framework for designing next-generation reasoning-specialized models, emphasizing training efficiency, robust reasoning capabilities, and theoretical grounding. The code is available at this https URL.
摘要:我们介绍了思想图(DOT),这是一个在大型语言模型(LLMS)中将迭代推理建模为在单个模型中构建有向无环图(DAG)的框架。与将推理表示为线性链或树的传统方法不同,DOT将命题、批评、细化和验证组织到一个连贯的DAG结构中,允许模型在保持逻辑一致性的同时探索复杂的推理路径。图中的每一个节点都对应于一个已被提出、批评、改进或验证的命题,从而使LLM能够通过自然语言反馈迭代地改进其推理。通过利用具有角色特定令牌的自动回归下一令牌预测,DOT促进了提出想法和批判性评估之间的无缝转换,提供了比二进制信号更丰富的反馈。此外,我们使用Topos理论形式化了DOT框架,为确保推理过程中的逻辑一致性和可靠性提供了数学基础。这种方法增强了单个LLM中的训练和推理过程,消除了对多个模型或外部控制机制的需要。DOT为设计下一代推理专用模型提供了一个概念性框架,强调训练效率、稳健的推理能力和理论基础。代码可在此HTTPS URL上找到。
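下面是一个把推理过程组织为 DAG 的极简数据结构草图(非论文实现;节点内容为虚构示例),用于说明命题、批评、验证节点如何通过父边连接,并按拓扑序读出推理链:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str          # 命题内容
    role: str          # "propose" / "critique" / "refine" / "verify"
    parents: list = field(default_factory=list)   # 指向其所依据的节点编号

# 用字典模拟在单个模型内部构建的推理 DAG
dag = {
    0: Node("设 x+y=10 且 x-y=2", "propose"),
    1: Node("两式相加得 2x=12, 故 x=6", "propose", parents=[0]),
    2: Node("检查: 由 2x=12 推出 x=6, 推导无误", "critique", parents=[1]),
    3: Node("代回得 y=4, 验证 6+4=10 和 6-4=2 均成立", "verify", parents=[1, 2]),
}

def topological_order(dag):
    """按依赖关系输出节点顺序; 若存在环则说明推理图不是合法的 DAG。"""
    visited, order, in_stack = set(), [], set()
    def dfs(i):
        if i in in_stack:
            raise ValueError("推理图存在环, 不是合法的 DAG")
        if i in visited:
            return
        in_stack.add(i)
        for p in dag[i].parents:
            dfs(p)
        in_stack.discard(i)
        visited.add(i)
        order.append(i)
    for i in dag:
        dfs(i)
    return order

for i in topological_order(dag):
    print(i, dag[i].role, dag[i].text)
```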

[NLP-28] AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
[NLP-28] AceParse:具有多样化结构文本的综合数据集,用于学术文献解析

链接: https://arxiv.org/abs/2409.10016
作者: Huawei Ji,Cheng Deng,Bo Xue,Zhouyang Jin,Jiaxin Ding,Xiaoying Gan,Luoyi Fu,Xinbing Wang,Chenghu Zhou
关键词-EN: improving data quality, data quality, Academic literature, development of data-centric, focus has shifted
关键词-ZH: 提高数据质量,数据质量,学术文献,以数据为中心的发展,重点已经转移
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, 3 tables

点击查看摘要

Abstract:With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, as one of the crucial types, is predominantly stored in PDF formats and needs to be parsed into texts before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at this https URL.
摘要:随着以数据为中心的人工智能的发展,人们的注意力已经从模型驱动的方法转移到提高数据质量上来。学术文献作为其中的一种重要类型,主要以PDF格式存储,需要在进一步处理之前解析成文本。然而,由于缺乏涵盖各种文本结构的数据集,解析学术文献中形式多样的结构化文本仍然具有挑战性。在本文中,我们介绍了AceParse,这是第一个全面的数据集,旨在支持对各种结构化文本的解析,包括公式、表格、列表、算法和嵌入数学表达式的句子。在AceParse的基础上,我们微调了一个名为AceParser的多模态模型,该模型能够准确地解析学术文献中的各种结构化文本。该模型在F1分数和Jaccard相似度方面分别比此前最优模型高4.1%和5%,展示了多模态模型在学术文献解析中的潜力。我们的数据集可通过此 https URL 获得。

[NLP-29] HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making
[NLP-29] HALO:幻觉分析和学习优化,通过检索增强上下文来增强LLM的指导性临床决策

链接: https://arxiv.org/abs/2409.10011
作者: Sumera Anjum,Hanzhi Zhang,Wenjun Zhou,Eun Jin Paek,Xiaopeng Zhao,Yunhe Feng
关键词-EN: Large language models, language processing tasks, advanced natural language, natural language processing, significantly advanced natural
关键词-ZH: 大型语言模型、语言处理任务、高级自然语言、自然语言处理、显着高级自然
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) have significantly advanced natural language processing tasks, yet they are susceptible to generating inaccurate or unreliable responses, a phenomenon known as hallucination. In critical domains such as health and medicine, these hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medical question-answering (QA) systems by focusing on the detection and mitigation of hallucinations. Our approach generates multiple variations of a given query using LLMs and retrieves relevant information from external open knowledge bases to enrich the context. We utilize maximum marginal relevance scoring to prioritize the retrieved context, which is then provided to LLMs for answer generation, thereby reducing the risk of hallucinations. The integration of LangChain further streamlines this process, resulting in a notable and robust increase in the accuracy of both open-source and commercial LLMs, such as Llama-3.1 (from 44% to 65%) and ChatGPT (from 56% to 70%). This framework underscores the critical importance of addressing hallucinations in medical QA systems, ultimately improving clinical decision-making and patient care. The open-source HALO is available at: this https URL.
摘要:大型语言模型显著推进了自然语言处理任务,但它们容易产生不准确或不可靠的回答,这种现象被称为幻觉。在健康和医学等关键领域,这些幻觉可能带来严重风险。本文介绍了HALO,这是一个新的框架,旨在通过专注于幻觉的检测和缓解来提高医疗问答(QA)系统的准确性和可靠性。我们的方法使用LLM生成给定查询的多个变体,并从外部开放知识库中检索相关信息以丰富上下文。我们使用最大边际相关性评分来确定检索到的上下文的优先级,然后将其提供给LLM用于生成答案,从而降低幻觉的风险。LangChain的整合进一步简化了这一流程,使开源和商业LLM的准确率均显著提升,如Llama-3.1(从44%到65%)和ChatGPT(从56%到70%)。这一框架强调了解决医疗问答系统中幻觉问题、最终改善临床决策和患者护理的关键重要性。开源的HALO可在此 https URL 获得。
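摘要中提到的"最大边际相关性(MMR)"可以用如下 numpy 草图示意(非 HALO 的官方代码;向量维度与 λ 取值均为假设):在挑选上下文时兼顾与查询的相关性和与已选上下文的差异性。

```python
import numpy as np

def mmr_rank(query_vec, doc_vecs, k=3, lam=0.7):
    """最大边际相关性(MMR): 在相关性与多样性之间折中, 选出 k 条上下文的下标。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    rel = [cos(query_vec, d) for d in doc_vecs]          # 与查询的相关性
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            # 与已选上下文中最相似者的相似度, 作为"冗余度"惩罚
            diversity = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * diversity
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
query = rng.normal(size=384)            # 假设的查询向量维度
docs = rng.normal(size=(10, 384))       # 假设检索回 10 条候选上下文
print(mmr_rank(query, docs, k=3))       # 输出被选中的上下文下标
```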

[NLP-30] SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL
[NLP-30] SelECT-SQL:文本到SQL的自我纠正集成思想链

链接: https://arxiv.org/abs/2409.10007
作者: Ke Shen,Mayank Kejriwal
关键词-EN: data management research, formal SQL queries, natural language processing, converting questions posed, automatically converting questions
关键词-ZH: 数据管理研究、正式SQL查询、自然语言处理、转换提出的问题、自动转换问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, Text-to-SQL, the problem of automatically converting questions posed in natural language to formal SQL queries, has emerged as an important problem at the intersection of natural language processing and data management research. Large language models (LLMs) have delivered impressive performance when used off-the-shelf, but still fall significantly short of expected expert-level performance. Errors are especially probable when a nuanced understanding is needed of database schemas, questions, and SQL clauses to do proper Text-to-SQL conversion. We introduce SelECT-SQL, a novel in-context learning solution that uses an algorithmic combination of chain-of-thought (CoT) prompting, self-correction, and ensemble methods to yield a new state-of-the-art result on challenging Text-to-SQL benchmarks. Specifically, when configured using GPT-3.5-Turbo as the base LLM, SelECT-SQL achieves 84.2% execution accuracy on the Spider leaderboard’s development set, exceeding both the best results of other baseline GPT-3.5-Turbo-based solutions (81.1%), and the peak performance (83.5%) of the GPT-4 result reported on the leaderboard.
摘要:近年来,将自然语言提出的问题自动转换为正式的SQL查询的Text-to-SQL问题已经成为自然语言处理和数据管理研究的一个重要课题。大型语言模型(LLM)在用于现成性能时提供了令人印象深刻的性能,但仍远远低于预期的专家级性能。当需要细致入微地理解数据库模式、问题和SQL子句以执行正确的文本到SQL转换时,出现错误的可能性尤其大。我们介绍了SELECT-SQL,这是一种新颖的情景学习解决方案,它使用思想链(CoT)提示、自我纠正和集成方法的算法组合,在挑战Text-to-SQL基准测试方面产生新的最先进的结果。具体地说,当使用GPT-3.5-Turbo作为基本LLM进行配置时,SELECT-SQL在Spider排行榜的开发集上实现了84.2%的执行准确率,超过了其他基于GPT-3.5-Turbo的解决方案的最佳结果(81.1%)和排行榜上报告的GPT-4结果的峰值性能(83.5%)。
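下面用一个可运行的 Python 草图示意"采样多条候选 SQL + 自我纠错 + 按执行结果多数投票"的集成思路(非 SelECT-SQL 的官方实现;其中 generate_sql_candidates 与 self_correct 为假设的 LLM 调用占位函数,数据库表结构也是虚构的):

```python
import sqlite3
from collections import Counter

def generate_sql_candidates(question, n=5):
    """假设的 LLM 采样接口: 用思维链提示对同一问题生成 n 条候选 SQL(此处用占位答案)。"""
    return ["SELECT COUNT(*) FROM singer WHERE age > 30",
            "SELECT COUNT(*) FROM singer WHERE age >= 30",
            "SELECT COUNT(*) FROM singer WHERE age > 30",
            "SELECT COUNT(*) FROM singer WHERE age > 30",
            "SELECT COUNT(name) FROM singer WHERE age > 30"][:n]

def self_correct(sql, error):
    """假设的自我纠错接口: 把报错信息交还给 LLM 修改 SQL(此处原样返回)。"""
    return sql

def select_sql(question, conn, n=5):
    results = []
    for sql in generate_sql_candidates(question, n):
        for _ in range(2):                               # 至多进行一次自我纠错
            try:
                rows = tuple(conn.execute(sql).fetchall())
                results.append((rows, sql))
                break
            except sqlite3.Error as e:
                sql = self_correct(sql, str(e))
    if not results:
        return None
    # 集成: 以执行结果出现次数最多的候选作为最终答案
    majority = Counter(r for r, _ in results).most_common(1)[0][0]
    return next(sql for r, sql in results if r == majority)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("A", 35), ("B", 25), ("C", 41)])
print(select_sql("有多少位歌手年龄超过 30 岁?", conn))
```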

[NLP-31] Comprehensive Study on Sentiment Analysis: From Rule-based to modern LLM based system
[NLP-31] 情感分析综合研究:从基于规则的现代LLM系统

链接: https://arxiv.org/abs/2409.09989
作者: Shailja Gupta,Rajesh Ranjan,Surya Narayan Singh
关键词-EN: large language models, sentiment analysis, artificial intelligence, large language, deep learning models
关键词-ZH: 大型语言模型、情感分析、人工智能、大型语言、深度学习模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 2 Images

点击查看摘要

Abstract:This paper provides a comprehensive survey of sentiment analysis within the context of artificial intelligence (AI) and large language models (LLMs). Sentiment analysis, a critical aspect of natural language processing (NLP), has evolved significantly from traditional rule-based methods to advanced deep learning techniques. This study examines the historical development of sentiment analysis, highlighting the transition from lexicon-based and pattern-based approaches to more sophisticated machine learning and deep learning models. Key challenges are discussed, including handling bilingual texts, detecting sarcasm, and addressing biases. The paper reviews state-of-the-art approaches, identifies emerging trends, and outlines future research directions to advance the field. By synthesizing current methodologies and exploring future opportunities, this survey aims to understand sentiment analysis in the AI and LLM context thoroughly.
摘要:本文对人工智能(AI)和大型语言模型(LLM)背景下的情感分析进行了全面调查。情感分析是自然语言处理(NLP)的一个关键方面,已经从传统的基于规则的方法显着发展到先进的深度学习技术。本研究探讨了情感分析的历史发展,强调了从基于词典和基于模式的方法向更复杂的机器学习和深度学习模型的转变。讨论了关键挑战,包括处理双语文本、检测讽刺和解决偏见。该论文回顾了最先进的方法,确定了新兴趋势,并概述了未来推进该领域的研究方向。通过综合当前的方法论并探索未来的机会,这项调查旨在彻底了解人工智能和LLM背景下的情绪分析。

[NLP-32] Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations
[NLP-32] 差距还是幻觉?关注机器生成的法律分析进行细粒度文本评估

链接: https://arxiv.org/abs/2409.09947
作者: Abe Bohan Hou,William Jurayj,Nils Holzenberger,Andrew Blair-Stanek,Benjamin Van Durme
关键词-EN: Large Language Models, Large Language, Language Models, performing legal analyses, professionals performing legal
关键词-ZH: 大型语言模型、大型语言、语言模型、执行法律分析、执行法律的专业人员
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.
摘要:大型语言模型(LLM)有望成为专业人士进行法律分析的写作辅助工具。然而,在这种情况下,LLM常常会产生幻觉,而且其形式很难被非专业人员和现有的文本评估指标识别。在这项工作中,我们提出了一个问题:机器生成的法律分析何时才能被评估为可接受的?我们引入了中立的差距(gap)概念,以区别于严格错误意义上的幻觉,用来指人类撰写与机器生成的法律分析之间的差异。差距并不总是等同于无效的生成。我们与法律专家合作,研究了Hou等人(2024b)提出的CLERC生成任务,由此得到了一套分类体系、一个用于预测差距类别的细粒度检测器,以及一个用于自动评估的标注数据集。我们最好的检测器在测试集上达到了67%的F1分数和80%的精确率。将该检测器用作SOTA LLM所生成法律分析的自动度量,我们发现约80%的分析包含不同类型的幻觉。

[NLP-33] Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies and Oracle Challenges
[NLP-33] 面向现代大型语言模型的数据污染检测:局限性、不一致性与Oracle挑战

链接: https://arxiv.org/abs/2409.09927
作者: Vinay Samuel,Yue Zhou,Henry Peng Zou
关键词-EN: increasingly impressive results, large language models, language models achieve, models achieve increasingly, achieve increasingly impressive
关键词-ZH: 越来越令人印象深刻的结果,大型语言模型,语言模型实现,模型实现越来越,实现越来越令人印象深刻
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure

点击查看摘要

Abstract:As large language models achieve increasingly impressive results, questions arise about whether such performance is from generalizability or mere data memorization. Thus, numerous data contamination detection methods have been proposed. However, these approaches are often validated with traditional benchmarks and early-stage LLMs, leaving uncertainty about their effectiveness when evaluating state-of-the-art LLMs on the contamination of more challenging benchmarks. To address this gap and provide a dual investigation of SOTA LLM contamination status and detection method robustness, we evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets often used in modern LLM evaluation. Our analysis reveals that (1) Current methods have non-trivial limitations in their assumptions and practical applications; (2) Notable difficulties exist in detecting contamination introduced during instruction fine-tuning with answer augmentation; and (3) Limited consistencies between SOTA contamination detection techniques. These findings highlight the complexity of contamination detection in advanced LLMs and the urgent need for further research on robust and generalizable contamination evaluation. Our code is available at this https URL.
摘要:随着大型语言模型取得越来越令人印象深刻的结果,人们开始质疑这种表现是来自泛化能力还是仅仅来自数据记忆。因此,已经提出了许多数据污染检测方法。然而,这些方法通常是在传统基准和早期LLM上验证的,在更具挑战性的基准上评估最先进LLM的污染情况时,其有效性仍存在不确定性。为了弥补这一差距,并对SOTA LLM的污染状态和检测方法的稳健性进行双重考察,我们使用四个最先进的LLM,在八个常用于现代LLM评估、具有挑战性的数据集上评估了五种污染检测方法。我们的分析表明:(1)现有方法在假设和实际应用方面存在不可忽视的局限性;(2)检测在指令微调并配合答案增强时引入的污染存在明显困难;(3)各SOTA污染检测技术之间的一致性有限。这些发现突显了先进LLM中污染检测的复杂性,以及进一步研究稳健且可推广的污染评估的迫切需要。我们的代码可以在此 https URL 上找到。

[NLP-34] SFR-RAG: Towards Contextually Faithful LLMs
[NLP-34] SFR-RAG:迈向上下文忠实的LLM

链接: https://arxiv.org/abs/2409.09916
作者: Xuan-Phi Nguyen,Shrey Pandit,Senthil Purushwalkam,Austin Xu,Hailin Chen,Yifei Ming,Zixuan Ke,Silvio Savarese,Caiming Xong,Shafiq Joty
关键词-EN: Retrieval Augmented Generation, Retrieval Augmented, enhance factual accuracy, integrates external contextual, large language models
关键词-ZH: 检索增强生成、检索增强、增强事实准确性、集成外部上下文、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Technical report

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG), a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance, has emerged as a pivotal area in generative AI. The LLMs used in RAG applications are required to faithfully and completely comprehend the provided context and users’ questions, avoid hallucination, handle unanswerable, counterfactual or otherwise low-quality and irrelevant contexts, perform complex multi-hop reasoning and produce reliable citations. In this paper, we introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization. We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks, such as HotpotQA and TriviaQA, with consistent RAG settings to ensure reproducibility and consistency in model assessments. Experimental results demonstrate that our SFR-RAG-9B model outperforms leading baselines such as Command-R+ (104B) and GPT-4o, achieving state-of-the-art results in 3 out of 7 benchmarks in ContextualBench with significantly fewer parameters. The model is also shown to be resilient to alteration in the contextual information and behave appropriately when relevant context is removed. Additionally, the SFR-RAG model maintains competitive performance in general instruction-following tasks and function-calling capabilities.
摘要:检索增强生成(RAG)是一种将外部上下文信息与大型语言模型(LLM)相结合以提高事实准确性和相关性的范例,已成为生成式人工智能的一个关键领域。RAG应用中使用的LLM需要忠实和完整地理解所提供的上下文和用户的问题,避免产生幻觉,处理不可回答的、反事实的或其他低质量和无关的上下文,执行复杂的多跳推理并产生可靠的引用。在本文中,我们介绍了SFR-RAG,一个小的LLM,它是指令调优的,重点是基于上下文的生成和幻觉最小化。我们还提出了一个新的评估框架ContextualBench,它汇编了多个流行且多样化的RAG基准,如HotpotQA和TriviaQA,并采用一致的RAG设置,以确保模型评估的可复现性和一致性。实验结果表明,我们的SFR-RAG-9B模型的性能优于Command-R+(104B)和GPT-4o等领先基线,在ContextualBench的7个基准测试中有3个达到了最先进的结果,并且参数量显著更少。该模型还被证明对上下文信息的改变具有弹性,并在相关上下文被移除时表现恰当。此外,SFR-RAG模型在一般指令遵循任务和函数调用能力方面保持了有竞争力的性能。

[NLP-35] Rediscovering the Latent Dimensions of Personality with Large Language Models as Trait Descriptors
[NLP-35] 以大型语言模型作为特质描述符重新发现人格的潜在维度

链接: https://arxiv.org/abs/2409.09905
作者: Joseph Suh,Suhong Moon,Minwoo Kang,David M. Chan
关键词-EN: Assessing personality traits, large language models, Assessing personality, area of research, large language
关键词-ZH: 评估性格特征,大型语言模型,评估性格,研究领域,大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Assessing personality traits using large language models (LLMs) has emerged as an interesting and challenging area of research. While previous methods employ explicit questionnaires, often derived from the Big Five model of personality, we hypothesize that LLMs implicitly encode notions of personality when modeling next-token responses. To demonstrate this, we introduce a novel approach that uncovers latent personality dimensions in LLMs by applying singular value decomposition (SVD) to the log-probabilities of trait-descriptive adjectives. Our experiments show that LLMs “rediscover” core personality traits such as extraversion, agreeableness, conscientiousness, neuroticism, and openness without relying on direct questionnaire inputs, with the top-5 factors corresponding to Big Five traits explaining 74.3% of the variance in the latent space. Moreover, we can use the derived principal components to assess personality along the Big Five dimensions, and achieve improvements in average personality prediction accuracy of up to 5% over fine-tuned models, and up to 21% over direct LLM-based scoring techniques.
摘要:使用大型语言模型评估人格特征已成为一个有趣而富有挑战性的研究领域。虽然以前的方法使用显性问卷,通常源自五大人格模型,但我们假设,在对下一代币反应进行建模时,LLM隐含地编码了人格概念。为了证明这一点,我们引入了一种新的方法,通过将奇异值分解(SVD)应用于特征描述形容词的对数概率来揭示LLMS中的潜在人格维度。我们的实验表明,LLMS在不依赖直接问卷输入的情况下,重新发现了外向、随和、尽责、神经质和开放性等核心人格特质,前五个因素对应于大五特质,解释了潜在空间中74.3%的变异。此外,我们可以使用派生的主成分来评估五大维度的个性,并实现了比微调模型的平均个性预测精度高达5%,比基于LLM的直接评分技术高达21%的平均个性预测准确率。
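论文的核心操作是对"特质形容词对数概率矩阵"做奇异值分解。下面的 numpy 草图仅示意这一流程(矩阵内容为随机占位数据,行列含义为按摘要理解所作的假设):

```python
import numpy as np

# 假设: 行对应被评估的人格画像/角色, 列对应特质描述形容词,
# 元素为 LLM 对"<画像> 是 <形容词> 的"这类续写中形容词 token 的对数概率
log_probs = np.random.randn(200, 100)          # 仅作占位的随机矩阵

X = log_probs - log_probs.mean(axis=0, keepdims=True)   # 按列去均值
U, S, Vt = np.linalg.svd(X, full_matrices=False)

explained = S**2 / np.sum(S**2)
print("前 5 个奇异方向解释的方差比例:", explained[:5].sum())

# Vt 的前几行即潜在人格维度在形容词空间中的载荷,
# 可与"大五"人格量表的形容词分组对照解读
top_dims = Vt[:5]
```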

[NLP-36] Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning
[NLP-36] 通过多任务学习从转录的语音音频中获取发音知识

链接: https://arxiv.org/abs/2409.09891
作者: Siqi Sun,Korin Richmond
关键词-EN: transcribed speech audio, traditional pipeline-based frontend, Recent work, speech audio, additional training source
关键词-ZH: 转录的语音音频、传统的基于管道的前端、最近的工作、语音音频、额外的训练源
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages

点击查看摘要

Abstract:Recent work has shown the feasibility and benefit of bootstrapping an integrated sequence-to-sequence (Seq2Seq) linguistic frontend from a traditional pipeline-based frontend for text-to-speech (TTS). To overcome the fixed lexical coverage of bootstrapping training data, previous work has proposed to leverage easily accessible transcribed speech audio as an additional training source for acquiring novel pronunciation knowledge for uncovered words, which relies on an auxiliary ASR model as part of a cumbersome implementation flow. In this work, we propose an alternative method to leverage transcribed speech audio as an additional training source, based on multi-task learning (MTL). Experiments show that, compared to a baseline Seq2Seq frontend, the proposed MTL-based method reduces PER from 2.5% to 1.6% for those word types covered exclusively in transcribed speech audio, achieving a similar performance to the previous method but with a much simpler implementation flow.
摘要:最近的工作表明,从传统的基于流水线的文本到语音(TTS)前端引导出一个集成的序列到序列(Seq2Seq)语言前端是可行且有益的。为了克服自举训练数据词汇覆盖固定的问题,之前的工作提出利用易于获取的转录语音音频作为额外的训练源,为未覆盖的单词获取新的发音知识,但这依赖于一个辅助ASR模型,实施流程较为繁琐。在这项工作中,我们提出了一种基于多任务学习(MTL)的替代方法,利用转录语音音频作为额外的训练源。实验表明,与基线Seq2Seq前端相比,对于那些仅在转录语音音频中出现的词型,所提出的基于MTL的方法将PER从2.5%降低到1.6%,取得了与之前方法相近的性能,但实现流程要简单得多。
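下面给出一个"共享编码器 + 双任务头"的 PyTorch 多任务学习草图(并非论文的前端模型结构;网络规模、损失权重等均为假设),示意如何把词典监督与转录语音监督合并到同一个损失中训练:

```python
import torch
import torch.nn as nn

class MTLFrontend(nn.Module):
    """共享编码器 + 两个任务头: 主头由发音词典监督, 辅头由转录语音数据监督。"""
    def __init__(self, vocab=100, hidden=256, phones=60):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head_main = nn.Linear(2 * hidden, phones)   # 词典监督的 G2P 头
        self.head_aux = nn.Linear(2 * hidden, phones)    # 转录语音监督的辅助头

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.head_main(h), self.head_aux(h)

model = MTLFrontend()
tokens = torch.randint(0, 100, (8, 20))                  # 假设的字符序列批次
gold_main = torch.randint(0, 60, (8, 20))                # 假设的音素标签
gold_aux = torch.randint(0, 60, (8, 20))
logit_main, logit_aux = model(tokens)
ce = nn.CrossEntropyLoss()
# 两个任务的损失加权合并, 0.5 为假设的辅任务权重
loss = ce(logit_main.reshape(-1, 60), gold_main.reshape(-1)) \
     + 0.5 * ce(logit_aux.reshape(-1, 60), gold_aux.reshape(-1))
loss.backward()
```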

[NLP-37] Constructing a Singing Style Caption Dataset
[NLP-37] 构建歌唱风格字幕数据集

链接: https://arxiv.org/abs/2409.09866
作者: Hyunjong Ok,Jaeho Lee
关键词-EN: Singing voice synthesis, voice generation, synthesis and conversion, conversion have emerged, emerged as significant
关键词-ZH: 歌唱声音合成、声音生成、合成与转换、转换都出现了,出现了意义重大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Preprint

点击查看摘要

Abstract:Singing voice synthesis and conversion have emerged as significant subdomains of voice generation, leading to growing demand for prompt-conditioned generation. Unlike common voice data, generating a singing voice requires an understanding of various associated vocal and musical characteristics, such as the vocal tone of the singer or emotional expressions. However, existing open-source audio-text datasets for voice generation tend to capture only a very limited range of attributes, often missing musical characteristics of the audio. To fill this gap, we introduce S2Cap, an audio-text pair dataset with a diverse set of attributes. S2Cap consists of pairs of textual prompts and music audio samples with a wide range of vocal and musical attributes, including pitch, volume, tempo, mood, singer’s gender and age, and musical genre and emotional expression. Utilizing S2Cap, we suggest an effective novel baseline algorithm for singing style captioning. Singing style captioning, which we propose here for the first time, is a task related to voice generation that produces text descriptions of vocal characteristics. First, to mitigate the misalignment between the audio encoder and the text decoder, we present a novel mechanism called CRESCENDO, which utilizes positive-pair similarity learning to synchronize the embedding space of a pretrained audio encoder with that of a text encoder. We additionally supervise the model using the singer’s voice, demixed from the accompaniment. This supervision allows the model to more accurately capture vocal characteristics, leading to improved singing style captions that better reflect the style of the singer. The dataset and the codes are available at this https URL.
摘要:歌唱语音合成与转换已成为语音生成的重要子领域,对提示条件生成提出了更高的要求。与常见的声音数据不同,生成歌唱声音需要了解各种相关的声音和音乐特征,例如歌手的声调或情感表达。然而,现有的用于语音生成的开源音频-文本数据集往往只捕获非常有限的属性,常常缺少音频的音乐特征。为了填补这一空白,我们引入了S2Cap,这是一个具有多样属性集的音频-文本对数据集。S2Cap由成对的文字提示和音乐音频样本组成,涵盖广泛的声乐与音乐属性,包括音高、音量、节奏、情绪、歌手的性别和年龄,以及音乐流派和情感表达。利用S2Cap,我们提出了一种有效的歌唱风格描述基线算法。歌唱风格描述(singing style captioning)是我们首次提出的一项与声音生成相关的任务,旨在生成对声音特征的文本描述。首先,为了缓解音频编码器和文本解码器之间的错位,我们提出了一种名为CRESCENDO的新机制,该机制利用正样本对相似性学习,使预训练音频编码器的嵌入空间与文本编码器的嵌入空间保持同步。我们还使用从伴奏中分离出的歌手人声来监督模型。这种监督使模型能够更准确地捕捉声音特征,从而得到更能反映歌手风格的演唱风格描述。数据集和代码可在此 https URL 上找到。

[NLP-38] A Benchmark Dataset with Larger Context for Non-Factoid Question Answering over Islamic Text
[NLP-38] 针对伊斯兰文本的非事实问题回答的具有更大背景的基准数据集

链接: https://arxiv.org/abs/2409.09844
作者: Faiza Qamar,Seemab Latif,Rabia Latif
关键词-EN: Prophet Muhammad, today digital era, digital era necessitates, era necessitates efficient, Accessing and comprehending
关键词-ZH: 先知穆罕默德,今天的数字时代,数字时代需要,时代需要高效、简化和理解
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accessing and comprehending religious texts, particularly the Quran (the sacred scripture of Islam) and Ahadith (the corpus of the sayings or traditions of the Prophet Muhammad), in today’s digital era necessitates efficient and accurate Question-Answering (QA) systems. Yet, the scarcity of QA systems tailored specifically to the detailed nature of inquiries about the Quranic Tafsir (explanation, interpretation, context of Quran for clarity) and Ahadith poses significant challenges. To address this gap, we introduce a comprehensive dataset meticulously crafted for QA purposes within the domain of Quranic Tafsir and Ahadith. This dataset comprises a robust collection of over 73,000 question-answer pairs, standing as the largest reported dataset in this specialized domain. Importantly, both questions and answers within the dataset are meticulously enriched with contextual information, serving as invaluable resources for training and evaluating tailored QA systems. However, while this paper highlights the dataset’s contributions and establishes a benchmark for evaluating QA performance in the Quran and Ahadith domains, our subsequent human evaluation uncovered critical insights regarding the limitations of existing automatic evaluation techniques. The discrepancy between automatic evaluation metrics, such as ROUGE scores, and human assessments became apparent. The human evaluation indicated significant disparities: the model’s verdict consistency with expert scholars ranged between 11% to 20%, while its contextual understanding spanned a broader spectrum of 50% to 90%. These findings underscore the necessity for evaluation techniques that capture the nuances and complexities inherent in understanding religious texts, surpassing the limitations of traditional automatic metrics.
摘要:在今天的数字时代,访问和理解宗教文本,特别是古兰经(伊斯兰教的神圣经文)和阿哈迪斯(先知穆罕默德的格言或传统语料库),需要高效和准确的问答系统。然而,缺乏专门针对关于古兰经塔夫塞尔(解释、解释、古兰经上下文以求清楚)和阿哈迪的详细询问的QA系统构成了巨大的挑战。为了弥补这一差距,我们引入了一个全面的数据集,在古兰经塔夫塞尔和阿哈迪斯的领域内为质量保证目的精心制作。该数据集包括73,000多个问答对的强大集合,是该专业领域中报告的最大数据集。重要的是,数据集中的问题和答案都精心地丰富了上下文信息,成为培训和评估定制的QA系统的宝贵资源。然而,尽管这篇文章强调了数据集的贡献,并为评估Quran和Ahadith领域的QA性能建立了一个基准,但我们随后的人工评估揭示了关于现有自动评估技术局限性的关键见解。自动评估指标(如Rouge分数)和人工评估之间的差异变得明显。人类的评估显示出显著的差异:该模型与专家学者的结论一致性在11%到20%之间,而它对上下文的理解跨越了50%到90%的更广泛的范围。这些发现强调了评估技术的必要性,这种评估技术捕捉到理解宗教经文所固有的细微差别和复杂性,超越了传统自动度量的局限性。

[NLP-39] Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling
[NLP-39] 使用掩蔽语言建模生成具有低重新识别风险的合成自由文本医疗记录

链接: https://arxiv.org/abs/2409.09831
作者: Samuel Belkadi,Libo Ren,Nicolo Micheletti,Lifeng Han,Goran Nenadic
关键词-EN: Masked Language Modeling, Language Modeling, Masked Language, generates synthetic free-text, free-text medical records
关键词-ZH: 掩蔽语言建模、语言建模、掩蔽语言、生成合成自由文本、自由文本医疗记录
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we present a system that generates synthetic free-text medical records, such as discharge summaries, admission notes and doctor correspondences, using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk. The system incorporates a de-identification component that uses Philter to mask Protected Health Information (PHI), followed by a Medical Entity Recognition (NER) model to retain key medical information. We explore various masking ratios and mask-filling techniques to balance the trade-off between diversity and fidelity in the synthetic outputs without affecting overall readability. Our results demonstrate that the system can produce high-quality synthetic data with significant diversity while achieving a HIPAA-compliant PHI recall rate of 0.96 and a low re-identification risk of 0.035. Furthermore, downstream evaluations using a NER task reveal that the synthetic data can be effectively used to train models with performance comparable to those trained on real data. The flexibility of the system allows it to be adapted for specific use cases, making it a valuable tool for privacy-preserving data generation in medical research and healthcare applications.
摘要:在本文中,我们提出了一个使用掩蔽语言建模(MLM)来生成合成自由文本病历的系统,如出院摘要、入院记录和医生通信。我们的系统旨在保留记录的关键信息,同时引入显著的多样性,并将重新识别的风险降至最低。该系统包括一个去身份识别组件,该组件使用Philter来屏蔽受保护的健康信息(PHI),然后是一个医疗实体识别(NER)模型来保留关键的医疗信息。我们探索了各种掩码比率和掩码填充技术,以在不影响整体可读性的情况下权衡合成输出中多样性与保真度之间的取舍。实验结果表明,该系统可以产生具有显著多样性的高质量合成数据,同时获得了0.96的符合HIPAA标准的PHI召回率和0.035的低重新识别风险。此外,使用NER任务的下游评估表明,合成数据可以有效地用于训练模型,其性能与基于真实数据训练的模型相当。该系统的灵活性使其能够适应特定的用例,使其成为医学研究和医疗保健应用程序中隐私保护数据生成的宝贵工具。
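下面的草图示意"先遮蔽 PHI、再用掩码语言模型填充"的基本流程(非论文系统的完整实现;示例病历、PHI 列表与模型名均为假设,实际系统使用 Philter 做去标识化并配合 NER 保留关键医疗信息,模型需联网下载):

```python
from transformers import pipeline

# 假设去标识化阶段已标出记录中的 PHI 片段(姓名、日期等), 这里直接写死作演示
record = "Patient John Smith was admitted on Monday with chest pain."
phi_spans = ["John Smith", "Monday"]

fill = pipeline("fill-mask", model="distilbert-base-uncased")
mask_token = fill.tokenizer.mask_token           # distilbert 的掩码符为 [MASK]

synthetic = record
for span in phi_spans:
    masked = synthetic.replace(span, mask_token, 1)      # 每次只遮蔽一个 PHI 片段
    best = fill(masked)[0]                               # 取得分最高的候选填充词
    synthetic = synthetic.replace(span, best["token_str"], 1)

print(synthetic)    # 得到不含原始 PHI 的合成病历文本
```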

[NLP-40] GP-GPT: Large Language Model for Gene-Phenotype Mapping
[NLP-40] GP-GPT:基因表型映射的大型语言模型

链接: https://arxiv.org/abs/2409.09825
作者: Yanjun Lyu,Zihao Wu,Lu Zhang,Jing Zhang,Yiwei Li,Wei Ruan,Zhengliang Liu,Xiaowei Yu,Chao Cao,Tong Chen,Minheng Chen,Yan Zhuang,Xiang Li,Rongjie Liu,Chao Huang,Wentao Li,Tianming Liu,Dajiang Zhu
关键词-EN: Pre-trained large language, attracted increasing attention, natural language processing, biomedical domains due, Pre-trained large
关键词-ZH: 预训练大型语言,引起越来越多的关注,自然语言处理,生物医学领域,预训练大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained large language models(LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-sources genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical field. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3 and GPT-4. These results highlight GP-GPT’s potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation demonstrated the subtle changes of bio-factor entities’ representations in the GP-GPT, which suggested the opportunities for the application of LLMs to advancing gene-phenotype research.
摘要:预训练的大语言模型因其在自然语言处理方面的成功而在生物医学领域受到越来越多的关注。然而,多源基因组数据的复杂特性和异质性给将这些模型应用于生物信息学和生物医学领域带来了巨大的挑战。为了应对这些挑战,我们提出了第一个专门的大型语言模型GP-GPT,用于基因-表型知识表示和基因组关系分析。我们的模型在一个全面的语料库上分两个阶段进行了微调,该语料库由基因组学、蛋白质组学和医学遗传学中的300多万个术语组成,来自多个大规模验证的数据集和科学出版物。GP-GPT在准确检索医学遗传学信息和执行常见的基因组分析任务方面表现出熟练的能力,如基因组信息检索和关系确定。跨特定领域任务的对比实验表明,GP-GPT的性能优于最先进的LLMS,包括Llama2、Llama3和GPT-4。这些结果突出了GP-GPT在加强遗传病关系研究和促进基因组学和医学遗传学领域准确和有效的分析方面的潜力。我们的研究表明,生物因子实体在GP-GPT中的表达发生了细微的变化,这表明LLMS在推进基因表型研究方面有机会应用。

[NLP-41] Causal Inference with Large Language Model: A Survey
[NLP-41] 大语言模型的因果推理:一项调查

链接: https://arxiv.org/abs/2409.09822
作者: Jing Ma
关键词-EN: data mining capabilities, mathematical reasoning, medicine and economics, demanding a complicated, human knowledge
关键词-ZH: 数据挖掘能力、数学推理、医学和经济学,需要复杂的人类知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Causal inference has been a pivotal challenge across diverse domains such as medicine and economics, demanding a complicated integration of human knowledge, mathematical reasoning, and data mining capabilities. Recent advancements in natural language processing (NLP), particularly with the advent of large language models (LLMs), have introduced promising opportunities for traditional causal inference tasks. This paper reviews recent progress in applying LLMs to causal inference, encompassing various tasks spanning different levels of causation. We summarize the main causal problems and approaches, and present a comparison of their evaluation results in different causal scenarios. Furthermore, we discuss key findings and outline directions for future research, underscoring the potential implications of integrating LLMs in advancing causal inference methodologies.
摘要:因果推理一直是医学和经济学等各个领域的关键挑战,需要人类知识、数学推理和数据挖掘能力的复杂集成。自然语言处理(NLP)的最新进展,特别是随着大型语言模型(LLM)的出现,为传统的因果推理任务带来了有希望的机会。本文回顾了将LLM应用于因果推理的最新进展,涵盖了跨越不同因果关系水平的各种任务。我们总结了主要的因果问题和方法,并比较了不同因果情景下的评估结果。此外,我们还讨论了关键发现并概述了未来研究的方向,强调了整合法学硕士在推进因果推理方法论方面的潜在影响。

[NLP-42] Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
[NLP-42] 使用参考对象推理路径在大型视觉语言模型中激发量化空间推理

链接: https://arxiv.org/abs/2409.09788
作者: Yuan-Hong Liao,Rafid Mahmood,Sanja Fidler,David Acuna
关键词-EN: demonstrating vision-language models’, recent advances demonstrating, advances demonstrating vision-language, describe complex relationships, distances remains underexplored
关键词-ZH: 展示视觉语言模型,最近的进展展示,展示视觉语言的进展,描述复杂的关系,距离仍然未充分解释
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 20 pages, 13 figures

点击查看摘要

Abstract:Despite recent advances demonstrating vision-language models’ (VLMs) abilities to describe complex relationships in images using natural language, their capability to quantitatively reason about object sizes and distances remains underexplored. In this work, we introduce a manually annotated benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning and systematically investigate the performance of state-of-the-art VLMs on this task. Our analysis reveals that reasoning about distances between objects is particularly challenging for SoTA VLMs; however, some VLMs significantly outperform others, with an over 40-point gap between the two best performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object emerges naturally in the response. Inspired by this observation, we develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues. By instructing VLMs to use reference objects in their reasoning paths via SpatialPrompt, Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4V improve their success rates by over 40, 20, and 30 points, respectively. We emphasize that these significant improvements are obtained without needing more data, model architectural modifications, or fine-tuning.
摘要:尽管最近的研究表明视觉语言模型(VLM)能够使用自然语言描述图像中的复杂关系,但它们对物体大小和距离的定量推理能力仍未得到充分研究。在这项工作中,我们介绍了一个人工标注的基准Q-Spatial Bench,包含五个类别共271个定量空间推理问题,并系统地研究了最先进的VLM在这一任务上的表现。我们的分析表明,对于SOTA VLM来说,关于物体之间距离的推理尤其具有挑战性;然而,一些VLM的表现明显优于其他VLM,两个表现最好的模型之间的差距超过40点。我们还观察到一个令人惊讶的现象:当使用参考对象的推理路径在回答中自然出现时,表现最好的VLM的成功率提高了19个点。受此启发,我们开发了一种零样本提示技术SpatialPrompt,该技术鼓励VLM使用参考对象作为视觉线索来回答定量空间问题。通过SpatialPrompt指示VLM在推理路径中使用参考对象后,Gemini 1.5 Pro、Gemini 1.5 Flash和GPT-4V的成功率分别提高了40、20和30点以上。我们强调,这些显著改进无需更多数据、模型架构修改或微调即可实现。

[NLP-43] Large Language Model Based Generative Error Correction: A Challenge and Baselines forSpeech Recognition Speaker Tagging and Emotion Recognition
[NLP-43] 基于大语言模型的生成式错误纠正:语音识别说话人标记和情感识别的挑战和基线

链接: https://arxiv.org/abs/2409.09785
作者: Chao-Han Huck Yang,Taejin Park,Yuan Gong,Yuanchao Li,Zhehuai Chen,Yen-Ting Lin,Chen Chen,Yuchen Hu,Kunal Dhawan,Piotr Żelasko,Chao Zhang,Yun-Nung Chen,Yu Tsao,Jagadeesh Balam,Boris Ginsburg,Sabato Marco Siniscalchi,Eng Siong Chng,Peter Bell,Catherine Lai,Shinji Watanabe,Andreas Stolcke
关键词-EN: text decoding results, enhance acoustic modeling, automatic speech recognition, ASR, recent advances
关键词-ZH: 文本解码结果、增强声学建模、自动语音识别、ASB、最新进展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: IEEE SLT 2024. The initial draft version has been done in December 2023. Post-ASR Text Processing and Understanding Community: this https URL

点击查看摘要

Abstract:Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
摘要:鉴于生成式人工智能技术的最新进展,一个关键问题是大型语言模型(LLM)如何使用来自冻结、预训练的自动语音识别(ASR)模型的文本解码结果来增强声学建模任务。为了探索语音处理语言建模的新能力,我们引入了生成式语音转录错误纠正(GenSEC)挑战。该挑战包括三个后ASR语言建模任务:(i)后ASR转录纠正,(ii)说话人标记,和(iii)情感识别。这些任务旨在模拟未来基于LLM的代理处理基于语音的界面,同时通过利用开放的预训练语言模型或基于代理的API来保持对广泛受众的可及性。我们还讨论了基线评估的见解,以及为设计未来评估所积累的经验教训。

[NLP-44] ELMI: Interactive and Intelligent Sign Language Translation of Lyrics for Song Signing
[NLP-44] ELMI:歌曲签名歌词的交互式智能手语翻译

链接: https://arxiv.org/abs/2409.09760
作者: Suhyeon Yoo,Khai N. Truong,Young-Ho Kim
关键词-EN: Deaf and hearing, language remains cumbersome, sign language remains, video-sharing platforms, cumbersome and inaccessible
关键词-ZH: 聋人和听力,语言仍然笨重,手语仍然存在,视频共享平台,笨重且无法访问
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages excluding reference and appendix

点击查看摘要

Abstract:d/Deaf and hearing song-signers become prevalent on video-sharing platforms, but translating songs into sign language remains cumbersome and inaccessible. Our formative study revealed the challenges song-signers face, including semantic, syntactic, expressive, and rhythmic considerations in translations. We present ELMI, an accessible song-signing tool that assists in translating lyrics into sign language. ELMI enables users to edit glosses line-by-line, with real-time synced lyric highlighting and music video snippets. Users can also chat with a large language model-driven AI to discuss meaning, glossing, emoting, and timing. Through an exploratory study with 13 song-signers, we examined how ELMI facilitates their workflows and how song-signers leverage and receive an LLM-driven chat for translation. Participants successfully adopted ELMI to song-signing, with active discussions on the fly. They also reported improved confidence and independence in their translations, finding ELMI encouraging, constructive, and informative. We discuss design implications for leveraging LLMs in culturally sensitive song-signing translations.
摘要:D/聋人和听力歌曲签名者在视频分享平台上变得流行起来,但将歌曲翻译成手语仍然很麻烦,而且难以访问。我们的形成性研究揭示了歌曲签名者面临的挑战,包括语义、句法、表达和节奏方面的翻译考虑。我们提供ELMI,一个可访问的歌曲签名工具,帮助将歌词翻译成手语。Elmi使用户能够逐行编辑注解,并实时同步歌词突出显示和音乐视频片段。用户还可以与大型语言模型驱动的AI聊天,讨论含义、注释、表情和时间安排。通过对13名歌曲签名者的探索性研究,我们考察了ELMI如何促进他们的工作流程,以及歌曲签名者如何利用和接收LLM驱动的聊天进行翻译。参与者成功地将ELMI应用于歌曲签名,并在现场进行了积极的讨论。他们还报告说,他们在翻译中提高了信心和独立性,认为ELMI令人鼓舞、具有建设性和信息量。我们讨论了在文化敏感的歌曲签名翻译中利用LLMS的设计含义。

[NLP-45] Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data
[NLP-45] 政治内容文本注释中的LLM基准:毒性和不文明数据的概念证明

链接: https://arxiv.org/abs/2409.09741
作者: Bastián González-Bustamante
关键词-EN: political content, Nous Hermes, article benchmarked, benchmarked the ability, ability of OpenAI
关键词-ZH: 政治内容,Nous Hermes,文章基准,基准OpenAI的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Paper prepared for delivery at the 8th Monash-Warwick-Zurich Text-as-Data Workshop, September 16-17, 2024: 11 pages, 3 tables, 3 figures

点击查看摘要

Abstract:This article benchmarked the ability of OpenAI’s GPTs and a number of open-source LLMs to perform annotation tasks on political content. We used a novel protest event dataset comprising more than three million digital interactions and created a gold standard that includes ground-truth labels annotated by human coders about toxicity and incivility on social media. We included in our benchmark Google’s Perspective algorithm, which, along with GPTs, was employed throughout their respective APIs while the open-source LLMs were deployed locally. The findings show that Perspective API using a laxer threshold, GPT-4o, and Nous Hermes 2 Mixtral outperform other LLM’s zero-shot classification annotations. In addition, Nous Hermes 2 and Mistral OpenOrca, with a smaller number of parameters, are able to perform the task with high performance, being attractive options that could offer good trade-offs between performance, implementing costs and computing time. Ancillary findings using experiments setting different temperature levels show that although GPTs tend to show not only excellent computing time but also overall good levels of reliability, only open-source LLMs ensure full reproducibility in the annotation.
摘要:本文对OpenAI的GPT和一些开源LLM执行政治内容注释任务的能力进行了基准测试。我们使用了一个包含300多万个数字互动的新型抗议事件数据集,并创建了一个黄金标准,其中包括由人类编码员标注的关于社交媒体上毒性和不文明行为的真实(ground-truth)标签。我们还在基准测试中纳入了Google的Perspective算法,该算法与GPT一样通过各自的API调用,而开源LLM则在本地部署。结果表明,使用较宽松阈值的Perspective API、GPT-4o和Nous Hermes 2 Mixtral的表现优于其他LLM的零样本分类标注。此外,Nous Hermes 2和Mistral OpenOrca的参数数量较少,也能够高质量地完成任务,是有吸引力的选项,可以在性能、实施成本和计算时间之间取得良好的权衡。使用不同温度设置的实验的辅助发现表明,尽管GPT不仅计算时间出色,而且总体可靠性良好,但只有开源LLM才能确保标注的完全可复现性。

[NLP-46] PersonaMark: Personalized LLM watermarking for model protection and user attribution
[NLP-46] PersonaMark:用于模型保护和用户归因的个性化LLM水印

链接: https://arxiv.org/abs/2409.09739
作者: Yuehan Zhang,Peizhuo Lv,Yinpeng Liu,Yongqiang Ma,Wei Lu,Xiaofeng Wang,Xiaozhong Liu,Jiawei Liu
关键词-EN: potential threats, rapid development, brings both convenience, convenience and potential, Text watermarking
关键词-ZH: 潜在威胁,快速发展,既带来便利,又带来潜力,文本水印
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:The rapid development of LLMs brings both convenience and potential threats. As customized and private LLMs are widely applied, model copyright protection has become important. Text watermarking is emerging as a promising solution to AI-generated text detection and model protection issues. However, current text watermarks have largely ignored the critical need for injecting different watermarks for different users, which could help attribute the watermark to a specific individual. In this paper, we explore the personalized text watermarking scheme for LLM copyright protection and other scenarios, ensuring accountability and traceability in content generation. Specifically, we propose a novel text watermarking method PersonaMark that utilizes sentence structure as the hidden medium for the watermark information and optimizes the sentence-level generation algorithm to minimize disruption to the model’s natural generation process. By employing a personalized hashing function to inject unique watermark signals for different users, personalized watermarked text can be obtained. Since our approach performs on sentence level instead of token probability, the text quality is highly preserved. The injection process of unique watermark signals for different users is time-efficient for a large number of users with the designed multi-user hashing function. To the best of our knowledge, this is the first work to achieve personalized text watermarking. We conduct an extensive evaluation of four different LLMs in terms of perplexity, sentiment polarity, alignment, readability, etc. The results demonstrate that our method maintains performance with minimal perturbation to the model’s behavior, allows for unbiased insertion of watermark information, and exhibits strong watermark recognition capabilities.
摘要:LLM的快速发展在带来便利的同时,也带来了潜在的威胁。随着定制化和私有LLM的广泛应用,模型版权保护变得非常重要。文本水印是人工智能生成文本检测和模型保护问题的一种很有前途的解决方案。然而,当前的文本水印在很大程度上忽略了为不同用户注入不同水印的迫切需要,而这有助于将水印归因于特定的个人。在本文中,我们探索了用于LLM版权保护和其他场景的个性化文本水印方案,以确保内容生成的可问责性和可追溯性。具体地说,我们提出了一种新的文本水印方法PersonaMark,该方法利用句子结构作为水印信息的隐藏媒介,并优化了句子级生成算法,以最大限度地减少对模型自然生成过程的干扰。通过使用个性化哈希函数为不同的用户注入唯一的水印信号,可以获得个性化的水印文本。由于我们的方法是在句子级别执行的,而不是在词元概率上执行,所以文本质量得到了很好的保持。借助所设计的多用户哈希函数,为不同用户注入唯一水印信号的过程在用户数量很大时也具有较高的时间效率。据我们所知,这是首个实现个性化文本水印的工作。我们从困惑度、情感极性、对齐、可读性等方面对四种不同的LLM进行了广泛的评估。结果表明,我们的方法在对模型行为扰动最小的情况下保持了性能,允许无偏地插入水印信息,并且表现出很强的水印识别能力。
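下面用一个纯 Python 草图示意"个性化哈希注入水印 + 归因检测"的思路(与论文基于句子结构的做法不同,这里简化为在同义候选句之间做选择;用户 ID 与候选句均为虚构,仅说明原理,可靠归因需要更长的文本):

```python
import hashlib

def user_score(user_id, sentence):
    """由用户 ID 和句子内容共同决定的哈希分数(0~1), 充当个性化水印信号。"""
    h = hashlib.sha256(f"{user_id}||{sentence}".encode()).hexdigest()
    return int(h[:8], 16) / 0xFFFFFFFF

def watermark(user_id, candidate_sentences_per_step):
    """生成阶段: 在每一步的同义候选句中, 选取对该用户哈希分数最高的一句。"""
    return [max(cands, key=lambda s: user_score(user_id, s))
            for cands in candidate_sentences_per_step]

def attribute(text_sentences, user_ids):
    """检测阶段: 哪个用户的平均哈希分数偏高, 文本就更可能出自为其定制的模型。"""
    avg = {u: sum(user_score(u, s) for s in text_sentences) / len(text_sentences)
           for u in user_ids}
    return max(avg, key=avg.get), avg

# 假设每一步由 LLM 给出若干语义等价的候选句(此处手工代替)
steps = [
    ["今天天气很好。", "今天的天气真不错。", "天气今天相当好。"],
    ["我们出去散步吧。", "不如一起出门走走。", "一起去外面散散步吧。"],
]
text_u1 = watermark("user-001", steps)
print(attribute(text_u1, ["user-001", "user-002", "user-003"]))
```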

[NLP-47] Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents
[NLP-47] 具有类人推理的自动控制:探索空中交通代理的语言模型

链接: https://arxiv.org/abs/2409.09717
作者: Justas Andriuškevičius,Junzi Sun
关键词-EN: air traffic control, Recent developments, air traffic, traffic control studies, language models
关键词-ZH: 空中交通管制,最新发展,空中交通,交通管制研究,语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent developments in language models have created new opportunities in air traffic control studies. The current focus is primarily on text and language-based use cases. However, these language models may offer a higher potential impact in the air traffic control domain, thanks to their ability to interact with air traffic environments in an embodied agent form. They also provide a language-like reasoning capability to explain their decisions, which has been a significant roadblock for the implementation of automatic air traffic control. This paper investigates the application of a language model-based agent with function-calling and learning capabilities to resolve air traffic conflicts without human intervention. The main components of this research are foundational large language models, tools that allow the agent to interact with the simulator, and a new concept, the experience library. An innovative part of this research, the experience library, is a vector database that stores synthesized knowledge that agents have learned from interactions with the simulations and language models. To evaluate the performance of our language model-based agent, both open-source and closed-source models were tested. The results of our study reveal significant differences in performance across various configurations of the language model-based agents. The best-performing configuration was able to resolve all but one of the 120 imminent conflict scenarios, involving up to four aircraft at the same time. Most importantly, the agents are able to provide human-level text explanations on traffic situations and conflict resolution strategies.
摘要:语言模型的最新发展为空中交通管制研究创造了新的机遇。当前的重点主要是基于文本和语言的用例。然而,得益于能够以具身智能体的形式与空中交通环境交互,这些语言模型可能在空中交通管制领域带来更大的潜在影响。它们还提供了类似语言的推理能力来解释自己的决定,而缺乏可解释性一直是实施自动空中交通管制的一个重大障碍。本文研究了一种具有函数调用和学习能力的基于语言模型的智能体在无人干预的空中交通冲突解决中的应用。这项研究的主要组成部分是基础大型语言模型、允许智能体与模拟器交互的工具,以及一个新的概念:经验库。经验库是这项研究的一个创新部分,它是一个向量数据库,存储了智能体从与模拟环境和语言模型的交互中学到的综合知识。为了评估我们的基于语言模型的智能体的性能,我们测试了开源和闭源模型。研究结果显示,基于语言模型的智能体在不同配置下的性能存在显著差异。性能最好的配置能够解决120个迫在眉睫的冲突场景中除一个以外的全部场景,其中包括同时涉及多达四架飞机的情形。最重要的是,这些智能体能够就交通态势和冲突解决策略给出达到人类水平的文本解释。

[NLP-48] AlpaPICO: Extraction of PICO Frames from Clinical Trial Documents Using LLMs
[NLP-48] AlpaPICO:使用LLM从临床试验文件中提取PICO框架

链接: https://arxiv.org/abs/2409.09704
作者: Madhusudan Ghosh,Shrimon Mukherjee,Asmit Ganguly,Partha Basuchowdhuri,Sudip Kumar Naskar,Debasis Ganguly
关键词-EN: clinical trial reports, clinical trial, conduct systematic reviews, scrutinizing systematic reviews, systematic reviews
关键词-ZH: 临床试验报告,临床试验,进行系统审查,审查系统审查,系统审查
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at Methods

点击查看摘要

Abstract:In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting Population, Intervention, Comparator, and Outcome (PICO) from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches to PICO frame extraction involve supervised methods that rely on the existence of manually annotated data points in the form of BIO label tagging. Recent approaches, such as In-Context Learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, require the use of labeled examples. In this work, we adopt an ICL strategy by employing the pretrained knowledge of Large Language Models (LLMs), gathered during the pretraining phase of an LLM, to automatically extract the PICO-related terminologies from clinical trial documents in an unsupervised setup, bypassing the need for a large number of annotated data instances. Additionally, to showcase the effectiveness of LLMs in the oracle scenario where a large number of annotated samples are available, we adopt an instruction tuning strategy that employs Low Rank Adaptation (LoRA) to train a gigantic model in a low-resource environment for the PICO frame extraction task. Our empirical results show that our proposed ICL-based framework produces comparable results on all versions of the EBM-NLP dataset, and the proposed instruction-tuned version of our framework produces state-of-the-art results on all the different EBM-NLP datasets. Our project is available at this https URL.
摘要:近年来,临床试验报告的发表激增,这给系统评价带来了挑战。从临床试验研究中自动提取总体、干预、比较器和结果(PICO)可以减轻传统上人工仔细审查系统评价的耗时过程。现有的Pico帧提取方法涉及监督方法,该方法依赖于生物标记形式的人工标注数据点的存在。最近的方法,如在上下文中学习(ICL),已被证明对一些下游NLP任务有效,需要使用标记的示例。在这项工作中,我们采用ICL策略,利用在大语言模型(LLM)的预训练阶段收集的预训练知识,在非监督设置中自动从临床试验文档中提取与Pico相关的术语,从而绕过大量标注数据实例。此外,为了展示LLM在具有大量标注样本的Oracle场景中的最高效率,我们采用了指令调优策略,使用低秩适配(LORA)在低资源环境下对巨型模型进行训练,以完成Pico帧提取任务。我们的实验结果表明,我们提出的基于ICL的框架在所有版本的EBM-NLP数据集上产生了可比较的结果,而我们提出的框架的指令调优版本在所有不同的EBM-NLP数据集上产生了最先进的结果。我们的项目可通过此HTTPS URL获取。
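摘要中提到在低资源环境下用低秩适配(LoRA)进行指令微调。下面是一个基于 Hugging Face peft 的配置草图(非论文官方代码;模型名、秩 r、目标模块名和提示模板等均为示例假设):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")        # 实际工作中应换成更大的指令模型
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                        # 低秩分解的秩
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 的注意力投影层; 换模型时需改成对应模块名
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # 可训练参数占比通常远低于 1%

# 指令微调样本的提示模板(格式为假设, 仅作示意)
prompt = ("从下面的临床试验摘要中抽取 PICO 要素(人群、干预、对照、结局):\n"
          "{abstract}\nPICO:")
# 随后可用 transformers 的 Trainer 等训练脚本在标注数据上进行微调
```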

[NLP-49] ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models
[NLP-49] ExploreSelf:通过大型语言模型的自适应指导,促进用户驱动的对个人挑战的探索和反思

链接: https://arxiv.org/abs/2409.09662
作者: Inhwa Song,SoHyun Park,Sachin R. Pendse,Jessica Lee Schleider,Munmun De Choudhury,Young-Ho Kim
关键词-EN: Expressing stressful experiences, Expressing stressful, physical health, thoughts and emotions, stressful experiences
关键词-ZH: 表达压力经历,表达压力、身体健康、思想和情绪、压力经历
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages excluding reference and appendix

点击查看摘要

Abstract:Expressing stressful experiences in words is proven to improve mental and physical health, but individuals often disengage with writing interventions as they struggle to organize their thoughts and emotions. Reflective prompts have been used to provide direction, and large language models (LLMs) have demonstrated the potential to provide tailored guidance. Current systems often limit users’ flexibility to direct their reflections. We thus present ExploreSelf, an LLM-driven application designed to empower users to control their reflective journey. ExploreSelf allows users to receive adaptive support through dynamically generated questions. Through an exploratory study with 19 participants, we examine how participants explore and reflect on personal challenges using ExploreSelf. Our findings demonstrate that participants valued the balance between guided support and freedom to control their reflective journey, leading to deeper engagement and insight. Building on our findings, we discuss implications for designing LLM-driven tools that promote user empowerment through effective reflective practices.
摘要:事实证明,用语言表达压力经历可以改善身心健康,但当个人难以组织自己的思想和情绪时,往往会中途放弃写作干预。反思性提示被用来提供指导,而大型语言模型(LLM)已经展现出提供定制指导的潜力。目前的系统往往限制了用户自主引导反思的灵活性。因此,我们推出了ExploreSelf,这是一个由LLM驱动的应用程序,旨在使用户能够掌控自己的反思之旅。ExploreSelf允许用户通过动态生成的问题获得自适应支持。通过一项有19名参与者参与的探索性研究,我们考察了参与者如何使用ExploreSelf来探索和反思个人挑战。我们的研究结果表明,参与者重视引导式支持与自主掌控反思之旅的自由之间的平衡,从而带来更深层次的投入和洞察。基于我们的发现,我们讨论了设计LLM驱动工具的启示,这些工具通过有效的反思实践来促进用户赋能。

[NLP-50] Leveraging Open-Source Large Language Models for Native Language Identification
[NLP-50] 利用开源大型语言模型进行母语识别

链接: https://arxiv.org/abs/2409.09659
作者: Yee Man Ng,Ilia Markov
关键词-EN: Native Language Identification, Native Language, Language Identification, applications in forensics, Native
关键词-ZH: 母语识别,母语,语言识别,取证应用,母语
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Native Language Identification (NLI) - the task of identifying the native language (L1) of a person based on their writing in the second language (L2) - has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that heavily rely on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and undisclosed nature of training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.
摘要:母语识别(NLI)-根据一个人的第二语言(L2)写作来识别一个人的母语(L1)的任务-在取证、市场营销和第二语言习得中有应用。从历史上看,严重依赖广泛特征工程的传统机器学习方法在这一任务上的表现优于基于变压器的语言模型。最近,闭源产生式大型语言模型(LLM),如GPT-4,在零镜头环境下的NLI上表现出了显著的性能,包括在开集分类中的良好结果。然而,闭源LLMS有许多缺点,例如成本高,训练数据的性质不公开。这项研究探索了将开源LLM用于NLI的潜力。我们的结果表明,当开箱即用时,开源的LLMS没有达到闭源LLMS的精度水平。然而,当对标记的训练数据进行微调时,开源的LLMS可以获得与商业LLMS相当的性能。

[NLP-51] Unveiling Gender Bias in Large Language Models : Using Teachers Evaluation in Higher Education As an Example
[NLP-51] 揭露大型语言模型中的性别偏见:以高等教育教师评估为例

链接: https://arxiv.org/abs/2409.09652
作者: Yuanning Huang
关键词-EN: Large Language Model, generated teacher evaluations, higher education setting, Language Model, bias in Large
关键词-ZH: 大型语言模型、生成的教师评估、高等教育环境、语言模型、大型偏见
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper investigates gender bias in Large Language Model (LLM)-generated teacher evaluations in higher education setting, focusing on evaluations produced by GPT-4 across six academic subjects. By applying a comprehensive analytical framework that includes Odds Ratio (OR) analysis, Word Embedding Association Test (WEAT), sentiment analysis, and contextual analysis, this paper identified patterns of gender-associated language reflecting societal stereotypes. Specifically, words related to approachability and support were used more frequently for female instructors, while words related to entertainment were predominantly used for male instructors, aligning with the concepts of communal and agentic behaviors. The study also found moderate to strong associations between male salient adjectives and male names, though career and family words did not distinctly capture gender biases. These findings align with prior research on societal norms and stereotypes, reinforcing the notion that LLM-generated text reflects existing biases.
摘要:本研究以GPT-4对六门学科的教师评价为研究对象,考察了基于大型语言模型(LLM)的高等教育教师评价中的性别偏见。通过应用包括优势比(OR)分析、词语嵌入联想测试(Weat)、情感分析和语境分析在内的综合分析框架,本研究确定了反映社会刻板印象的性别相关语言的模式。具体地说,与平易近人和支持相关的词汇更多地用于女性教练,而与娱乐相关的词汇主要用于男性教练,这与公共行为和代理行为的概念一致。这项研究还发现,男性突出形容词和男性名字之间存在中等到强烈的关联,尽管职业和家庭词汇并不明显地体现出性别偏见。这些发现与之前关于社会规范和刻板印象的研究相一致,强化了LLM生成的文本反映了现有偏见的概念。
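摘要中用到的优势比(Odds Ratio)分析可以用下面的小例子示意(词频数字为虚构,加 0.5 为常见的 Haldane-Anscombe 平滑,并非论文给出的具体数值):

```python
import math

def odds_ratio(count_female, total_female, count_male, total_male):
    """某个词在女性教师评语与男性教师评语中出现的优势比(加 0.5 平滑避免除零)。"""
    a = count_female + 0.5
    b = total_female - count_female + 0.5
    c = count_male + 0.5
    d = total_male - count_male + 0.5
    or_value = (a / b) / (c / d)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)          # log-OR 的标准误
    ci = (math.exp(math.log(or_value) - 1.96 * se),
          math.exp(math.log(or_value) + 1.96 * se))
    return or_value, ci

# 假设的词频统计: 某个"平易近人"类词在两组各 5 万条评语中出现的次数
print(odds_ratio(count_female=1200, total_female=50000,
                 count_male=400,   total_male=50000))
```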

[NLP-52] A Simple HMM with Self-Supervised Representations for Phone Segmentation
[NLP-52] 用于电话分割的具有自我监督表示的简单Markov

链接: https://arxiv.org/abs/2409.09646
作者: Gene-Ping Yang,Hao Tang
关键词-EN: segmentation remains challenging, unsupervised phonetic segmentation, remains challenging, phonetic segmentation remains, unsupervised phonetic
关键词-ZH: 分割仍然具有挑战性,无监督语音分割,仍然具有挑战性,语音分割仍然,无监督语音
类目: Computation and Language (cs.CL)
备注: Accepted to SLT 2024

点击查看摘要

Abstract:Despite the recent advance in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, with the hope that the improvement can transfer to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation allowing versatile design adaptations.
摘要:尽管自监督表示近来取得了进展,无监督音素切分仍然具有挑战性。大多数方法专注于通过自监督学习改进语音表示,并希望这种改进能够迁移到音素切分上。与近期方法相反,我们在本文中表明,梅尔频谱图上的峰值检测是一个强大的基线,优于许多自监督方法。基于这一发现,我们提出了一个简单的隐马尔可夫模型,利用自监督表示及边界处的特征进行音素切分。结果表明,相比之前的方法我们取得了一致的改进,且这一通用形式允许灵活的设计调整。
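
下面给出一个示意性代码草图(非论文原实现),仅用于说明“在梅尔频谱图上做峰值检测得到候选音素边界”这一基线思路;其中的采样率、帧移与峰值检测参数均为假设的示例值。

```python
# 示意代码:基于梅尔频谱图逐帧变化率的峰值检测,作为音素边界的朴素基线。
# 假设:窗长、帧移等参数均为示例值,并非论文中的实际设置。
import librosa
import numpy as np
from scipy.signal import find_peaks

y, sr = librosa.load("utterance.wav", sr=16000)            # 读取语音
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                     hop_length=160)        # 10ms 帧移
log_mel = librosa.power_to_db(mel)                          # 对数梅尔谱

# 相邻帧的谱差作为“变化率”曲线,其峰值视为候选音素边界
delta = np.linalg.norm(np.diff(log_mel, axis=1), axis=0)
peaks, _ = find_peaks(delta, distance=3, prominence=np.std(delta))
boundaries_sec = peaks * 160 / sr                            # 换算为秒
print(boundaries_sec[:10])
```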

[NLP-53] Towards understanding evolution of science through language model series
[NLP-53] 通过语言模型系列理解科学的进化

链接: https://arxiv.org/abs/2409.09636
作者: Junjie Dong,Zhuoqi Lyu,Qing Ke
关键词-EN: language models designed, models designed specifically, designed specifically, specifically to capture, capture the temporal
关键词-ZH: 设计的语言模型,专门设计的模型,专门设计的,专门为了捕捉,捕捉时态
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL)
备注:

点击查看摘要

Abstract:We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenizations and “one model to rule them all”, AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full-text of 1.7 million arXiv papers published until 2008 and a collection of progressively trained models on arXiv papers at an annual basis. We demonstrate the effectiveness of AnnualBERT models by showing that they not only have comparable performances in standard tasks but also achieve state-of-the-art performances on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then utilize probing tasks to quantify the models’ behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models to not only improve performances on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of the models is available at this https URL.
摘要:我们介绍了AnnualBERT,这是一系列专门为捕捉科学文本的时间演变而设计的语言模型。与当前流行的子词切分和“一个模型通吃”的范式不同,AnnualBERT采用整词作为词元,由一个基础RoBERTa模型和一组逐年递进训练的模型组成:前者在截至2008年发表的170万篇arXiv论文全文上从零开始预训练,后者则按年度在arXiv论文上继续训练。我们证明了AnnualBERT模型的有效性:它们不仅在标准任务上具有可比的性能,而且在特定领域的NLP任务以及arXiv引文网络的链接预测任务上达到了最先进的性能。随后,我们利用探测任务来量化模型随时间推移在表示学习与遗忘方面的行为。我们的方法使预训练模型不仅能提升科学文本处理任务的性能,还能为科学话语随时间的发展提供洞见。该系列模型可在此https URL获取。

[NLP-54] Confidence Estimation for LLM-Based Dialogue State Tracking
[NLP-54] 基于LLM的对话状态跟踪的置信度估计

链接: https://arxiv.org/abs/2409.09629
作者: Yi-Jyun Sun,Suvodip Dey,Dilek Hakkani-Tur,Gokhan Tur
关键词-EN: critical for Conversational, large language models, preventing over-reliance, outputs is critical, large language
关键词-ZH: 对于对话式大型语言模型至关重要,防止过度依赖,输出至关重要,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Estimation of a model’s confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
摘要:在基于大语言模型(LLM)的对话式人工智能系统中,估计模型对其输出的置信度至关重要,尤其是为了减少幻觉和防止过度依赖。在这项工作中,我们详尽地探索了各种方法,包括针对开放权重和封闭权重LLM提出的方法,旨在量化并利用模型不确定性来提高LLM生成响应的可靠性,尤其关注面向任务的对话系统(TODS)中的对话状态跟踪(DST)。无论模型类型如何,良好校准的置信度分数对于处理不确定性都至关重要,从而提高模型性能。我们评估了四种估计置信度分数的方法,分别基于softmax、原始词元分数、言语化置信度以及这些方法的组合,并使用曲线下面积(AUC)指标来评估校准效果,AUC越高表明校准越好。我们还通过为封闭模型提出的自我探测机制对其加以增强。此外,我们使用针对DST任务微调的开放权重模型来评估这些方法,获得了更优的联合目标准确率(JGA)。我们的发现还表明,微调开放权重LLM可以提升AUC性能,表明置信度分数校准更好。
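
下面是一个示意性草图(非论文原实现),演示如何把softmax概率作为置信度,并用AUC衡量其与预测正确性的一致性;其中的logits与正误标签均为虚构示例。

```python
# 示意代码:用 softmax 概率作为预测置信度,并用 AUC 衡量其校准程度。
# 假设:logits 与 correct 标签均为示例数据,与论文的具体实现无关。
import numpy as np
from sklearn.metrics import roc_auc_score

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# 每个样本为模型输出的 logits;置信度取预测类别的概率
logits = [np.array([2.1, 0.3, -1.0]), np.array([0.2, 0.1, 0.0]),
          np.array([3.5, -2.0, -2.0]), np.array([0.5, 0.4, 0.3])]
confidence = np.array([softmax(l).max() for l in logits])

# correct=1 表示该预测正确;AUC 越高说明置信度与正确性越一致
correct = np.array([1, 0, 1, 0])
print("AUC =", roc_auc_score(correct, confidence))
```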

[NLP-55] Enhancing Text Annotation through Rationale-Driven Collaborative Few-Shot Prompting
[NLP-55] 通过理据驱动的协作式少样本提示增强文本标注

链接: https://arxiv.org/abs/2409.09615
作者: Jianfei Wu,Xubin Wang,Weijia Jia
关键词-EN: human bias, data annotation process, susceptible to human, complicates the management, management of increasingly
关键词-ZH: 人为的偏见、数据注释过程,容易受到人为的影响,使管理变得复杂,管理的日益
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The traditional data annotation process is often labor-intensive, time-consuming, and susceptible to human bias, which complicates the management of increasingly complex datasets. This study explores the potential of large language models (LLMs) as automated data annotators to improve efficiency and consistency in annotation tasks. By employing rationale-driven collaborative few-shot prompting techniques, we aim to improve the performance of LLMs in text annotation. We conduct a rigorous evaluation of six LLMs across four benchmark datasets, comparing seven distinct methodologies. Our results demonstrate that collaborative methods consistently outperform traditional few-shot techniques and other baseline approaches, particularly in complex annotation tasks. Our work provides valuable insights and a robust framework for leveraging collaborative learning methods to tackle challenging text annotation tasks.
摘要:传统的数据标注过程通常劳动密集、耗时,且容易受到人为偏见的影响,这使得日益复杂的数据集的管理更加困难。本研究探索了大型语言模型(LLM)作为自动数据标注器的潜力,以提高标注任务的效率和一致性。通过采用理据驱动的协作式少样本提示技术,我们旨在提高LLM在文本标注中的性能。我们在四个基准数据集上对六个LLM进行了严格评估,比较了七种不同的方法。结果表明,协作式方法始终优于传统的少样本技术和其他基线方法,尤其是在复杂的标注任务中。我们的工作为利用协作学习方法解决具有挑战性的文本标注任务提供了宝贵的见解和一个稳健的框架。

[NLP-56] Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
[NLP-56] 重新思考KenLM:面向大型网络语料库高效文本质量过滤的好/坏模型集成

链接: https://arxiv.org/abs/2409.09613
作者: Yungi Kim,Hyunsoo Ha,Sukyung Lee,Jihoo Kim,Seonghoon Yang,Chanjun Park
关键词-EN: efficiently filtering large, filtering large web, large web corpora, train large language, large language models
关键词-ZH: 有效过滤大型、过滤大型网络、大型网络库、训练大型语言、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal computational overhead for resource-constrained environments.
摘要:随着训练大型语言模型(LLM)对大量高质量数据的需求不断增加,高效过滤大型网络语料库已成为一项关键挑战。为此,KenLM被广泛使用,它是一种在CPU上运行的轻量级n元语法语言模型。然而,训练KenLM的传统方法仅利用高质量数据,因此没有显式学习低质量数据的语言模式。为了解决这个问题,我们提出了一种利用两个对比鲜明的KenLM的集成方法:(i)Good KenLM,在高质量数据上训练;(ii)Bad KenLM,在低质量数据上训练。实验结果表明,与传统的KenLM训练方法相比,我们的方法在保留高质量内容的同时显著减少了噪声内容。这表明我们的方法可以成为资源受限环境下一种计算开销极小的实用解决方案。
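
下面给出一个示意性草图(非论文原实现),演示“好/坏”两个KenLM模型按困惑度差值联合打分过滤文本的思路;模型文件名与阈值均为假设的示例。

```python
# 示意代码:用“好/坏”两个 KenLM 模型的困惑度差值对网页文本打分过滤。
# 假设:good.arpa / bad.arpa 为事先训练好的 n-gram 模型文件,阈值仅为示例。
import kenlm

good_lm = kenlm.Model("good.arpa")   # 在高质量语料上训练
bad_lm = kenlm.Model("bad.arpa")     # 在低质量语料上训练

def keep(text: str, threshold: float = 0.0) -> bool:
    # 困惑度越低表示越接近对应语料;分数为“更像好语料”的程度
    score = bad_lm.perplexity(text) - good_lm.perplexity(text)
    return score > threshold

docs = ["A well formed sentence about science.", "buy cheap !!! click here $$$"]
print([keep(d) for d in docs])
```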

[NLP-57] Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
[NLP-57] 迈向以数据为中心的RLHF:用于偏好数据集比较的简单指标

链接: https://arxiv.org/abs/2409.09603
作者: Judy Hanwen Shen,Archit Sharma,Jun Qin
关键词-EN: aligning language models, goal of aligning, aligning language, preferences requires data, human preferences requires
关键词-ZH: 调整语言模型,调整目标,调整语言,偏好需要数据,人类偏好需要
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Working Paper

点击查看摘要

Abstract:The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money can be spent carefully collecting and tailoring bespoke preference data to each downstream application. However, in practice, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no existing efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.
摘要:使语言模型与人类偏好保持一致的目标需要揭示这些偏好的数据。理想情况下,可以花时间和金钱仔细收集和定制每个下游应用程序的定制首选项数据。然而,在实践中,选择一些公开可用的偏好数据集经常被用来训练从人类反馈的强化学习(RLHF)的奖励模型。虽然新的偏好数据集正在以越来越高的频率推出,但目前还没有对这些数据集进行衡量和比较的现有努力。本文从尺度、标签噪声和信息量三个角度对偏好数据集进行了系统的研究。我们为这些视角中的每一个提出了具体的衡量标准,并揭示了不同的比较轴,以更好地理解偏好数据集。我们的工作是迈向以数据为中心的方法的第一步,通过提供有助于RLHF培训效率和迭代数据收集的视角。

[NLP-58] Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
[NLP-58] 通过软成对准确率提高自动指标人工评估的统计显著性

链接: https://arxiv.org/abs/2409.09598
作者: Brian Thompson,Nitika Mathur,Daniel Deutsch,Huda Khayrallah
关键词-EN: emulates human judgments, Soft Pairwise Accuracy, human judgments, Pairwise Accuracy, automatic metric judgments
关键词-ZH: 模拟人类判断、软成对准确性、人类判断、成对准确性、自动指标判断
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Selecting an automatic metric that best emulates human judgments is often non-trivial, because there is no clear definition of “best emulates.” A meta-metric is required to compare the human judgments to the automatic metric judgments, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric judgments. SPA allows for more fine-grained comparisons between systems than a simplistic binary win/loss, and addresses a number of shortcomings with PA: it is more stable with respect to both the number of systems and segments used for evaluation, it mitigates the issue of metric ties due to quantization, and it produces more statistically significant results. SPA was selected as the official system-level metric for the 2024 WMT metric shared task.
摘要:选择最能模拟人类判断的自动指标通常并不简单,因为“最佳模拟”没有明确的定义。需要一个元指标来把人类判断与自动指标判断进行比较,而指标排名取决于元指标的选择。我们提出了软成对准确率(SPA),这是一种新的元指标,它建立在成对准确率(PA)的基础上,但同时纳入了人类判断和指标判断的统计显著性。与简单化的二元赢/输相比,SPA允许在系统之间进行更细粒度的比较,并解决了PA的多个缺点:它对用于评估的系统数量和文本段数量都更稳定,缓解了由量化导致的指标分数并列问题,并能产生更具统计显著性的结果。SPA被选为2024年WMT指标共享任务的官方系统级指标。
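
下面的示意代码只实现了普通的成对准确率(PA),即SPA所基于的基线;SPA在此之上进一步引入两类判断的统计显著性加权,这里不做复现。系统得分为虚构示例。

```python
# 示意代码:计算普通的成对准确率(PA)。SPA 在其基础上加入显著性加权,此处不复现。
# 假设:human / metric 为各系统的人评分数与自动指标分数,均为示例数据。
from itertools import combinations

human  = {"sysA": 0.71, "sysB": 0.69, "sysC": 0.55}
metric = {"sysA": 0.82, "sysB": 0.85, "sysC": 0.60}

def pairwise_accuracy(human, metric):
    agree, total = 0, 0
    for a, b in combinations(human, 2):
        h = human[a] - human[b]
        m = metric[a] - metric[b]
        agree += (h * m > 0)   # 两种排序方向一致则记为一致
        total += 1
    return agree / total

print("PA =", pairwise_accuracy(human, metric))
```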

[NLP-59] ValueCompass: A Framework of Fundamental Values for Human-AI Alignment
[NLP-59] ValueCompass:人机一致的基本价值框架

链接: https://arxiv.org/abs/2409.09586
作者: Hua Shen,Tiffany Knearem,Reshmi Ghosh,Yu-Ju Yang,Tanushree Mitra,Yun Huang
关键词-EN: increasingly critical, diverse range, range of individuals, alignment, Choose Own Goals
关键词-ZH: 日益重要、多样化、个人范围、一致、选择自己的目标
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As AI systems become more advanced, ensuring their alignment with a diverse range of individuals and societal values becomes increasingly critical. But how can we capture fundamental human values and assess the degree to which AI systems align with them? We introduce ValueCompass, a framework of fundamental values, grounded in psychological theory and a systematic review, to identify and evaluate human-AI alignment. We apply ValueCompass to measure the value alignment of humans and language models (LMs) across four real-world vignettes: collaborative writing, education, public sectors, and healthcare. Our findings uncover risky misalignment between humans and LMs, such as LMs agreeing with values like “Choose Own Goals”, which are largely disagreed by humans. We also observe values vary across vignettes, underscoring the necessity for context-aware AI alignment strategies. This work provides insights into the design space of human-AI alignment, offering foundations for developing AI that responsibly reflects societal values and ethics.
摘要:随着人工智能系统变得越来越先进,确保它们与不同的个人和社会价值观保持一致变得越来越重要。但我们如何才能捕捉到人类的基本价值观,并评估人工智能系统与这些价值观保持一致的程度呢?我们引入了ValueCompass,这是一个基于心理学理论和系统回顾的基本价值观框架,用于识别和评估人类与人工智能的一致性。我们应用ValueCompass来衡量人类和语言模型(LMS)在四个现实世界中的价值一致性:协作写作、教育、公共部门和医疗保健。我们的发现揭示了人类和LMS之间危险的错位,例如LMS同意像“选择自己的目标”这样的价值观,而这在很大程度上是人类不同意的。我们还观察到不同小插曲的价值观不同,强调了背景感知人工智能对齐战略的必要性。这项工作为人类-人工智能对齐的设计空间提供了见解,为开发负责任地反映社会价值观和伦理的人工智能提供了基础。

[NLP-60] RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
[NLP-60] RethinkMCTS:在蒙特卡洛树搜索中提炼错误思想以生成代码

链接: https://arxiv.org/abs/2409.09584
作者: Qingyao Li,Wei Xia,Kounianhua Du,Xinyi Dai,Ruiming Tang,Yasheng Wang,Yong Yu,Weinan Zhang
关键词-EN: LLM agents enhanced, yielded notable performances, LLM agents, search, tree search algorithms
关键词-ZH: LLM代理增强,产生了显着的性能,LLM代理,搜索,树搜索算法
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:LLM agents enhanced by tree search algorithms have yielded notable performances in code generation. However, current search algorithms in this domain suffer from low search quality due to several reasons: 1) Ineffective design of the search space for the high-reasoning demands of code generation tasks, 2) Inadequate integration of code feedback with the search algorithm, and 3) Poor handling of negative feedback during the search, leading to reduced search efficiency and quality. To address these challenges, we propose to search for the reasoning process of the code and use the detailed feedback of code execution to refine erroneous thoughts during the search. In this paper, we introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS) algorithm to conduct thought-level searches before generating code, thereby exploring a wider range of strategies. More importantly, we construct verbal feedback from fine-grained code execution feedback to refine erroneous thoughts during the search. This ensures that the search progresses along the correct reasoning paths, thus improving the overall search quality of the tree by leveraging execution feedback. Through extensive experiments, we demonstrate that RethinkMCTS outperforms previous search-based and feedback-based code generation baselines. On the HumanEval dataset, it improves the pass@1 of GPT-3.5-turbo from 70.12 to 89.02 and GPT-4o-mini from 87.20 to 94.51. It effectively conducts more thorough exploration through thought-level searches and enhances the search quality of the entire tree by incorporating rethink operation.
摘要:由树搜索算法增强的LLM代理在代码生成方面取得了显著的性能。然而,目前该领域的搜索算法存在搜索质量不高的问题,原因有:1)搜索空间的设计不能满足代码生成任务的高推理需求;2)代码反馈与搜索算法的结合不够充分;3)搜索过程中对负反馈处理不力,导致搜索效率和质量下降。为了应对这些挑战,我们建议搜索代码的推理过程,并使用代码执行的详细反馈来提炼搜索过程中的错误想法。在本文中,我们引入了ReThink MCTS,它使用蒙特卡洛树搜索(MCTS)算法在生成代码之前进行思想层搜索,从而探索更广泛的策略。更重要的是,我们从细粒度的代码执行反馈中构建语言反馈,以提炼搜索过程中的错误想法。这确保了搜索沿着正确的推理路径进行,从而通过利用执行反馈来提高树的整体搜索质量。通过大量的实验,我们证明了ReThink MCTS的性能优于以前的基于搜索和基于反馈的代码生成基线。在HumanEval数据集上,它将GPT-3.5-TURBO的PASS@1从70.12改进到89.02,将GPT-40-mini的PASS@1从87.20改进到94.51。它通过思想层面的搜索有效地进行了更深入的探索,并通过结合重新思考操作来提高整个树的搜索质量。

[NLP-61] Thesis proposal: Are We Losing Textual Diversity to Natural Language Processing?
[NLP-61] 论文提案:我们是否正在因自然语言处理而失去文本多样性?

链接: https://arxiv.org/abs/2409.09568
作者: Josef Jon
关键词-EN: Natural Language Processing, widely used Natural, Natural Language, Processing algorithms possibly, handle and produce
关键词-ZH: 自然语言处理,广泛使用的自然、自然语言、处理算法,处理和产生
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This thesis argues that the currently widely used Natural Language Processing algorithms possibly have various limitations related to the properties of the texts they handle and produce. With the wide adoption of these tools in rapid progress, we must ask what these limitations are and what are the possible implications of integrating such tools even more deeply into our daily lives. As a testbed, we have chosen the task of Neural Machine Translation (NMT). Nevertheless, we aim for general insights and outcomes, applicable even to current Large Language Models (LLMs). We ask whether the algorithms used in NMT have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of untypical texts. To explore this hypothesis, we define a set of measures to quantify text diversity based on its statistical properties, like uniformity or rhythmicity of word-level surprisal, on multiple scales (sentence, discourse, language). We then conduct a series of experiments to investigate whether NMT systems struggle with maintaining the diversity of such texts, potentially reducing the richness of the language generated by these systems, compared to human translators. We search for potential causes of these limitations rooted in training objectives and decoding algorithms. Our ultimate goal is to develop alternatives that do not enforce uniformity in the distribution of statistical properties in the output and that allow for better global planning of the translation, taking into account the intrinsic ambiguity of the translation task.
摘要:本文认为,目前广泛使用的自然语言处理算法可能存在与其处理和生成的文本属性相关的各种局限。随着这些工具被迅速而广泛地采用,我们必须追问这些局限是什么,以及将此类工具更深地融入日常生活可能带来什么影响。我们选择神经机器翻译(NMT)作为实验平台,但目标是得到也适用于当前大型语言模型(LLM)的一般性见解和结论。我们追问NMT所用算法是否存在固有的归纳偏置,这些偏置对大多数类型的输入有利,却可能损害非典型文本的处理。为了检验这一假设,我们定义了一组基于统计特性的文本多样性度量,例如词级惊异度(surprisal)在多个尺度(句子、语篇、语言)上的均匀性或节奏性。随后我们开展一系列实验,考察与人工译者相比,NMT系统是否难以保持此类文本的多样性,从而潜在地降低其生成语言的丰富性。我们在训练目标和解码算法中寻找这些局限的潜在成因。我们的最终目标是开发不强求输出统计特性分布趋于一致、并能在考虑翻译任务内在歧义性的前提下对翻译进行更好全局规划的替代方案。
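
作为一个说明性的小例子(非论文原定义),可以用逐词惊异度的离散程度来粗略刻画其“均匀性”;以下数值与公式选择均为假设。

```python
# 示意代码:给定逐词 surprisal(-log p),用其标准差粗略刻画“均匀度”。
# 假设:surprisal 数值为示例;以标准差作为均匀度指标只是一种可能的选择。
import numpy as np

human_mt   = np.array([3.1, 7.8, 2.0, 9.5, 1.2, 6.4])   # 人工译文的逐词 surprisal
machine_mt = np.array([4.0, 4.2, 3.9, 4.1, 4.0, 4.3])   # 机器译文的逐词 surprisal

def uniformity(s):
    return 1.0 / (1.0 + np.std(s))   # 标准差越小,surprisal 越“均匀”

print("human:", uniformity(human_mt), "machine:", uniformity(machine_mt))
```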

[NLP-62] ASR Error Correction using Large Language Models
[NLP-62] 使用大型语言模型进行ASR纠错

链接: https://arxiv.org/abs/2409.09554
作者: Rao Ma,Mengjie Qian,Mark Gales,Kate Knill
关键词-EN: Automatic Speech Recognition, refining Automatic Speech, Speech Recognition, Automatic Speech, refining Automatic
关键词-ZH: 自动语音识别,精炼自动语音,语音识别,自动语音,精炼自动
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to IEEE Transactions on Audio, Speech and Language Processing

点击查看摘要

Abstract:Error correction (EC) models play a crucial role in refining Automatic Speech Recognition (ASR) transcriptions, enhancing the readability and quality of transcriptions. Without requiring access to the underlying code or model weights, EC can improve performance and provide domain adaptation for black-box ASR systems. This work investigates the use of large language models (LLMs) for error correction across diverse scenarios. 1-best ASR hypotheses are commonly used as the input to EC models. We propose building high-performance EC models using ASR N-best lists which should provide more contextual information for the correction process. Additionally, the generation process of a standard EC model is unrestricted in the sense that any output sequence can be generated. For some scenarios, such as unseen domains, this flexibility may impact performance. To address this, we introduce a constrained decoding approach based on the N-best list or an ASR lattice. Finally, most EC models are trained for a specific ASR system requiring retraining whenever the underlying ASR system is changed. This paper explores the ability of EC models to operate on the output of different ASR systems. This concept is further extended to zero-shot error correction using LLMs, such as ChatGPT. Experiments on three standard datasets demonstrate the efficacy of our proposed methods for both Transducer and attention-based encoder-decoder ASR systems. In addition, the proposed method can serve as an effective method for model ensembling.
摘要:纠错(EC)模型在改进自动语音识别(ASR)转录、提高转录的可读性和质量方面起着至关重要的作用。EC无需访问底层代码或模型权重,即可提高性能并为黑盒ASR系统提供领域自适应。这项工作研究了在不同场景下使用大型语言模型(LLM)进行纠错。1-best的ASR假设通常被用作EC模型的输入。我们建议使用ASR的N-best列表来构建高性能的EC模型,它能为纠错过程提供更多上下文信息。此外,标准EC模型的生成过程是不受限制的,可以生成任意输出序列;在某些场景(例如未见过的领域)中,这种灵活性可能影响性能。为了解决这一问题,我们引入了一种基于N-best列表或ASR词格(lattice)的约束解码方法。最后,大多数EC模型都是针对特定ASR系统训练的,一旦底层ASR系统发生变化就需要重新训练。本文探讨了EC模型处理不同ASR系统输出的能力,并进一步把这一思路扩展到使用ChatGPT等LLM的零样本纠错。在三个标准数据集上的实验证明了所提方法对Transducer和基于注意力的编码器-解码器ASR系统均有效。此外,所提方法还可作为一种有效的模型集成手段。
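
下面是一个示意性草图(非论文原实现),演示如何把ASR的N-best假设拼接成提示交给LLM做纠错;提示模板与示例句子均为假设。

```python
# 示意代码:把 ASR 的 N-best 假设拼成提示,交给 LLM 做纠错;提示模板为假想示例。
nbest = [
    "i red the book on the plain",
    "i read the book on the plane",
    "i read the book on the plain",
]

prompt = (
    "Below are the top hypotheses from an ASR system for the same utterance.\n"
    "Use the agreement and disagreement among them to output the corrected transcript.\n\n"
    + "\n".join(f"{i+1}. {h}" for i, h in enumerate(nbest))
    + "\n\nCorrected transcript:"
)
print(prompt)
# 实际使用时将 prompt 发送给所选的 LLM,通过其推理接口获取纠错结果。
```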

[NLP-63] Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens AAAI
[NLP-63] 规划Transformer:使用规划令牌的长视野离线强化学习

链接: https://arxiv.org/abs/2409.09513
作者: Joseph Clinton,Robert Lieck
关键词-EN: Supervised learning approaches, Decision Transformer, offline reinforcement learning, Supervised learning, utilizing the Decision
关键词-ZH: 监督学习方法,决策Transformer,离线强化学习,监督学习,利用决策
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 5 figures, Submitted to AAAI

点击查看摘要

Abstract:Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent’s future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model’s policy through the interpretable plan visualisations and attention map.
摘要:用于离线强化学习的监督学习方法,特别是那些使用决策转换器的方法,在连续环境和稀疏回报下表现出了有效性。然而,由于自回归模型的高组合误差,他们经常在长期任务中挣扎。为了克服这一限制,我们超越了下一个令牌预测,引入了规划令牌,它包含关于代理未来的高级别、长时间尺度的信息。通过定期预测双时间尺度标记,我们的模型可以使用这些长期计划标记作为一种隐式规划形式来指导其低层策略,减少组合误差。这一架构改进显著提高了长期任务的性能,在复杂的D4RL环境中建立了新的最先进水平。此外,我们还通过可解释的计划可视化和注意图证明了规划令牌改善了模型政策的可解释性。

[NLP-64] Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models
[NLP-64] 比较大型语言模型的隐私保护个性化的检索增强和参数高效微调

链接: https://arxiv.org/abs/2409.09510
作者: Alireza Salemi,Hamed Zamani
关键词-EN: large language models, personalizing large language, Privacy-preserving methods, language models, large language
关键词-ZH: 大型语言模型、个性化大型语言、隐私保护方法、语言模型、大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Privacy-preserving methods for personalizing large language models (LLMs) are relatively under-explored. There are two schools of thought on this topic: (1) generating personalized outputs by personalizing the input prompt through retrieval augmentation from the user’s personal information (RAG-based methods), and (2) parameter-efficient fine-tuning of LLMs per user that considers efficiency and space limitations (PEFT-based methods). This paper presents the first systematic comparison between two approaches on a wide range of personalization tasks using seven diverse datasets. Our results indicate that RAG-based and PEFT-based personalization methods on average yield 14.92% and 1.07% improvements over the non-personalized LLM, respectively. We find that combining RAG with PEFT elevates these improvements to 15.98%. Additionally, we identify a positive correlation between the amount of user data and PEFT’s effectiveness, indicating that RAG is a better choice for cold-start users (i.e., user’s with limited personal data).
摘要:针对大型语言模型(LLM)个性化的隐私保护方法研究相对较少。在这一问题上有两种思路:(1)通过从用户个人信息中检索增强来个性化输入提示,从而生成个性化输出(基于RAG的方法);(2)在考虑效率和空间限制的前提下,为每个用户对LLM进行参数高效微调(基于PEFT的方法)。本文首次使用七个不同的数据集,在广泛的个性化任务上对这两种方法进行了系统比较。结果表明,基于RAG和基于PEFT的个性化方法分别比非个性化LLM平均提升14.92%和1.07%。我们发现,将RAG与PEFT结合可以把提升进一步提高到15.98%。此外,我们发现用户数据量与PEFT的有效性之间存在正相关,这表明对于冷启动用户(即个人数据有限的用户),RAG是更好的选择。
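
下面给出一个示意性草图(非论文原实现),演示基于RAG的个性化流程:按余弦相似度从用户历史中检索top-k条目并拼入提示;其中embed()为假想的句向量函数,此处用随机向量代替。

```python
# 示意代码:基于余弦相似度从用户历史中检索 top-k 条目拼入提示(RAG 式个性化)。
# 假设:embed() 为假想的句向量函数,这里用随机向量代替,仅示意流程。
import numpy as np

rng = np.random.default_rng(0)
def embed(text: str) -> np.ndarray:        # 假想:实际应调用句向量模型
    return rng.normal(size=32)

profile = ["用户过往文档A……", "用户过往文档B……", "用户过往文档C……"]
doc_vecs = np.stack([embed(d) for d in profile])

def retrieve(query: str, k: int = 2):
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [profile[i] for i in np.argsort(-sims)[:k]]

query = "请用我的写作风格起草一封邮件"
prompt = "参考以下用户历史:\n" + "\n".join(retrieve(query)) + "\n\n任务:" + query
print(prompt)
```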

[NLP-65] Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language
[NLP-65] Uddessho:低资源孟加拉语多模式作者意图分类的广泛基准数据集

链接: https://arxiv.org/abs/2409.09504
作者: Fatema Tuj Johora Faria,Mukaffi Bin Moin,Md. Mahfuzur Rahman,Md Morshed Alam Shanto,Asif Iftekher Fahim,Md. Moinul Hoque
关键词-EN: social media posts, daily information sharing, social media, media posts, thoughts and opinions
关键词-ZH: 社交媒体帖子、每日信息共享、社交媒体、媒体帖子、想法和观点
类目: Computation and Language (cs.CL)
备注: Accepted for publication in “18th International Conference on Information Technology and Applications (ICITA 2024)”

点击查看摘要

Abstract:With the increasing popularity of daily information sharing and acquisition on the Internet, this paper introduces an innovative approach for intent classification in Bangla language, focusing on social media posts where individuals share their thoughts and opinions. The proposed method leverages multimodal data with particular emphasis on authorship identification, aiming to understand the underlying purpose behind textual content, especially in the context of varied user-generated posts on social media. Current methods often face challenges in low-resource languages like Bangla, particularly when author traits intricately link with intent, as observed in social media posts. To address this, we present the Multimodal-based Author Bangla Intent Classification (MABIC) framework, utilizing text and images to gain deeper insights into the conveyed intentions. We have created a dataset named “Uddessho,” comprising 3,048 instances sourced from social media. Our methodology comprises two approaches for classifying textual intent and multimodal author intent, incorporating early fusion and late fusion techniques. In our experiments, the unimodal approach achieved an accuracy of 64.53% in interpreting Bangla textual intent. In contrast, our multimodal approach significantly outperformed traditional unimodal methods, achieving an accuracy of 76.19%. This represents an improvement of 11.66%. To our best knowledge, this is the first research work on multimodal-based author intent classification for low-resource Bangla language social media posts.
摘要:随着互联网上日常信息共享和获取的日益普及,本文介绍了一种创新的孟加拉语意图分类方法,重点针对社交媒体帖子,在这些帖子中,个人分享他们的想法和观点。拟议的方法利用多模式数据,特别强调作者身份识别,旨在了解文本内容背后的潜在目的,特别是在社交媒体上各种用户生成帖子的背景下。在像孟加拉语这样的低资源语言中,当前的方法经常面临挑战,特别是当作者的特征与意图错综复杂地联系在一起时,就像在社交媒体帖子中观察到的那样。为了解决这个问题,我们提出了基于多模式的作者孟加拉意图分类(MABIC)框架,利用文本和图像来更深入地了解所传达的意图。我们已经创建了一个名为“Uddessho”的数据集,其中包含3,048个来自社交媒体的实例。我们的方法包括两种分类文本意图和多通道作者意图的方法,结合了早期融合和后期融合技术。在我们的实验中,单峰方法对孟加拉语篇意图的理解准确率达到了64.53%。相比之下,我们的多模式方法显著优于传统的单模式方法,达到了76.19%的准确率。这意味着提高了11.66%。据我们所知,这是首次针对低资源的孟加拉语社交媒体帖子进行基于多模式的作者意图分类的研究工作。

[NLP-66] Synthetic4Health: Generating Annotated Synthetic Clinical Letters
[NLP-66] Synthetic4Health:生成带标注的合成临床信函

链接: https://arxiv.org/abs/2409.09501
作者: Libo Ren,Samuel Belkadi,Lifeng Han,Warren Del-Pinto,Goran Nenadic
关键词-EN: medical research, clinical-related datasets, widely applied, clinical, Named Entity Recognition
关键词-ZH: 医学研究、临床相关数据集、广泛应用、临床、命名实体识别
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ongoing work, 48 pages

点击查看摘要

Abstract:Since clinical letters contain sensitive information, clinical-related datasets can not be widely applied in model training, medical research, and teaching. This work aims to generate reliable, various, and de-identified synthetic clinical letters. To achieve this goal, we explored different pre-trained language models (PLMs) for masking and generating text. After that, we worked on Bio_ClinicalBERT, a high-performing model, and experimented with different masking strategies. Both qualitative and quantitative methods were used for evaluation. Additionally, a downstream task, Named Entity Recognition (NER), was also implemented to assess the usability of these synthetic letters. The results indicate that 1) encoder-only models outperform encoder-decoder models. 2) Among encoder-only models, those trained on general corpora perform comparably to those trained on clinical data when clinical information is preserved. 3) Additionally, preserving clinical entities and document structure better aligns with our objectives than simply fine-tuning the model. 4) Furthermore, different masking strategies can impact the quality of synthetic clinical letters. Masking stopwords has a positive impact, while masking nouns or verbs has a negative effect. 5) For evaluation, BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references. 6) Contextual information does not significantly impact the models’ understanding, so the synthetic clinical letters have the potential to replace the original ones in downstream tasks.
摘要:由于临床信函包含敏感信息,临床相关数据集无法广泛应用于模型训练、医学研究和教学。这项工作旨在生成可靠、多样且去标识化的合成临床信函。为实现这一目标,我们探索了不同的预训练语言模型(PLM)来进行掩码和文本生成。随后,我们基于高性能模型Bio_ClinicalBERT开展工作,并试验了不同的掩码策略。评估同时采用了定性与定量方法。此外,我们还实现了下游任务命名实体识别(NER),以评估这些合成信函的可用性。结果表明:1)仅编码器模型优于编码器-解码器模型;2)在仅编码器模型中,只要保留临床信息,在通用语料上训练的模型与在临床数据上训练的模型性能相当;3)相比单纯微调模型,保留临床实体和文档结构更符合我们的目标;4)不同的掩码策略会影响合成临床信函的质量,掩码停用词有正面影响,而掩码名词或动词则有负面影响;5)在评估方面,应以BERTScore为主要定量评估指标,其他指标作为补充参考;6)上下文信息对模型理解没有显著影响,因此合成临床信函有潜力在下游任务中替代原始信函。
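
下面是一个示意性草图(非论文原实现),演示用Bio_ClinicalBERT的fill-mask能力为被掩码的词生成候选替换;示例句子与掩码策略均为假设。

```python
# 示意代码:用 Bio_ClinicalBERT 的 fill-mask 能力替换临床信函中被掩码的词。
# 假设:掩码策略(掩掉哪些词)由调用方决定,此处仅演示单个 [MASK] 的填充。
from transformers import pipeline

fill = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")
masked = "The patient was discharged with a prescription for [MASK] to control blood pressure."
for cand in fill(masked, top_k=3):
    print(cand["token_str"], round(cand["score"], 3))
```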

[NLP-67] Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI
[NLP-67] 让人类保持在循环中:以人为本的自动化注释与生成性AI

链接: https://arxiv.org/abs/2409.09467
作者: Nicholas Pangakis,Samuel Wolken
关键词-EN: generative large language, social media research, large language models, media research, compelling use case
关键词-ZH: 生成式大型语言、社交媒体研究、大型语言模型、媒体研究、引人注目的用例
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated text annotation is a compelling use case for generative large language models (LLMs) in social media research. Recent work suggests that LLMs can achieve strong performance on annotation tasks; however, these studies evaluate LLMs on a small number of tasks and likely suffer from contamination due to a reliance on public benchmark datasets. Here, we test a human-centered framework for responsibly evaluating artificial intelligence tools used in automated annotation. We use GPT-4 to replicate 27 annotation tasks across 11 password-protected datasets from recently published computational social science articles in high-impact journals. For each task, we compare GPT-4 annotations against human-annotated ground-truth labels and against annotations from separate supervised classification models fine-tuned on human-generated labels. Although the quality of LLM labels is generally high, we find significant variation in LLM performance across tasks, even within datasets. Our findings underscore the importance of a human-centered workflow and careful evaluation standards: Automated annotations significantly diverge from human judgment in numerous scenarios, despite various optimization strategies such as prompt tuning. Grounding automated annotation in validation labels generated by humans is essential for responsible evaluation.
摘要:在社交媒体研究中,自动文本标注是生成性大型语言模型(LLM)的一个引人注目的用例。最近的工作表明,LLMS可以在标注任务中获得良好的性能;然而,这些研究在少数任务上评估LLMS,并且由于依赖公共基准数据集而可能受到污染。在这里,我们测试了一个以人为中心的框架,用于负责任地评估自动标注中使用的人工智能工具。我们使用GPT-4在11个受密码保护的数据集上复制27个注释任务,这些任务来自最近在高影响力期刊上发表的计算社会科学文章。对于每个任务,我们将GPT-4注释与人类注释的基本事实标签进行比较,并与根据人类生成的标签进行微调的单独监督分类模型的注释进行比较。尽管LLM标签的质量通常很高,但我们发现LLM性能在不同任务之间存在显著差异,甚至在数据集内也是如此。我们的发现强调了以人为中心的工作流程和仔细的评估标准的重要性:尽管有各种优化策略,如即时调整,但在许多情况下,自动注释与人的判断明显不同。在人工生成的验证标签中建立自动注释对于负责任的评估至关重要。
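
下面给出一个示意性草图(非论文原实现),演示把LLM标注与人工金标对比并报告准确率与Cohen's kappa;标签数据为虚构示例。

```python
# 示意代码:把 LLM 标注与人工金标对比,报告准确率与 Cohen's kappa;数据为虚构示例。
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = ["toxic", "not", "toxic", "not", "toxic", "not"]
llm_labels   = ["toxic", "not", "not",   "not", "toxic", "toxic"]

print("accuracy:", accuracy_score(human_labels, llm_labels))
print("kappa   :", cohen_kappa_score(human_labels, llm_labels))
```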

[NLP-68] Rethinking the Influence of Source Code on Test Case Generation
[NLP-68] 重新思考源代码对测试用例生成的影响

链接: https://arxiv.org/abs/2409.09464
作者: Dong Huang,Jie M. Zhang,Mingzhe Du,Mark Harman,Heming Cui
关键词-EN: Large language models, Large language, assist test generation, code, language models
关键词-ZH: 大型语言模型、大型语言、辅助测试生成、代码、语言模型
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 23 pages

点击查看摘要

Abstract:Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context. This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open- and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regression, but on early-stage immature code, it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs resilience against incorrect code in generating reliable and bug-revealing tests.
摘要:大型语言模型(LLM)已被广泛应用于以被测源代码为上下文辅助测试生成。本文旨在回答这样一个问题:如果被测试的源代码不正确,LLM在生成测试时会被误导吗?测试用例的有效性由它们的准确性、覆盖率和错误检测有效性来衡量。我们在四个数据集上对五个开放源代码和六个封闭源代码的LLM进行的评估结果表明,错误的代码可能会严重误导LLM生成正确、高覆盖率和揭示错误的测试。例如,在HumanEval数据集中,当提供任务描述和正确代码时,LLMS达到80.45%的测试准确率,但当提供任务描述和错误代码时,仅达到57.12%。对于应用程序数据集,代码正确的提示测试可以检测39.85%的错误,而代码不正确的提示只能检测19.61%的错误。这些发现对基于LLM的测试的部署具有重要意义:在成熟代码上使用它可能有助于防止未来的回归,但在早期阶段的不成熟代码上,它可能只会导致错误。我们的发现也强调了进一步研究的必要性,以提高LLMS在生成可靠和揭示错误的测试时对不正确代码的弹性。

[NLP-69] Enhancing LLM Problem Solving with REAP: Reflection Explicit Problem Deconstruction and Advanced Prompting
[NLP-69] 用REAP增强LLM问题求解:反思、显式问题解构与高级提示

链接: https://arxiv.org/abs/2409.09415
作者: Ryan Lingo,Martin Arroyo,Rajeev Chhajer
关键词-EN: Large Language Models, natural language processing, transformed natural language, Large Language, Explicit Problem Deconstruction
关键词-ZH: 大型语言模型、自然语言处理、转换自然语言、大型语言、显式问题解构
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 524 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed natural language processing, yet improving their problem-solving capabilities, particularly for complex, reasoning-intensive tasks, remains a persistent challenge. This paper introduces the REAP (Reflection, Explicit Problem Deconstruction, and Advanced Prompting) method, an innovative approach within the dynamic context generation framework. REAP guides LLMs through reflection on the query, deconstructing it into manageable components, and generating relevant context to enhance the solution process. We evaluated REAP using a dataset designed to expose LLM limitations, comparing zero-shot prompting with REAP-enhanced prompts across six state-of-the-art models: OpenAI’s o1-preview, o1-mini, GPT-4o, GPT-4o-mini, Google’s Gemini 1.5 Pro, and Claude 3.5 Sonnet. The results demonstrate notable performance gains, with o1-mini improving by 40.97%, GPT-4o by 66.26%, and GPT-4o-mini by 112.93%. Despite the already strong baseline performance of OpenAI’s o1-preview, modest gains were observed. Beyond performance improvements, REAP offers a cost-effective solution; for example, GPT-4o-mini, which is approximately 100 times cheaper than o1-preview, delivered competitive results. REAP also improves the clarity of model outputs, making it easier for humans to understand the reasoning behind the results and simplifying the process of identifying and addressing any issues. These findings demonstrate REAP’s potential to greatly improve the capabilities of LLMs, providing both better performance and increased cost-efficiency across a wide range of applications.
摘要:大型语言模型(LLM)已经改变了自然语言处理,但提升它们解决问题的能力,特别是在复杂、推理密集型任务上,仍然是一个长期挑战。本文介绍了动态上下文生成框架中的一种创新方法:REAP(反思、显式问题解构和高级提示)。REAP引导LLM对查询进行反思,将其解构为可管理的组成部分,并生成相关上下文以增强求解过程。我们使用旨在暴露LLM局限性的数据集对REAP进行了评估,在六个最先进的模型上比较了零样本提示与REAP增强提示:OpenAI的o1-preview、o1-mini、GPT-4o、GPT-4o-mini,Google的Gemini 1.5 Pro,以及Claude 3.5 Sonnet。结果表明性能显著提升,o1-mini提高了40.97%,GPT-4o提高了66.26%,GPT-4o-mini提高了112.93%。尽管OpenAI的o1-preview基线表现已经很强,仍观察到小幅提升。除了性能改进,REAP还提供了一个具有成本效益的解决方案;例如,价格大约比o1-preview便宜100倍的GPT-4o-mini也给出了有竞争力的结果。REAP还提高了模型输出的清晰度,使人们更容易理解结果背后的推理,并简化了发现和解决问题的过程。这些发现表明,REAP有潜力大幅提升LLM的能力,在广泛的应用中同时带来更好的性能和更高的成本效益。

[NLP-70] Constructive Approach to Bidirectional Causation between Qualia Structure and Language Emergence
[NLP-70] 感质结构与语言涌现之间双向因果关系的建构性研究

链接: https://arxiv.org/abs/2409.09413
作者: Tadahiro Taniguchi,Masafumi Oizumi,Noburo Saji,Takato Horii,Naotsugu Tsuchiya
关键词-EN: termed qualia structure, language emergence, language, internal representations, relational structure
关键词-ZH: 称为感觉结构、语言涌现、语言、内部表示、关系结构
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 Figures

点击查看摘要

Abstract:This paper presents a novel perspective on the bidirectional causation between language emergence and relational structure of subjective experiences, termed qualia structure, and lays out the constructive approach to the intricate dependency between the two. We hypothesize that languages with distributional semantics, e.g., syntactic-semantic structures, may have emerged through the process of aligning internal representations among individuals, and such alignment of internal representations facilitates more structured language. This mutual dependency is suggested by the recent advancements in AI and symbol emergence robotics, and collective predictive coding (CPC) hypothesis, in particular. Computational studies show that neural network-based language models form systematically structured internal representations, and multimodal language models can share representations between language and perceptual information. This perspective suggests that language emergence serves not only as a mechanism creating a communication tool but also as a mechanism for allowing people to realize shared understanding of qualitative experiences. The paper discusses the implications of this bidirectional causation in the context of consciousness studies, linguistics, and cognitive science, and outlines future constructive research directions to further explore this dynamic relationship between language emergence and qualia structure.
摘要:本文对语言涌现与主观经验的关系结构(即感质结构,qualia structure)之间的双向因果关系提出了一种新的视角,并给出了探究两者之间复杂依赖关系的建构性方法。我们假设,具有分布式语义(例如句法-语义结构)的语言,可能是通过个体之间对齐内部表征的过程而涌现的,而这种内部表征的对齐又促进了更具结构性的语言。人工智能和符号涌现机器人学的最新进展,特别是集体预测编码(CPC)假设,提示了这种相互依赖关系。计算研究表明,基于神经网络的语言模型会形成系统性结构化的内部表征,多模态语言模型则可以在语言和感知信息之间共享表征。这一视角表明,语言涌现不仅是创造交流工具的机制,也是让人们实现对质性经验的共同理解的机制。本文讨论了这种双向因果关系对意识研究、语言学和认知科学的意义,并勾勒了未来的建构性研究方向,以进一步探索语言涌现与感质结构之间的动态关系。

[NLP-71] Towards Diverse and Efficient Audio Captioning via Diffusion Models
[NLP-71] 迈向基于扩散模型的多样化且高效的音频字幕

链接: https://arxiv.org/abs/2409.09401
作者: Manjie Xu,Chenxing Li,Xinyi Tu,Yong Ren,Ruibo Fu,Wei Liang,Dong Yu
关键词-EN: introduce Diffusion-based Audio, efficient audio captioning, Diffusion-based Audio Captioning, tailored for diverse, diverse and efficient
关键词-ZH: 引入基于扩散的音频,高效的音频字幕,基于扩散的音频字幕,为多元化、多元化和高效定制
类目: Computation and Language (cs.CL)
备注: this https URL

点击查看摘要

Abstract:We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive diffusion model tailored for diverse and efficient audio captioning. Although existing captioning models relying on language backbones have achieved remarkable success in various captioning tasks, their insufficient performance in terms of generation speed and diversity impede progress in audio understanding and multimedia applications. Our diffusion-based framework offers unique advantages stemming from its inherent stochasticity and holistic context modeling in captioning. Through rigorous evaluation, we demonstrate that DAC not only achieves SOTA performance levels compared to existing benchmarks in the caption quality, but also significantly outperforms them in terms of generation speed and diversity. The success of DAC illustrates that text generation can also be seamlessly integrated with audio and visual generation tasks using a diffusion backbone, paving the way for a unified, audio-related generative model across different modalities.
摘要:我们介绍了基于扩散的音频字幕(DAC),这是一种为多样化和高效的音频字幕量身定制的非自回归扩散模型。尽管现有依赖语言骨干网络的字幕模型在各类字幕任务中取得了显著成功,但其在生成速度和多样性方面的不足阻碍了音频理解和多媒体应用的进步。我们基于扩散的框架凭借其固有的随机性和对字幕的整体上下文建模,具备独特的优势。通过严格的评估,我们证明DAC不仅在字幕质量上达到了与现有基准相当的SOTA水平,而且在生成速度和多样性方面显著优于它们。DAC的成功表明,文本生成也可以借助扩散主干与音频和视觉生成任务无缝集成,为跨不同模态的统一音频相关生成模型铺平道路。

[NLP-72] LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach
[NLP-72] 基于LLM的集成学习用于论文来源追踪:一种无需GPU的方法

链接: https://arxiv.org/abs/2409.09383
作者: Kunlong Chen,Junjun Wang,Zhaoqun Chen,Kunjin Chen,Yitian Chen
关键词-EN: KDD CUP, source tracing competition, paper source tracing, tracing competition, source tracing
关键词-ZH: KDD CUP、来源追踪竞赛、纸质来源追踪、追踪竞赛、来源追踪
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We participated in the KDD CUP 2024 paper source tracing competition and achieved the 3rd place. This competition tasked participants with identifying the reference sources (i.e., ref-sources, as referred to by the organizers of the competition) of given academic papers. Unlike most teams that addressed this challenge by fine-tuning pre-trained neural language models such as BERT or ChatGLM, our primary approach utilized closed-source large language models (LLMs). With recent advancements in LLM technology, closed-source LLMs have demonstrated the capability to tackle complex reasoning tasks in zero-shot or few-shot scenarios. Consequently, in the absence of GPUs, we employed closed-source LLMs to directly generate predicted reference sources from the provided papers. We further refined these predictions through ensemble learning. Notably, our method was the only one among the award-winning approaches that did not require the use of GPUs for model training. Code available at this https URL.
摘要:我们参加了KDD杯2024论文溯源大赛,获得了第三名。这项竞赛的任务是让参赛者确定特定学术论文的参考资料来源(即竞赛组织者所指的参考资料来源)。与大多数团队通过微调预先训练的神经语言模型(如BERT或ChatGLM)来应对这一挑战不同,我们的主要方法使用了封闭源代码的大型语言模型(LLM)。随着LLM技术的最新进步,闭源LLM已经展示了在零射击或少射击场景下处理复杂推理任务的能力。因此,在没有图形处理器的情况下,我们使用闭源LLMS从所提供的论文中直接生成预测的参考源。我们通过集成学习进一步完善了这些预测。值得注意的是,我们的方法是获奖方法中唯一不需要使用GPU进行模型培训的方法。此HTTPS URL上提供的代码。

[NLP-73] Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM
[NLP-73] 通过两阶段前置增强型多模式LLM为电影生成事件导向的归因

链接: https://arxiv.org/abs/2409.09362
作者: Yuanjie Lyu,Tong Xu,Zihan Niu,Bo Peng,Jing Ke,Enhong Chen
关键词-EN: social media platforms, semantic-rich services, prosperity of social, social media, media platforms
关键词-ZH: 社交媒体平台、语义丰富的服务、社交、社交媒体、媒体平台的繁荣
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The prosperity of social media platforms has raised the urgent demand for semantic-rich services, e.g., event and storyline attribution. However, most existing research focuses on clip-level event understanding, primarily through basic captioning tasks, without analyzing the causes of events across an entire movie. This is a significant challenge, as even advanced multimodal large language models (MLLMs) struggle with extensive multimodal information due to limited context length. To address this issue, we propose a Two-Stage Prefix-Enhanced MLLM (TSPE) approach for event attribution, i.e., connecting associated events with their causal semantics, in movie videos. In the local stage, we introduce an interaction-aware prefix that guides the model to focus on the relevant multimodal information within a single clip, briefly summarizing the single event. Correspondingly, in the global stage, we strengthen the connections between associated events using an inferential knowledge graph, and design an event-aware prefix that directs the model to focus on associated events rather than all preceding clips, resulting in accurate event attribution. Comprehensive evaluations of two real-world datasets demonstrate that our framework outperforms state-of-the-art methods.
摘要:社交媒体平台的繁荣对事件和故事情节归属等语义丰富的服务提出了迫切的需求。然而,现有的大多数研究都集中在剪辑级别的事件理解上,主要是通过基本的字幕任务,而没有分析整个电影中事件的原因。这是一个巨大的挑战,因为由于上下文长度有限,即使是高级的多模式大型语言模型(MLLM)也难以处理大量的多模式信息。为了解决这一问题,我们提出了一种两阶段前缀增强的MLLM(TSPE)方法,用于电影视频中的事件属性,即将关联事件与其因果语义联系起来。在局部阶段,我们引入了一个交互感知前缀,引导模型专注于单个剪辑内的相关多通道信息,简要总结单个事件。相应地,在全局阶段,我们使用推理知识图加强关联事件之间的联系,并设计一个事件感知前缀,引导模型关注关联事件而不是之前的所有片段,从而获得准确的事件属性。对两个真实世界数据集的综合评估表明,我们的框架比最先进的方法性能更好。

[NLP-74] Overcoming linguistic barriers in code assistants: creating a QLoRA adapter to improve support for Russian-language code writing instructions
[NLP-74] 克服代码助手中的语言障碍:创建QLoRA适配器以改善对俄语代码编写指令的支持

链接: https://arxiv.org/abs/2409.09353
作者: C. B. Pronin,A. V. Volosova,A. V. Ostroukh,Yu. N. Strogov
关键词-EN: popular language model, Russian language, base model, model, Russian
关键词-ZH: 流行语言模型,俄语,基础模型,模型,俄语
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:In this paper, an approach to training and evaluating an adapter model for the popular language model “zephyr-7b-beta” is described. The adapter was developed to improve the performance of the base model in tasks related to programming and understanding the Russian language. Considering the high quality of the original model in tasks in the English language, the goal of the research was to expand its linguistic and technical spectrum. The proposed adapter was trained using a large and diverse dataset, including question-answer pairs related to programming, as well code-related texts in Russian language. The applied training methodology ensures an improvement in the model’s quality of answers in understanding and generating Python code based on Russian instructions. We evaluated the performance of the base model with the installed adapter using various metrics, comparing it to the base model as well as other state-of-the-art models in this field. The obtained results showed significant improvement, both in tasks related to writing Python code and in processing the Russian language, confirming the effectiveness of the proposed adapter.
摘要:本文描述了一种为流行语言模型“zephyr-7b-beta”训练和评估适配器模型的方法。开发该适配器是为了提升基础模型在编程相关任务和俄语理解任务上的性能。考虑到原始模型在英语任务上的高质量,本研究的目标是扩展其语言和技术能力范围。所提出的适配器使用一个大型且多样化的数据集训练,其中包括与编程相关的问答对以及俄语的代码相关文本。所采用的训练方法确保了模型在根据俄语指令理解和生成Python代码方面的回答质量得到提升。我们使用多种指标评估了安装适配器后的基础模型,并与基础模型及该领域其他最先进模型进行了比较。结果表明,在编写Python代码和处理俄语的任务上均有显著改进,证实了所提适配器的有效性。
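
下面是一个示意性配置草图(非论文原配置),演示用peft与bitsandbytes以QLoRA方式(4bit量化加LoRA)为zephyr-7b-beta注入适配器;r、lora_alpha、target_modules等超参数均为常见示例值。

```python
# 示意代码:对 zephyr-7b-beta 以 4bit 量化 + LoRA(即 QLoRA)方式加载并注入适配器。
# 假设:r、lora_alpha、target_modules 等超参数为常见示例值,并非论文中的实际配置。
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # 仅 LoRA 参数可训练,随后可用常规 Trainer 微调
```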

[NLP-75] Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation
[NLP-75] 自动医疗文档的大型语言模型的高效微调

链接: https://arxiv.org/abs/2409.09324
作者: Hui Yi Leong,Yi Fan Gao,Ji Shuai,Uktu Pamuksuz
关键词-EN: electronic health records, Scientific research, direct patient care, health records, desk work
关键词-ZH: 电子健康记录、科学研究、直接患者护理、健康记录、办公桌工作
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 pages, 3 Figures, 3 Tables, This is a preprint version of the article. The final version will be published in the proceedings of the IEEE conference

点击查看摘要

Abstract:Scientific research indicates that for every hour spent in direct patient care, physicians spend nearly two additional hours on administrative tasks, particularly on electronic health records (EHRs) and desk work. This excessive administrative burden not only reduces the time available for patient care but also contributes to physician burnout and inefficiencies in healthcare delivery. To address these challenges, this study introduces MediGen, a fine-tuned large language model (LLM) designed to automate the generation of medical reports from medical dialogues. By leveraging state-of-the-art methodologies for fine-tuning open-source pretrained models, including LLaMA3-8B, MediGen achieves high accuracy in transcribing and summarizing clinical interactions. The fine-tuned LLaMA3-8B model demonstrated promising results, achieving a ROUGE score of 58% and a BERTScore-F1 of 72%, indicating its effectiveness in generating accurate and clinically relevant medical reports. These findings suggest that MediGen has the potential to significantly reduce the administrative workload on physicians, improving both healthcare efficiency and physician well-being.
摘要:科学研究表明,医生在直接护理患者方面每花费一小时,就会在行政任务上额外花费近两个小时,特别是在电子健康记录(EHR)和案头工作上。这种过多的行政负担不仅减少了可用于病人护理的时间,而且还导致了医生的疲惫和医疗保健服务的低效。为了应对这些挑战,本研究引入了MediGen,这是一个微调的大型语言模型(LLM),旨在从医疗对话中自动生成医疗报告。通过利用最先进的方法微调开源预先训练的模型,包括LLaMA3-8B,MediGen在转录和总结临床交互方面实现了高精度。微调的LLaMA3-8B模型显示了良好的结果,达到了58%的Rouge评分和72%的BERTScore-F1,表明它在生成准确和临床相关的医疗报告方面是有效的。这些发现表明,MediGen有可能显著减少医生的管理工作量,提高医疗效率和医生的幸福感。
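
下面是一个示意性草图(非论文原评测脚本),演示用evaluate库计算生成报告的ROUGE与BERTScore;示例文本为虚构数据。

```python
# 示意代码:用 evaluate 库计算生成报告的 ROUGE 与 BERTScore;数据为虚构示例。
import evaluate

preds = ["Patient presents with chest pain and was given aspirin."]
refs  = ["The patient complained of chest pain; aspirin was administered."]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=preds, references=refs))
print(bertscore.compute(predictions=preds, references=refs, lang="en")["f1"])
```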

[NLP-76] A Compressive Memory-based Retrieval Approach for Event Argument Extraction
[NLP-76] 一种面向事件论元抽取的基于压缩记忆的检索方法

链接: https://arxiv.org/abs/2409.09322
作者: Wanlong Liu,Enqi Zhang,Li Zhou,Dingyi Zeng,Shaohuan Cheng,Chen Zhang,Malu Zhang,Wenyu Chen
关键词-EN: Event Argument Extraction, Argument Extraction, Event Argument, Recent works, retrieval-based EAE methods
关键词-ZH: 事件参数提取、参数提取、事件参数、最近的作品、基于检索的EAE方法
类目: Computation and Language (cs.CL)
备注: 15 pages

点击查看摘要

Abstract:Recent works have demonstrated the effectiveness of retrieval augmentation in the Event Argument Extraction (EAE) task. However, existing retrieval-based EAE methods have two main limitations: (1) input length constraints and (2) the gap between the retriever and the inference model. These issues limit the diversity and quality of the retrieved information. In this paper, we propose a Compressive Memory-based Retrieval (CMR) mechanism for EAE, which addresses the two limitations mentioned above. Our compressive memory, designed as a dynamic matrix that effectively caches retrieved information and supports continuous updates, overcomes the limitations of the input length. Additionally, after pre-loading all candidate demonstrations into the compressive memory, the model further retrieves and filters relevant information from memory based on the input query, bridging the gap between the retriever and the inference model. Extensive experiments show that our method achieves new state-of-the-art performance on three public datasets (RAMS, WikiEvents, ACE05), significantly outperforming existing retrieval-based EAE methods.
摘要:最近的研究证明了检索增强在事件论元抽取(EAE)任务中的有效性。然而,现有基于检索的EAE方法有两个主要局限:(1)输入长度限制;(2)检索器与推理模型之间的差距。这些问题限制了检索信息的多样性和质量。本文提出了一种面向EAE的基于压缩记忆的检索(CMR)机制,解决了上述两个局限。我们的压缩记忆被设计为一个动态矩阵,能有效缓存检索到的信息并支持持续更新,从而克服了输入长度的限制。此外,在把所有候选示例预加载到压缩记忆之后,模型会根据输入查询进一步从记忆中检索并过滤相关信息,弥合了检索器与推理模型之间的差距。大量实验表明,我们的方法在三个公共数据集(RAMS、WikiEvents、ACE05)上取得了最先进的性能,显著优于现有基于检索的EAE方法。

[NLP-77] ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
[NLP-77] ODE:多模式大型语言模型中幻觉的开放集评估

链接: https://arxiv.org/abs/2409.09318
作者: Yahan Tu,Rui Hu,Jitao Sang
关键词-EN: multimodal large language, large language models, poses a significant, significant challenge, challenge for multimodal
关键词-ZH: 多模式大型语言、大型语言模型对多模式提出了一个重大的挑战
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hallucination poses a significant challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are static, which can lead to potential data contamination. This paper introduces ODE, an open-set, dynamic protocol for evaluating object existence hallucinations in MLLMs. Our framework employs graph structures to model associations between real-word concepts and generates novel samples for both general and domain-specific scenarios. The dynamic combination of concepts, along with various combination principles, ensures a broad sample distribution. Experimental results show that MLLMs exhibit higher hallucination rates with ODE-generated samples, effectively avoiding data contamination. Moreover, these samples can also be used for fine-tuning to improve MLLM performance on existing benchmarks.
摘要:幻觉对多模式大型语言模型(MLLM)构成了重大挑战。然而,评估幻觉的现有基准是静态的,这可能会导致潜在的数据污染。本文介绍了ODE,这是一种开放集、动态协议,用于评估MLLM中的对象存在幻觉。我们的框架采用图结构来建模真实单词概念之间的关联,并为一般和特定领域场景生成新颖的样本。概念的动态组合以及各种组合原则确保了广泛的样本分布。实验结果表明,MLLM对ODE生成的样本表现出更高的幻觉率,有效地避免了数据污染。此外,这些样本还可用于微调,以提高现有基准测试的MLLM性能。

[NLP-78] Language Models “Grok” to Copy
[NLP-78] 语言模型通过“顿悟”(Grok)学会复制

链接: https://arxiv.org/abs/2409.09281
作者: Ang Lv,Ruobing Xie,Xingwu Sun,Zhanhui Kang,Rui Yan
关键词-EN: LLM applications, including in-context learning, Transformer-based language models, retrieval-augmented generation, copy text
关键词-ZH: LLM应用程序,包括上下文学习、基于Transformer的语言模型、检索增强生成、复制文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 7 figures

点击查看摘要

Abstract:We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context–a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on test set long after the model fit to the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, similarly to how grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrated that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.
摘要:我们研究了语言模型的预训练动态,重点是它们从前文上下文复制文本的能力,这是各类LLM应用(包括上下文学习(ICL)和检索增强生成(RAG))的基本技能。我们提出了一种新的观点:基于Transformer的语言模型发展复制能力的方式类似于“顿悟”(grokking)现象,即在模型拟合训练集很久之后,测试集上突然出现泛化。我们的实验给出了三个论据:(1)预训练损失迅速下降,而模型的上下文复制能力起初滞后,随后突然饱和。(2)复制能力发展的速度与训练的词元数量无关,这类似于只要数据分布保持不变,顿悟速度就不受数据集大小影响。(3)负责复制的注意力头,即归纳头(induction heads),在训练过程中从浅层到深层逐步形成,这与顿悟过程中更深层电路的形成相呼应。我们认为,顿悟与上下文复制之间的联系可以为更有效的语言模型训练提供有价值的见解,最终提升上下文内的表现。例如,我们证明了促进顿悟的技术(如正则化)能够加速或增强上下文复制能力的发展。

[NLP-79] An empirical evaluation of using ChatGPT to summarize disputes for recommending similar labor and employment cases in Chinese
[NLP-79] 使用ChatGPT总结争议以推荐中文类似劳动和就业案例的实证评估

链接: https://arxiv.org/abs/2409.09280
作者: Po-Hsien Wu,Chao-Lin Liu,Wei-Jie Li
关键词-EN: recommending similar cases, disputes, employment litigations, mechanism for recommending, recommending similar
关键词-ZH: 推荐类似案件、纠纷、就业诉讼、推荐机制、推荐类似
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures, 2 tables, the 18th Int’l Workshop on Juris-Informatics (JURISIN 2024), associated with the 16th JSAI International Symposium on AI (JSAI-isAI 2024)

点击查看摘要

Abstract:We present a hybrid mechanism for recommending similar cases of labor and employment litigations. The classifier determines the similarity based on the itemized disputes of the two cases, that the courts prepared. We cluster the disputes, compute the cosine similarity between the disputes, and use the results as the features for the classification tasks. Experimental results indicate that this hybrid approach outperformed our previous system, which considered only the information about the clusters of the disputes. We replaced the disputes that were prepared by the courts with the itemized disputes that were generated by GPT-3.5 and GPT-4, and repeated the same experiments. Using the disputes generated by GPT-4 led to better results. Although our classifier did not perform as well when using the disputes that the ChatGPT generated, the results were satisfactory. Hence, we hope that the future large-language models will become practically useful.
摘要:我们提出了一种用于推荐类似劳动和就业诉讼案件的混合机制。分类器根据法院整理的两个案件的逐项争议点来判定相似性。我们对争议点进行聚类,计算争议点之间的余弦相似度,并将结果用作分类任务的特征。实验结果表明,这种混合方法优于我们之前仅考虑争议点聚类信息的系统。我们用GPT-3.5和GPT-4生成的逐项争议点替换法院整理的争议点,并重复了相同的实验。使用GPT-4生成的争议点可以获得更好的结果。尽管我们的分类器在使用ChatGPT生成的争议点时表现稍逊,但结果仍然令人满意。因此,我们希望未来的大语言模型能够变得切实可用。
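下面是一段按摘要思路写的极简示意代码(假设性实现,并非论文原始代码):用 TF-IDF 向量化两案的争议点并计算余弦相似度,再汇总为分类特征。

```python
# 极简示意:向量化争议点并计算余弦相似度作为分类特征(假设性实现)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

disputes_case_a = ["雇主未支付加班费", "解雇程序不符合劳动法"]
disputes_case_b = ["公司拖欠加班工资", "未依法定程序终止劳动合同"]

# 中文短文本这里简单地用字符 n-gram 向量化
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(disputes_case_a + disputes_case_b)

# 两案争议点两两之间的余弦相似度矩阵,可再汇总(如最大值、均值)作为分类特征
sim_matrix = cosine_similarity(vectors[:len(disputes_case_a)], vectors[len(disputes_case_a):])
print(sim_matrix)
print("分类特征(示例):", [sim_matrix.max(), sim_matrix.mean()])
```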

[NLP-80] Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks Domains and Knowledge Types
[NLP-80] 面向跨任务、领域和知识类型的视觉问答的视觉语言模型选择指南

链接: https://arxiv.org/abs/2409.09269
作者: Neelabh Sinha,Vinija Jain,Aman Chadha
关键词-EN: aid user experience, Visual Question-Answering, achieving good results, user experience, zero-shot inference
关键词-ZH: 辅助用户体验、视觉问答、取得良好效果、用户体验、零样本推理
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages + references + 6 pages of Appendix

点击查看摘要

Abstract:Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieving good results in zero-shot inference. But evaluating different VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveals that no single model excelling universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts, while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.
摘要:视觉问答(VQA)已经成为多种应用中辅助用户体验的关键用例,特别是在视觉语言模型(VLM)在零样本推理中取得良好效果之后。但在实际环境中使用标准化框架针对应用需求评估不同的VLM仍然具有挑战性。本文介绍了一个在实际环境中评估面向VQA任务的VLM的综合框架。我们提出了一个源自既有VQA基准的新数据集,并按任务类型、应用领域和知识类型进行标注,这是任务可能变化的三个关键实践维度。我们还引入了GoEval,这是一个使用GPT-4o开发的多模态评估指标,与人类判断的相关系数达到56.71%。我们对10个最先进的VLM进行的实验表明,没有哪个模型在所有场景下都表现出色,因此选择合适的模型是关键的设计决策。Gemini-1.5-Pro和GPT-4o-mini等专有模型总体上优于其他模型,而InternVL-2-8B和CogVLM-2-Llama-3-19B等开源模型在特定场景中表现出竞争力,同时提供额外优势。本研究可指导根据特定任务要求和资源约束选择VLM,也可以推广到其他视觉语言任务。

[NLP-81] What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing
[NLP-81] 我的模型出了什么问题?通过语义数据切片识别系统问题

链接: https://arxiv.org/abs/2409.09261
作者: Chenyang Yang,Yining Hong,Grace A. Lewis,Tongshuang Wu,Christian Kästner
关键词-EN: Machine learning models, models make mistakes, Machine learning, learning models make, make mistakes
关键词-ZH: 机器学习模型,模型犯错误,机器学习,学习模型犯错误,犯错误
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify relevant examples. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
摘要:机器学习模型会犯错误,但有时很难识别错误背后的系统性问题。实践者会开展各种活动,包括错误分析、测试、审计和红队测试,以形成关于模型可能出错(或已经出错)的假设。为了验证这些假设,实践者使用数据切片来找出相关的样例。然而,传统的数据切片受到可用特征和编程式切片函数的限制。在这项工作中,我们提出了SemSlicer,一个支持语义数据切片的框架,它无需现有特征即可识别出语义一致的切片。SemSlicer使用大语言模型来标注数据集,并根据任何用户定义的切片标准生成切片。我们表明,SemSlicer能以低成本生成准确的切片,允许在不同设计维度之间进行灵活的权衡,可靠地识别表现不佳的数据切片,并帮助从业者找出反映系统性问题的有用数据切片。
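下面用一段假设性的示意代码说明“用 LLM 按用户定义的标准给样本打标签、构造语义切片”的思路,提示词与接口均为虚构,并非 SemSlicer 的实际实现:

```python
# 极简示意:让 LLM 判断样本是否属于某个语义切片(假设性提示词与接口)
def in_slice(example, criterion, llm_call):
    """llm_call 是任意接受提示词、返回字符串的黑盒调用"""
    prompt = (
        f"切片标准:{criterion}\n"
        f"样本:{example}\n"
        "该样本是否符合上述标准?只回答“是”或“否”。"
    )
    return llm_call(prompt).strip().startswith("是")

# 演示时用一个规则函数代替真实的 LLM 调用
fake_llm = lambda p: "是" if "退款" in p else "否"
examples = ["用户要求退款但被拒绝", "用户咨询营业时间"]
print([e for e in examples if in_slice(e, "与退款纠纷相关的投诉", fake_llm)])
```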

[NLP-82] Analyzing Correlations Between Intrinsic and Extrinsic Bias Metrics of Static Word Embeddings With Their Measuring Biases Aligned
[NLP-82] 分析静态词嵌入的内在和外在偏差之间的相关性及其测量偏差对齐

链接: https://arxiv.org/abs/2409.09260
作者: Taisei Katô,Yusuke Miyao
关键词-EN: Natural Language Processing, Language Processing, Natural Language, bias metrics, extrinsic bias metrics
关键词-ZH: 自然语言处理、语言处理、自然语言、偏见指标、外在偏见指标
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We examine the abilities of intrinsic bias metrics of static word embeddings to predict whether Natural Language Processing (NLP) systems exhibit biased behavior. A word embedding is one of the fundamental NLP technologies that represents the meanings of words through real vectors, and problematically, it also learns social biases such as stereotypes. An intrinsic bias metric measures bias by examining a characteristic of vectors, while an extrinsic bias metric checks whether an NLP system trained with a word embedding is biased. A previous study found that a common intrinsic bias metric usually does not correlate with extrinsic bias metrics. However, the intrinsic and extrinsic bias metrics did not measure the same bias in most cases, which makes us question whether the lack of correlation is genuine. In this paper, we extract characteristic words from datasets of extrinsic bias metrics and analyze correlations with intrinsic bias metrics with those words to ensure both metrics measure the same bias. We observed moderate to high correlations with some extrinsic bias metrics but little to no correlations with the others. This result suggests that intrinsic bias metrics can predict biased behavior in particular settings but not in others. Experiment codes are available at GitHub.
摘要:我们考察了静态词嵌入的内在偏差度量能否预测自然语言处理(NLP)系统是否表现出带偏差的行为。词嵌入是通过实数向量表示词义的基础NLP技术之一,问题在于它也会学到刻板印象等社会偏差。内在偏差度量通过检查向量的特性来衡量偏差,而外在偏差度量则检查用词嵌入训练的NLP系统是否存在偏差。先前的一项研究发现,一种常用的内在偏差度量通常与外在偏差度量不相关。然而,在大多数情况下,内在和外在偏差度量测量的并不是同一种偏差,这让我们质疑这种相关性的缺失是否真实。在本文中,我们从外在偏差度量的数据集中提取特征词,并用这些词分析与内在偏差度量的相关性,以确保两类度量测量的是同一种偏差。我们观察到与一些外在偏差度量有中度到高度的相关性,但与其他度量几乎没有相关性。这一结果表明,内在偏差度量可以预测特定场景下的偏差行为,但不能预测其他场景。实验代码可在GitHub上获取。
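下面是一个示意性的小例子(数据为虚构),说明文中“内在/外在偏差度量相关性分析”对应的基本计算:对若干嵌入模型分别取两类度量得分,再计算皮尔逊相关。

```python
# 极简示意:内在偏差得分与外在偏差得分的相关性分析(数据为虚构)
from scipy.stats import pearsonr

# 假设对 5 个词嵌入模型分别算出内在偏差得分(如 WEAT 效应量)与外在偏差得分(如下游任务的性能差距)
intrinsic_scores = [0.35, 0.80, 0.15, 0.62, 0.48]
extrinsic_scores = [0.10, 0.42, 0.05, 0.30, 0.22]

r, p_value = pearsonr(intrinsic_scores, extrinsic_scores)
print(f"Pearson r = {r:.3f}, p = {p_value:.3f}")
```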

[NLP-83] Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator
[NLP-83] 通过协调双塔动态语义令牌生成器释放LLM的推荐潜力

链接: https://arxiv.org/abs/2409.09253
作者: Jun Yin,Zhengxin Zeng,Mingzheng Li,Hao Yan,Chaozhuo Li,Weihao Han,Jianjin Zhang,Ruochen Liu,Allen Sun,Denvy Deng,Feng Sun,Qi Zhang,Shirui Pan,Senzhang Wang
关键词-EN: large language models, pre-trained large language, shown fantastic potential, next-generation recommender systems, semantic index
关键词-ZH: 大型语言模型、预训练的大型语言、显示出巨大的潜力、下一代推荐系统、语义索引
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Owing to the unprecedented capability in semantic understanding and logical reasoning, the pre-trained large language models (LLMs) have shown fantastic potential in developing the next-generation recommender systems (RSs). However, the static index paradigm adopted by current methods greatly restricts the utilization of LLMs capacity for recommendation, leading to not only the insufficient alignment between semantic and collaborative knowledge, but also the neglect of high-order user-item interaction patterns. In this paper, we propose Twin-Tower Dynamic Semantic Recommender (TTDS), the first generative RS which adopts dynamic semantic index paradigm, targeting at resolving the above problems simultaneously. To be more specific, we for the first time contrive a dynamic knowledge fusion framework which integrates a twin-tower semantic token generator into the LLM-based recommender, hierarchically allocating meaningful semantic index for items and users, and accordingly predicting the semantic index of target item. Furthermore, a dual-modality variational auto-encoder is proposed to facilitate multi-grained alignment between semantic and collaborative knowledge. Eventually, a series of novel tuning tasks specially customized for capturing high-order user-item interaction patterns are proposed to take advantages of user historical behavior. Extensive experiments across three public datasets demonstrate the superiority of the proposed methodology in developing LLM-based generative RSs. The proposed TTDS recommender achieves an average improvement of 19.41% in Hit-Rate and 20.84% in NDCG metric, compared with the leading baseline methods.
摘要:预训练大语言模型具有前所未有的语义理解和逻辑推理能力,在开发下一代推荐系统(RS)方面显示出巨大的潜力。然而,现有方法所采用的静态索引范式极大地限制了LLM推荐能力的发挥,不仅导致语义知识和协同知识之间的对齐不足,而且忽视了高阶用户-物品交互模式。针对上述问题,本文提出了双塔动态语义推荐系统(Twin-Tower Dynamic Semantic Recommender,TTDS),这是第一个采用动态语义索引范式的生成式推荐系统。具体地说,我们首次设计了一个动态知识融合框架,将双塔语义令牌生成器集成到基于LLM的推荐器中,分层地为物品和用户分配有意义的语义索引,并相应地预测目标物品的语义索引。此外,还提出了一种双模态变分自动编码器,以促进语义知识和协同知识之间的多粒度对齐。最后,为了充分利用用户历史行为,提出了一系列专门面向高阶用户-物品交互模式的调优任务。在三个公共数据集上的大量实验证明了该方法在开发基于LLM的生成式推荐系统方面的优越性。与领先的基线方法相比,TTDS推荐方法的命中率平均提高了19.41%,NDCG指标平均提高了20.84%。

[NLP-84] NovAScore: A New Automated Metric for Evaluating Document Level Novelty
[NLP-84] NovAScore:评估文档级新颖性的新自动化指标

链接: https://arxiv.org/abs/2409.09249
作者: Lin Ai,Ziwei Gong,Harshsaiprasad Deshpande,Alexander Johnson,Emmy Phung,Ahmad Emami,Julia Hirschberg
关键词-EN: rapid expansion, expansion of online, online content, content has intensified, intensified the issue
关键词-ZH: 快速扩张,线上扩张,线上内容、内容加剧,问题愈演愈烈
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid expansion of online content has intensified the issue of information redundancy, underscoring the need for solutions that can identify genuinely new information. Despite this challenge, the research community has seen a decline in focus on novelty detection, particularly with the rise of large language models (LLMs). Additionally, previous approaches have relied heavily on human annotation, which is time-consuming, costly, and particularly challenging when annotators must compare a target document against a vast number of historical documents. In this work, we introduce NovAScore (Novelty Evaluation in Atomicity Score), an automated metric for evaluating document-level novelty. NovAScore aggregates the novelty and salience scores of atomic information, providing high interpretability and a detailed analysis of a document’s novelty. With its dynamic weight adjustment scheme, NovAScore offers enhanced flexibility and an additional dimension to assess both the novelty level and the importance of information within a document. Our experiments show that NovAScore strongly correlates with human judgments of novelty, achieving a 0.626 Point-Biserial correlation on the TAP-DLND 1.0 dataset and a 0.920 Pearson correlation on an internal human-annotated dataset.
摘要:在线内容的快速扩张加剧了信息冗余的问题,凸显了对能够识别真正新信息的解决方案的需求。尽管存在这一挑战,研究界对新颖性检测的关注却有所下降,尤其是在大语言模型(LLM)兴起之后。此外,以前的方法严重依赖人工标注,当标注人员必须将目标文档与大量历史文档进行比较时,人工标注既耗时又昂贵,尤其具有挑战性。在这项工作中,我们介绍了NovAScore(原子性评分中的新颖性评估),这是一个用于评估文档级新颖性的自动化度量。NovAScore汇总原子信息的新颖性和显著性得分,提供了高度的可解释性和对文档新颖性的详细分析。通过动态权重调整方案,NovAScore提供了更高的灵活性和额外的维度,可同时评估文档中信息的新颖程度和重要性。我们的实验表明,NovAScore与人类对新颖性的判断高度相关,在TAP-DLND 1.0数据集上取得了0.626的点二列相关(Point-Biserial),在内部人工标注数据集上取得了0.920的皮尔逊相关。
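下面给出一个假设性的汇总公式示意(并非 NovAScore 的原始定义),说明“按显著性加权汇总原子信息新颖性”的思路:

```python
# 极简示意:按显著性加权汇总原子信息的新颖性得分(假设性公式)
def document_novelty(atomic_units):
    """atomic_units: [(novelty, salience), ...],均为 0~1 之间的分数"""
    total = sum(salience for _, salience in atomic_units)
    if total == 0:
        return 0.0
    return sum(novelty * salience for novelty, salience in atomic_units) / total

units = [(0.9, 0.8), (0.2, 0.5), (0.7, 0.1)]  # 假设某文档拆出 3 条原子信息
print(f"文档级新颖性(示例): {document_novelty(units):.3f}")
```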

[NLP-85] Robust Training of Neural Networks at Arbitrary Precision and Sparsity
[NLP-85] 任意精度和稀疏性下的神经网络鲁棒训练

链接: https://arxiv.org/abs/2409.09245
作者: Chengxi Ye,Grace Chu,Yanfeng Liu,Yichi Zhang,Lukasz Lew,Andrew Howard
关键词-EN: sparsification introduce obstacles, discontinuous operations inherent, obstacles to backpropagation, discontinuous operations, introduce obstacles
关键词-ZH: 稀疏化引入障碍,固有的不连续操作,反向传播的障碍,不连续操作,引入障碍
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.
摘要:量化和稀疏化所固有的不连续运算给反向传播带来了障碍。在超低精度和稀疏条件下训练深度神经网络时,这一点尤其具有挑战性。我们提出了一种新颖、稳健且通用的解决方案:一种去噪仿射变换,可在这些苛刻条件下稳定训练。通过将量化和稀疏化表示为训练过程中的扰动,我们得到了一种基于岭回归的抗扰动方法。我们的方案使用分段常数的主干模型来保证性能下限,并借助内在的降噪机制来缓解扰动导致的损坏。这一表述使得现有模型可以用现成的训练方案在任意低的精度和稀疏度水平下训练。此外,我们的方法为训练时序二值神经网络提供了一个新的视角,有助于缩小人工神经网络和生物神经网络之间的差距。
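下面用一个玩具例子示意“把量化视为扰动、用岭回归闭式解 w = (XᵀX + λI)⁻¹Xᵀy 求抗扰动映射”的思路(假设性实验,并非论文中去噪仿射变换的实现):

```python
# 极简示意:把量化当作输入扰动,用岭回归闭式解恢复线性映射(假设性玩具实验)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))        # 干净的激活
w_true = rng.normal(size=(16, 1))
y = X @ w_true
X_quant = np.round(X * 4) / 4         # 以 0.25 为步长的粗暴量化,作为训练时的扰动

lam = 1.0
w_ridge = np.linalg.solve(X_quant.T @ X_quant + lam * np.eye(16), X_quant.T @ y)
print("相对恢复误差:", float(np.linalg.norm(w_ridge - w_true) / np.linalg.norm(w_true)))
```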

[NLP-86] Autoregressive Chain of Thought (CoT) ≃ Recurrent: Recurrence's Role in Language Models and a Revisit of Recurrent Transformer
[NLP-86] 自回归思维链(CoT)≃循环:循环在语言模型中的作用与循环Transformer的再审视

链接: https://arxiv.org/abs/2409.09239
作者: Xiang Zhang,Muhammad Abdul-Mageed,Laks V.S. Lakshmanan
关键词-EN: RNN and LSTM, Transformer architecture excels, outperforming traditional neural, outperforming traditional, Transformer
关键词-ZH: RNN和LSTM,Transformer架构表现出色,优于传统神经,优于传统,Transformer
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Transformer architecture excels in a variety of language modeling tasks, outperforming traditional neural architectures such as RNN and LSTM. This is partially due to its elimination of recurrent connections, which allows for parallel training and a smoother flow of gradients. However, this move away from recurrent structures places the Transformer model at the lower end of Chomsky’s computational hierarchy, imposing limitations on its computational abilities. Consequently, even advanced Transformer-based models face considerable difficulties in tasks like counting, string reversal, bracket pairing, and multiplication. These tasks, though seemingly elementary, require a level of computational complexity that exceeds the capabilities of the Transformer architecture. Concurrently, the emergence of "Chain of Thought" (CoT) prompting has enabled Transformer-based language models to tackle tasks that were previously impossible or poorly executed. Despite some previous research primarily interpreting CoT from a psychological perspective, a comprehensive understanding of why CoT proves so effective in the reasoning process remains elusive. In this work, we thoroughly investigate the influence of recurrent structures in language models on their reasoning abilities, shedding light on how the CoT approach can mimic recurrent computation and act as a bridge between autoregression and recurrence. It is this approximated recurrence that notably improves the model's performance and computational capacity. Moreover, we revisit recent recurrent-based Transformer model designs, focusing on their computational abilities through our proposed concept of "recurrence-completeness" and identify key theoretical limitations in models like Linear Transformer and RWKV. Through this, we aim to provide insight into the neural model architectures and prompt better model design.
摘要:Transformer架构在各种语言建模任务中表现出色,优于RNN和LSTM等传统神经架构。这在一定程度上是因为它去掉了循环连接,使得并行训练和梯度流动更加顺畅。然而,这种对循环结构的舍弃也将Transformer模型置于乔姆斯基计算层次结构的较低层级,限制了其计算能力。因此,即使是先进的基于Transformer的模型,在计数、字符串反转、括号配对和乘法等任务中也面临相当大的困难。这些任务虽然看似基础,但所需的计算复杂度超出了Transformer架构的能力范围。与此同时,“思维链”(CoT)提示的出现使基于Transformer的语言模型能够处理以前无法完成或完成得很差的任务。尽管此前的一些研究主要从心理学角度解释CoT,但对于CoT为何在推理过程中如此有效,仍缺乏全面的理解。在这项工作中,我们深入研究了语言模型中循环结构对其推理能力的影响,揭示了CoT方法如何模拟循环计算,并在自回归和循环之间起到桥梁作用。正是这种近似的循环显著提升了模型的性能和计算能力。此外,我们回顾了近期基于循环的Transformer模型设计,通过我们提出的“循环完备性”概念考察它们的计算能力,并指出了Linear Transformer和RWKV等模型的关键理论局限。借此,我们希望为神经模型架构提供洞见,并促进更好的模型设计。

[NLP-87] Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
[NLP-87] 多模式语音Transformer解码器:多模式何时可以提高准确性?

链接: https://arxiv.org/abs/2409.09221
作者: Yiwen Guan,Viet Anh Trinh,Vivek Voleti,Jacob Whitehill
关键词-EN: Decoder-only discrete-token language, discrete-token language models, recently achieved significant, achieved significant success, Decoder-only discrete-token
关键词-ZH: 仅解码器的离散令牌语言,离散令牌语言模型,最近取得了重大成果,取得了重大成功,仅解码器的离散令牌
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Decoder-only discrete-token language models have recently achieved significant success in automatic speech recognition. However, systematic analyses of how different modalities impact performance in specific scenarios remain limited. In this paper, we investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Our experiments suggest that: (1) Integrating more modalities can increase accuracy; in particular, our paper is, to our best knowledge, the first to show the benefit of combining audio, image context, and lip information; (2) Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels, moreover, they exhibit a different trend compared to inherently synchronized modalities like lip movements; (3) Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
摘要:仅解码器的离散令牌语言模型最近在自动语音识别方面取得了重大成功。然而,对不同模式在特定场景下如何影响性能的系统分析仍然有限。在本文中,我们研究了多种模式对合成和现实世界数据集识别准确性的影响。我们的实验表明:(1)集成更多的模式可以提高准确性;特别是,据我们所知,我们的论文是第一篇展示了结合音频、图像上下文和嘴唇信息的好处的论文;(2)图像作为语音识别的补充模式在中等噪音水平下提供了最大的好处,此外,与嘴唇运动等固有同步的模式相比,它们表现出不同的趋势;(3)当过滤最相关的视觉信息作为预处理步骤时,合成和现实世界数据集的性能都会得到改善。

[NLP-88] Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases
[NLP-88] 用于分类热带和传染病的大型语言模型的上下文评估

链接: https://arxiv.org/abs/2409.09201
作者: Mercy Asiedu,Nenad Tomasev,Chintan Ghate,Tiya Tiyasirichokchai,Awa Dieng,Oluwatosin Akande,Geoffrey Siwo,Steve Adudans,Sylvanus Aitkins,Odianosen Ehiakhamen,Katherine Heller
关键词-EN: large language models, limited work focused, infectious disease-specific exploration, medical question answering, language models
关键词-ZH: 大型语言模型、有限的工作重点、传染病特定探索、医学问题回答、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific exploration. We build on an opensource tropical and infectious diseases (TRINDs) dataset, expanding it to include demographic and semantic clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM performance on these, comparing generalist and medical LLMs, as well as LLM outcomes to human experts. We demonstrate through systematic experimentation, the benefit of contextual information such as demographics, location, gender, risk factors for optimal LLM response. Finally we develop a prototype of TRINDs-LM, a research tool that provides a playground to navigate how context impacts LLM outputs for health.
摘要:虽然大型语言模型(LLM)在医学问题回答方面表现出了希望,但专注于热带和传染病特定探索的工作有限。我们建立在开源热带和传染病(TRIND)数据集的基础上,将其扩展到包括人口统计和语义临床和消费者增强,产生11000多个提示。我们评估LLM在这些方面的表现,将多面手和医学LLM以及LLM结果与人类专家进行比较。我们通过系统实验证明了人口统计、地点、性别、风险因素等背景信息对于最佳LLM响应的好处。最后,我们开发了TRINDs-LM的原型,这是一个研究工具,可以提供一个平台来探索上下文如何影响LLM健康输出。

[NLP-89] Transformer with Controlled Attention for Synchronous Motion Captioning
[NLP-89] 同步运动字幕的具有受控注意力的转换器

链接: https://arxiv.org/abs/2409.09177
作者: Karim Radouane,Sylvie Ranwez,Julien Lagarde,Andon Tchechmedjiev
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:


[NLP-90] Towards Precision Characterization of Communication Disorders using Models of Perceived Pragmatic Similarity ICASSP2025
[NLP-90] 基于感知语用相似性模型的沟通障碍精确刻画

链接: https://arxiv.org/abs/2409.09170
作者: Nigel G. Ward,Andres Segura,Georgina Bugarini,Heike Lehnert-LeHouillier,Dancheng Liu,Jinjun Xiong,Olac Fuentes
类目: Computation and Language (cs.CL)
备注: submitted to IEEE ICASSP 2025


[NLP-91] DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
[NLP-91] DomURL_BERT:用于恶意域名和URL检测和分类的预训练的基于BERT的模型

链接: https://arxiv.org/abs/2409.09143
作者: Abdelkader El Mahdaouy,Salima Lamsiyah,Meryem Janati Idrissi,Hamza Alami,Zakaria Yartaoui,Ismail Berrada
关键词-EN: malicious domains, Domain Generation Algorithms, URLs, BERT, detect malicious domains
关键词-ZH: 恶意域、域生成算法、URL、BERT、检测恶意域
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Detecting and classifying suspicious or malicious domain names and URLs is fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklists maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithms (DGA) dataset. In order to assess the performance of DomURLs_BERT, we have conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluations results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiments source code are publicly available.
摘要:检测和分类可疑或恶意的域名和URL是网络安全的基本任务。为了利用这些失陷指标,网络安全厂商和从业者经常维护和更新已知恶意域名和URL的黑名单。然而,黑名单往往无法识别新出现的和经过混淆的威胁。在过去的几十年里,人们对开发能自动检测恶意域名和URL的机器学习模型产生了浓厚兴趣,以解决黑名单维护和更新的局限。本文介绍了DomURLs_BERT,一种基于BERT的预训练编码器,适用于检测和分类可疑/恶意的域名和URL。DomURLs_BERT使用掩码语言建模(MLM)目标,在包含URL、域名和域名生成算法(DGA)数据集的大型多语言语料库上进行预训练。为了评估DomURLs_BERT的性能,我们在多个涉及域名和URL的二分类和多分类任务上进行了实验,涵盖网络钓鱼、恶意软件、DGA和DNS隧道。评估结果表明,所提出的编码器在多个任务和数据集上均优于最先进的基于字符的深度学习模型和面向网络安全的BERT模型。预训练数据集、预训练的DomURLs_BERT编码器以及实验源代码均已公开。

[NLP-92] Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation
[NLP-92] 与LLM的多模式融合用于自然对话中的参与度预测

链接: https://arxiv.org/abs/2409.09135
作者: Cheng Charles Ma,Kevin Hyekang Joo,Alexandria K. Vail,Sunreeta Bhattacharya,Álvaro Fernández García,Kailana Baker-Matsuoka,Sheryl Mathew,Lori L. Holt,Fernando De la Torre
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 22 pages, first three authors equal contribution


[NLP-93] AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
[NLP-93] AccentBox:迈向高保真零样本口音生成

链接: https://arxiv.org/abs/2409.09098
作者: Jinzuomu Zhong,Korin Richmond,Zhiba Su,Siqi Sun
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:


[NLP-94] An Evaluation of GPT-4V for Transcribing the Urban Renewal Hand-Written Collection
[NLP-94] GPT-4V转录城市更新手写档案集的效果评估

链接: https://arxiv.org/abs/2409.09090
作者: Myeong Lee,Julia H.P. Hsu
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注: Published in Digital Humanities (DH 2024). Aug 6-9. Arlington, VA


[NLP-95] United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament Elections
[NLP-95] 多样性中的团结?基于LLM的2024年欧洲议会选举预测中的上下文偏差

链接: https://arxiv.org/abs/2409.09045
作者: Leah von der Heyde,Anna-Carolina Haensch,Alexander Wenz
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
备注:


[NLP-96] Acceptable Use Policies for Foundation Models
[NLP-96] 基础模型的可接受使用政策

链接: https://arxiv.org/abs/2409.09041
作者: Kevin Klyman
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 2 tables


[NLP-97] ChatSUMO: Large Language Model for Automating Traffic Scenario Generation in Simulation of Urban MObility
[NLP-97] ChatSUMO:城市移动性模拟中自动生成交通场景的大型语言模型

链接: https://arxiv.org/abs/2409.09040
作者: Shuyang Li,Talha Azfar,Ruimin Ke
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:


[NLP-98] Exploring Diverse Methods in Visual Question Answering
[NLP-98] 探索视觉问题解答的多种方法

链接: https://arxiv.org/abs/2404.13565
作者: Panfeng Li,Qikai Yang,Xieming Geng,Wenjing Zhou,Zhicheng Ding,Yi Nian
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence


[NLP-99] An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems ICML2024
[NLP-99] 交互式口语对话系统的高效自学习框架

链接: https://arxiv.org/abs/2409.10515
作者: Hitesh Tulsiani,David M. Chan,Shalini Ghosh,Garima Lalwani,Prabhat Pandey,Ankish Bansal,Sri Garimella,Ariya Rastrow,Björn Hoffmeister
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Presented at ICML 2024


[NLP-100] Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages
[NLP-100] Meta-Whisper:基于语音的Meta-ICL,用于低资源语言的ASR

链接: https://arxiv.org/abs/2409.10429
作者: Ming-Hao Hsu,Kuan Po Huang,Hung-yi Lee
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:


[NLP-101] ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
[NLP-101] ReCLAP:通过描述声音改进零样本音频分类

链接: https://arxiv.org/abs/2409.09213
作者: Sreyan Ghosh,Sonal Kumar,Chandra Kiran Reddy Evuru,Oriol Nieto,Ramani Duraiswami,Dinesh Manocha
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Code and Checkpoints: this https URL


[NLP-102] Harnessing Earnings Reports for Stock Predictions: A QLoRA-Enhanced LLM Approach
[NLP-102] 利用收益报告进行股票预测:QLoRA增强的LLM方法

链接: https://arxiv.org/abs/2408.06634
作者: Haowei Ni,Shuchen Meng,Xupeng Chen,Ziqing Zhao,Andi Chen,Panfeng Li,Shiyao Zhang,Qifu Yin,Yuanqing Wang,Yuxi Chan
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems


人工智能

[AI-0] MusicLIME: Explainable Multimodal Music Understanding

链接: https://arxiv.org/abs/2409.10496
作者: Theodoros Sotirou,Vassilis Lyberatos,Orfeas Menis Mastromichalakis,Giorgos Stamou
关键词-EN: capture the complex, complex interplay, music understanding tasks, multimodal music models, audio and lyrics
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: GitHub repository: this https URL

点击查看摘要

Abstract:Multimodal models are critical for music understanding tasks, as they capture the complex interplay between audio and lyrics. However, as these models become more prevalent, the need for explainability grows-understanding how these systems make decisions is vital for ensuring fairness, reducing bias, and fostering trust. In this paper, we introduce MusicLIME, a model-agnostic feature importance explanation method designed for multimodal music models. Unlike traditional unimodal methods, which analyze each modality separately without considering the interaction between them, often leading to incomplete or misleading explanations, MusicLIME reveals how audio and lyrical features interact and contribute to predictions, providing a holistic view of the model’s decision-making. Additionally, we enhance local explanations by aggregating them into global explanations, giving users a broader perspective of model behavior. Through this work, we contribute to improving the interpretability of multimodal music models, empowering users to make informed choices, and fostering more equitable, fair, and transparent music understanding systems.

[AI-1] Flash STU: Fast Spectral Transform Units

链接: https://arxiv.org/abs/2409.10489
作者: Y. Isabel Liu,Windsor Nguyen,Yagiz Devre,Evan Dogariu,Anirudha Majumdar,Elad Hazan
关键词-EN: Spectral Transform Unit, open source PyTorch, source PyTorch implementation, Transform Unit, Spectral Transform
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper describes an efficient, open source PyTorch implementation of the Spectral Transform Unit. We investigate sequence prediction tasks over several modalities including language, robotics, and simulated dynamical systems. We find that for the same parameter count, the STU and its variants outperform the Transformer as well as other leading state space models across various modalities.

[AI-2] Do Pre-trained Vision-Language Models Encode Object States?

链接: https://arxiv.org/abs/2409.10488
作者: Kaleb Newman,Shijie Wang,Yuan Zang,David Heffren,Chen Sun
关键词-EN: sliced apple, evolve over time, capture the temporal, temporal dynamics, encode object states
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g. a whole apple into a sliced apple). Our paper aims to investigate if VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects’ physical states. Through extensive experiments, we identify three areas for improvements for VLMs to better encode object states, namely the quality of object localization, the architecture to bind concepts to objects, and the objective to learn discriminative visual and language encoders on object states. Data and code are released.
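下面是一段示意代码(模型名、图片路径均为假设,并非论文的评测脚本),展示用 CLIP 风格的零样本文本提示区分物体状态的基本做法:

```python
# 极简示意:用零样本文本提示判断物体状态(模型与图片均为假设)
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple.jpg")  # 假设的本地图片
prompts = ["a photo of a whole apple", "a photo of a sliced apple"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```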

[AI-3] Exploring 3D Face Reconstruction and Fusion Methods for Face Verification: A Case-Study in Video Surveillance ECCV2024

链接: https://arxiv.org/abs/2409.10481
作者: Simone Maurizio La Cava,Sara Concas,Ruben Tolosana,Roberto Casula,Giulia Orrù,Martin Drahansky,Julian Fierrez,Gian Luca Marcialis
关键词-EN: specific assumptions tailored, distinct application scenarios, based on specific, tailored to distinct, specific assumptions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at T-CAP - Towards a Complete Analysis of People: Fine-grained Understanding for Real-World Applications, workshop in conjunction with the 18th European Conference on Computer Vision ECCV 2024

点击查看摘要

Abstract:3D face reconstruction (3DFR) algorithms are based on specific assumptions tailored to distinct application scenarios. These assumptions limit their use when acquisition conditions, such as the subject’s distance from the camera or the camera’s characteristics, are different than expected, as typically happens in video surveillance. Additionally, 3DFR algorithms follow various strategies to address the reconstruction of a 3D shape from 2D data, such as statistical model fitting, photometric stereo, or deep learning. In the present study, we explore the application of three 3DFR algorithms representative of the SOTA, employing each one as the template set generator for a face verification system. The scores provided by each system are combined by score-level fusion. We show that the complementarity induced by different 3DFR algorithms improves performance when tests are conducted at never-seen-before distances from the camera and camera characteristics (cross-distance and cross-camera settings), thus encouraging further investigations on multiple 3DFR-based approaches.

[AI-4] MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion ECCV2024

链接: https://arxiv.org/abs/2409.10473
作者: Lehong Wu,Lilang Lin,Jiahang Zhang,Yiyang Ma,Jiaying Liu
关键词-EN: human action understanding, skeleton-based human action, Self-supervised learning, action understanding, learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning that suffers false negative problems or are based on reconstruction that learns too much unessential low-level clues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task to model the general underlying data distributions. However, the representation learning capacity of generative models is under-explored, especially for the skeletons with spacial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to encoder inputs to introduce a information bottleneck and remove redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective involves the contrastive learning objective which aligns the masked and noisy views. Meanwhile, it also enforces the representation to complement for the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining the competence for generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing the fine-tuning performance in scenarios with scarce labeled data. Our project is available at this https URL.

[AI-5] HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models

链接: https://arxiv.org/abs/2409.10419
作者: Vineet Bhat,Prashanth Krishnamurthy,Ramesh Karri,Farshad Khorrami
关键词-EN: Referring Grasp Synthesis, Grasp Synthesis, Referring Grasp, unlock numerous applications, Robots interacting
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Robots interacting with humans through natural language can unlock numerous applications such as Referring Grasp Synthesis (RGS). Given a text query, RGS determines a stable grasp pose to manipulate the referred object in the robot’s workspace. RGS comprises two steps: visual grounding and grasp pose estimation. Recent studies leverage powerful Vision-Language Models (VLMs) for visually grounding free-flowing natural language in real-world robotic execution. However, comparisons in complex, cluttered environments with multiple instances of the same object are lacking. This paper introduces HiFi-CS, featuring hierarchical application of Featurewise Linear Modulation (FiLM) to fuse image and text embeddings, enhancing visual grounding for complex attribute rich text queries encountered in robotic grasping. Visual grounding associates an object in 2D/3D space with natural language input and is studied in two scenarios: Closed and Open Vocabulary. HiFi-CS features a lightweight decoder combined with a frozen VLM and outperforms competitive baselines in closed vocabulary settings while being 100x smaller in size. Our model can effectively guide open-set object detectors like GroundedSAM to enhance open-vocabulary performance. We validate our approach through real-world RGS experiments using a 7-DOF robotic arm, achieving 90.33% visual grounding accuracy in 15 tabletop scenes. We include our codebase in the supplementary material.

[AI-6] A Knowledge-Enhanced Disease Diagnosis Method Based on Prompt Learning and BERT Integration

链接: https://arxiv.org/abs/2409.10403
作者: Zhang Zheng
关键词-EN: prompt learning framework, diagnosis method based, knowledge-enhanced disease diagnosis, learning framework, paper proposes
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Knowledge Enhancement,Disease Diagnosis,Prompt Learning,BERT,Knowledge Graph

点击查看摘要

Abstract:This paper proposes a knowledge-enhanced disease diagnosis method based on a prompt learning framework. The method retrieves structured knowledge from external knowledge graphs related to clinical cases, encodes it, and injects it into the prompt templates to enhance the language model’s understanding and reasoning capabilities for the task. We conducted experiments on three public datasets: CHIP-CTC, IMCS-V2-NER, and KUAKE-QTR. The results show that the proposed method significantly outperforms existing models across multiple evaluation metrics, with an F1 score improvement of 2.4% on the CHIP-CTC dataset, 3.1% on the IMCS-V2-NER dataset, and 4.2% on the KUAKE-QTR dataset. Additionally, ablation studies confirmed the critical role of the knowledge injection module, as the removal of this module resulted in a significant drop in F1 score. The experimental results demonstrate that the proposed method not only effectively improves the accuracy of disease diagnosis but also enhances the interpretability of the predictions, providing more reliable support and evidence for clinical diagnosis.
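下面用一段假设性的示意代码说明“把知识图谱三元组注入提示模板”的基本形式,模板与三元组均为虚构,并非论文的实际实现:

```python
# 极简示意:将检索到的知识图谱三元组拼入提示模板(模板与三元组均为假设)
def build_prompt(clinical_text, kg_triples):
    knowledge = "\n".join(f"- {h} {r} {t}" for h, r, t in kg_triples)
    return (
        "以下是与病例相关的医学知识:\n"
        f"{knowledge}\n\n"
        f"病例描述:{clinical_text}\n"
        "请结合上述知识判断最可能的诊断,并说明理由。"
    )

triples = [("持续性咳嗽", "常见于", "慢性支气管炎"), ("慢性支气管炎", "危险因素", "长期吸烟")]
print(build_prompt("患者男性,58 岁,长期吸烟,咳嗽三个月", triples))
```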

[AI-7] Instigating Cooperation among LLM Agents Using Adaptive Information Modulation

链接: https://arxiv.org/abs/2409.10372
作者: Qiliang Chen,Alireza(Sepehr)Ilami,Nunzio Lore,Babak Heydari
关键词-EN: combining LLM agents, framework combining LLM, evolving strategic interactions, combining LLM, human strategic behavior
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:This paper introduces a novel framework combining LLM agents as proxies for human strategic behavior with reinforcement learning (RL) to engage these agents in evolving strategic interactions within team environments. Our approach extends traditional agent-based simulations by using strategic LLM agents (SLA) and introducing dynamic and adaptive governance through a pro-social promoting RL agent (PPA) that modulates information access across agents in a network, optimizing social welfare and promoting pro-social behavior. Through validation in iterative games, including the prisoner dilemma, we demonstrate that SLA agents exhibit nuanced strategic adaptations. The PPA agent effectively learns to adjust information transparency, resulting in enhanced cooperation rates. This framework offers significant insights into AI-mediated social dynamics, contributing to the deployment of AI in real-world team settings.

[AI-8] Robust image representations with counterfactual contrastive learning

链接: https://arxiv.org/abs/2409.10365
作者: Mélanie Roschewitz,Fabio De Sousa Ribeiro,Tian Xia,Galvin Khara,Ben Glocker
关键词-EN: increase model generalisation, Contrastive, contrastive learning, counterfactual contrastive learning, counterfactual contrastive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code available at this https URL

点击查看摘要

Abstract:Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Our method, evaluated across five datasets encompassing both chest radiography and mammography data, for two established contrastive objectives (SimCLR and DINO-v2), outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance on both in-distribution and on external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning substantially improving subgroup performance across biological sex.

[AI-9] Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation

链接: https://arxiv.org/abs/2409.10350
作者: Yifan Xu,Ziming Luo,Qianwei Wang,Vineet Kamat,Carol Menassa
关键词-EN: scene graph generation, open-vocabulary scene graph, posed RGB-D images, algorithms highly rely, RGB-D images
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:Current open-vocabulary scene graph generation algorithms highly rely on both 3D scene point cloud data and posed RGB-D images and thus have limited applications in scenarios where RGB-D images or camera poses are not readily available. To solve this problem, we propose Point2Graph, a novel end-to-end point cloud-based 3D open-vocabulary scene graph generation framework in which the requirement of posed RGB-D image series is eliminated. This hierarchical framework contains room and object detection/segmentation and open-vocabulary classification. For the room layer, we leverage the advantage of merging the geometry-based border detection algorithm with the learning-based region detection to segment rooms and create a “Snap-Lookup” framework for open-vocabulary room classification. In addition, we create an end-to-end pipeline for the object layer to detect and classify 3D objects based solely on 3D point cloud data. Our evaluation results show that our framework can outperform the current state-of-the-art (SOTA) open-vocabulary object and room segmentation and classification algorithm on widely used real-scene datasets.

[AI-10] Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation

链接: https://arxiv.org/abs/2409.10343
作者: Tianrui Song,Wenshuo Chao,Hao Liu
关键词-EN: unavoidably confronts noise, Implicit feedback, build recommender systems, unavoidably confronts, position bias
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Implicit feedback, often used to build recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias. Previous studies have attempted to alleviate this by identifying noisy samples based on their diverged patterns, such as higher loss values, and mitigating the noise through sample dropping or reweighting. Despite the progress, we observe existing approaches struggle to distinguish hard samples and noise samples, as they often exhibit similar patterns, thereby limiting their effectiveness in denoising recommendations. To address this challenge, we propose a Large Language Model Enhanced Hard Sample Denoising (LLMHD) framework. Specifically, we construct an LLM-based scorer to evaluate the semantic consistency of items with the user preference, which is quantified based on summarized historical user interactions. The resulting scores are used to assess the hardness of samples for the pointwise or pairwise training objectives. To ensure efficiency, we introduce a variance-based sample pruning strategy to filter potential hard samples before scoring. Besides, we propose an iterative preference update module designed to continuously refine summarized user preference, which may be biased due to false-positive user-item interactions. Extensive experiments on three real-world datasets and four backbone recommenders demonstrate the effectiveness of our approach.
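下面是一段示意代码(数据与阈值均为假设),说明“基于损失方差先筛出疑似困难样本、再交给 LLM 打分”的流程思路:

```python
# 极简示意:基于损失方差筛选候选困难样本,之后再做 LLM 评分(数据为虚构)
import numpy as np

rng = np.random.default_rng(42)
# 假设记录了 1000 个训练样本在最近 5 个 epoch 的损失
loss_history = rng.gamma(shape=2.0, scale=0.5, size=(1000, 5))

variance = loss_history.var(axis=1)
threshold = np.quantile(variance, 0.9)        # 取方差最高的 10% 作为候选
hard_candidates = np.where(variance >= threshold)[0]
print(f"候选困难样本数量: {len(hard_candidates)},之后才交给 LLM 评估语义一致性")
```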

[AI-11] Hyperedge Modeling in Hypergraph Neural Networks by using Densest Overlapping Subgraphs

链接: https://arxiv.org/abs/2409.10340
作者: Mehrad Soltani,Luis Rueda
关键词-EN: Hypergraph Neural Networks, Graph Neural Networks, traditional Graph Neural, Neural Networks, tackle the limitations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Hypergraphs tackle the limitations of traditional graphs by introducing hyperedges. While graph edges connect only two nodes, hyperedges connect an arbitrary number of nodes along their edges. Also, the underlying message-passing mechanisms in Hypergraph Neural Networks (HGNNs) are in the form of vertex-hyperedge-vertex, which let HGNNs capture and utilize richer and more complex structural information than traditional Graph Neural Networks (GNNs). More recently, the idea of overlapping subgraphs has emerged. These subgraphs can capture more information about subgroups of vertices without limiting one vertex belonging to just one group, allowing vertices to belong to multiple groups or subgraphs. In addition, one of the most important problems in graph clustering is to find densest overlapping subgraphs (DOS). In this paper, we propose a solution to the DOS problem via Agglomerative Greedy Enumeration (DOSAGE) algorithm as a novel approach to enhance the process of generating the densest overlapping subgraphs and, hence, a robust construction of the hypergraphs. Experiments on standard benchmarks show that the DOSAGE algorithm significantly outperforms the HGNNs and six other methods on the node classification task.

[AI-12] The 20 questions game to distinguish large language models

链接: https://arxiv.org/abs/2409.10338
作者: Gurvan Richardeau,Erwan Le Merrer,Camilla Penzo,Gilles Tredan
关键词-EN: large language models, black-box context, large language, questions game, questions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In a parallel with the 20 questions game, we present a method to determine whether two large language models (LLMs), placed in a black-box context, are the same or not. The goal is to use a small set of (benign) binary questions, typically under 20. We formalize the problem and first establish a baseline using a random selection of questions from known benchmark datasets, achieving an accuracy of nearly 100% within 20 questions. After showing optimal bounds for this problem, we introduce two effective questioning heuristics able to discriminate 22 LLMs by using half as many questions for the same task. These methods offer significant advantages in terms of stealth and are thus of interest to auditors or copyright owners facing suspicions of model leaks.
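下面用一个极简的示意(问题与模型接口均为假设)说明“用少量二元问题为黑盒模型生成指纹并比较”的做法:

```python
# 极简示意:用一组是/否问题生成模型"指纹"并比较(问题与接口均为假设)
def fingerprint(ask, questions):
    """ask: 接受问题、返回 'yes'/'no' 文本的黑盒调用;返回由回答组成的比特串"""
    return "".join("1" if ask(q).strip().lower().startswith("y") else "0" for q in questions)

questions = [
    "Is the Eiffel Tower located in Berlin? Answer yes or no.",
    "Is 17 a prime number? Answer yes or no.",
    "Does water boil at 50 degrees Celsius at sea level? Answer yes or no.",
]

# 两个“模型”先用简单函数模拟,实际使用时换成对两个 LLM 端点的调用
model_a = lambda q: "no" if ("Berlin" in q or "50 degrees" in q) else "yes"
model_b = lambda q: "yes"

fp_a, fp_b = fingerprint(model_a, questions), fingerprint(model_b, questions)
print(fp_a, fp_b, "指纹相同?", fp_a == fp_b)
```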

[AI-13] InfoDisent: Explainability of Image Classification Models by Information Disentanglement

链接: https://arxiv.org/abs/2409.10329
作者: Łukasz Struski,Jacek Tabor
关键词-EN: critical area, area of research, methods, post-hoc methods, decisions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding the decisions made by image classification networks is a critical area of research in deep learning. This task is traditionally divided into two distinct approaches: post-hoc methods and intrinsic methods. Post-hoc methods, such as GradCam, aim to interpret the decisions of pre-trained models by identifying regions of the image where the network focuses its attention. However, these methods provide only a high-level overview, making it difficult to fully understand the network’s decision-making process. Conversely, intrinsic methods, like prototypical parts models, offer a more detailed understanding of network predictions but are constrained by specific architectures, training methods, and datasets. In this paper, we introduce InfoDisent, a hybrid model that combines the advantages of both approaches. By utilizing an information bottleneck, InfoDisent disentangles the information in the final layer of a pre-trained deep network, enabling the breakdown of classification decisions into basic, understandable atomic components. Unlike standard prototypical parts approaches, InfoDisent can interpret the decisions of pre-trained classification networks and be used for making classification decisions, similar to intrinsic models. We validate the effectiveness of InfoDisent on benchmark datasets such as ImageNet, CUB-200-2011, Stanford Cars, and Stanford Dogs for both convolutional and transformer backbones.

[AI-14] SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation

链接: https://arxiv.org/abs/2409.10320
作者: Benjamin Stoler,Ingrid Navarro,Jonathan Francis,Jean Oh
关键词-EN: Verification and validation, autonomous driving, systems and components, increasing importance, validation of autonomous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Verification and validation of autonomous driving (AD) systems and components is of increasing importance, as such technology increases in real-world prevalence. Safety-critical scenario generation is a key approach to robustify AD policies through closed-loop training. However, existing approaches for scenario generation rely on simplistic objectives, resulting in overly-aggressive or non-reactive adversarial behaviors. To generate diverse adversarial yet realistic scenarios, we propose SEAL, a scenario perturbation approach which leverages learned scoring functions and adversarial, human-like skills. SEAL-perturbed scenarios are more realistic than SOTA baselines, leading to improved ego task success across real-world, in-distribution, and out-of-distribution scenarios, of more than 20%. To facilitate future research, we release our code and tools: this https URL

[AI-15] Know your limits! Optimize the robots behavior through self-awareness

链接: https://arxiv.org/abs/2409.10308
作者: Esteve Valls Mascaro,Dongheui Lee
关键词-EN: humanoid robots transition, real-world environments, non-expert users, transition from labs, labs to real-world
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Accepted to Humanoids 2024 and HFR 2024. Project Page: this https URL

点击查看摘要

Abstract:As humanoid robots transition from labs to real-world environments, it is essential to democratize robot control for non-expert users. Recent human-robot imitation algorithms focus on following a reference human motion with high precision, but they are susceptible to the quality of the reference motion and require the human operator to simplify its movements to match the robot’s capabilities. Instead, we consider that the robot should understand and adapt the reference motion to its own abilities, facilitating the operator’s task. For that, we introduce a deep-learning model that anticipates the robot’s performance when imitating a given reference. Then, our system can generate multiple references given a high-level task command, assign a score to each of them, and select the best reference to achieve the desired robot behavior. Our Self-AWare model (SAW) ranks potential robot behaviors based on various criteria, such as fall likelihood, adherence to the reference motion, and smoothness. We integrate advanced motion generation, robot control, and SAW in one unique system, ensuring optimal robot behavior for any task command. For instance, SAW can anticipate falls with 99.29% accuracy. For more information check our project page: this https URL

[AI-16] How to do impactful research in artificial intelligence for chemistry and materials science

链接: https://arxiv.org/abs/2409.10304
作者: Austin Cheng,Cher Tian Ser,Marta Skreta,Andrés Guzmán-Cordero,Luca Thiede,Andreas Burger,Abdulrahman Aldossary,Shi Xuan Leong,Sergio Pablo-García,Felix Strieth-Kalthoff,Alán Aspuru-Guzik
关键词-EN: pervasively touching, Machine learning, Machine, learning, Abstract
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Machine learning has been pervasively touching many fields of science. Chemistry and materials science are no exception. While machine learning has been making a great impact, it is still not reaching its full potential or maturity. In this perspective, we first outline current applications across a diversity of problems in chemistry. Then, we discuss how machine learning researchers view and approach problems in the field. Finally, we provide our considerations for maximizing impact when researching machine learning for chemistry.

[AI-17] On Synthetic Texture Datasets: Challenges Creation and Curation

链接: https://arxiv.org/abs/2409.10297
作者: Blaine Hoak,Patrick McDaniel
关键词-EN: machine learning models, machine learning, texture, ongoing investigation, texture images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The influence of textures on machine learning models has been an ongoing investigation, specifically in texture bias/learning, interpretability, and robustness. However, due to the lack of large and diverse texture data available, the findings in these works have been limited, as more comprehensive evaluations have not been feasible. Image generative models are able to provide data creation at scale, but utilizing these models for texture synthesis has been unexplored and poses additional challenges both in creating accurate texture images and validating those images. In this work, we introduce an extensible methodology and corresponding new dataset for generating high-quality, diverse texture images capable of supporting a broad set of texture-based tasks. Our pipeline consists of: (1) developing prompts from a range of descriptors to serve as input to text-to-image models, (2) adopting and adapting Stable Diffusion pipelines to generate and filter the corresponding images, and (3) further filtering down to the highest quality images. Through this, we create the Prompted Textures Dataset (PTD), a dataset of 362,880 texture images that span 56 textures. During the process of generating images, we find that NSFW safety filters in image generation pipelines are highly sensitive to texture (and flag up to 60% of our texture images), uncovering a potential bias in these models and presenting unique challenges when working with texture data. Through both standard metrics and a human evaluation, we find that our dataset is high quality and diverse.

[AI-18] MGSA: Multi-granularity Graph Structure Attention for Knowledge Graph-to-Text Generation

链接: https://arxiv.org/abs/2409.10294
作者: Shanshan Wang,Chun Zhang,Ning Zhang
关键词-EN: convert structured knowledge, structured knowledge graphs, Generation task aims, human-readable natural language, knowledge graph structure
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Knowledge Graph-to-Text Generation task aims to convert structured knowledge graphs into coherent and human-readable natural language text. Recent efforts in this field have focused on enhancing pre-trained language models (PLMs) by incorporating graph structure information to capture the intricate structure details of knowledge graphs. However, most of these approaches tend to capture only single-granularity structure information, concentrating either on the relationships between entities within the original graph or on the relationships between words within the same entity or across different entities. This narrow focus results in a significant limitation: models that concentrate solely on entity-level structure fail to capture the nuanced semantic relationships between words, while those that focus only on word-level structure overlook the broader relationships between original entire entities. To overcome these limitations, this paper introduces the Multi-granularity Graph Structure Attention (MGSA), which is based on PLMs. The encoder of the model architecture features an entity-level structure encoding module, a word-level structure encoding module, and an aggregation module that synthesizes information from both structure. This multi-granularity structure encoding approach allows the model to simultaneously capture both entity-level and word-level structure information, providing a more comprehensive understanding of the knowledge graph’s structure information, thereby significantly improving the quality of the generated text. We conducted extensive evaluations of the MGSA model using two widely recognized KG-to-Text Generation benchmark datasets, WebNLG and EventNarrative, where it consistently outperformed models that rely solely on single-granularity structure information, demonstrating the effectiveness of our approach.

[AI-19] ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework

链接: https://arxiv.org/abs/2409.10289
作者: Jiahao Yuan,Zixiang Di,Zhiqing Cui,Guisong Yang,Usman Naseem
关键词-EN: foster meaningful interactions, Empathetic response generation, meaningful interactions, response generation necessitates, necessitates the integration
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiffu, a lightweight and comprehensive framework for empathetic response generation. This framework incorporates emotion contagion to augment emotional expressiveness and employs an emotion-reasoning mask to pinpoint critical emotional elements. Additionally, it integrates intent mimicry within reinforcement learning for refinement during diffusion. By harnessing an intent twice reflect the mechanism of Exploring-Sampling-Correcting, ReflectDiffu adeptly translates emotional decision-making into precise intent actions, thereby addressing empathetic response misalignments stemming from emotional misrecognition. Through reflection, the framework maps emotional states to intents, markedly enhancing both response empathy and flexibility. Comprehensive experiments reveal that ReflectDiffu outperforms existing models regarding relevance, controllability, and informativeness, achieving state-of-the-art results in both automatic and human evaluations.

[AI-20] DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis

链接: https://arxiv.org/abs/2409.10281
作者: Fa-Ting Hong,Yunfei Liu,Yu Li,Changyin Zhou,Fei Yu,Dan Xu
关键词-EN: Audio-driven talking head, talking head, generate lifelike video, talking head synthesis, head synthesis strives
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Audio-driven talking head synthesis strives to generate lifelike video portraits from provided audio. The diffusion model, recognized for its superior quality and robust generalization, has been explored for this task. However, establishing a robust correspondence between temporal audio cues and corresponding spatial facial expressions with diffusion models remains a significant challenge in talking head generation. To bridge this gap, we present DreamHead, a hierarchical diffusion framework that learns spatial-temporal correspondences in talking head synthesis without compromising the model’s intrinsic quality and adaptability. DreamHead learns to predict dense facial landmarks from audios as intermediate signals to model the spatial and temporal correspondences. Specifically, a first hierarchy of audio-to-landmark diffusion is first designed to predict temporally smooth and accurate landmark sequences given audio sequence signals. Then, a second hierarchy of landmark-to-image diffusion is further proposed to produce spatially consistent facial portrait videos, by modeling spatial correspondences between the dense facial landmark and appearance. Extensive experiments show that proposed DreamHead can effectively learn spatial-temporal consistency with the designed hierarchical diffusion and produce high-fidelity audio-driven talking head videos for multiple identities.

[AI-21] Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots

链接: https://arxiv.org/abs/2409.10277
作者: Hongming Zhang,Xiaoman Pan,Hongwei Wang,Kaixin Ma,Wenhao Yu,Dong Yu
关键词-EN: introduce Cognitive Kernel, Cognitive Kernel, goal of generalist, Kernel, generalist autopilots
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce Cognitive Kernel, an open-source agent system towards the goal of generalist autopilots. Unlike copilot systems, which primarily rely on users to provide essential state information (e.g., task descriptions) and assist users by answering questions or auto-completing contents, autopilot systems must complete tasks from start to finish independently, which requires the system to acquire the state information from the environments actively. To achieve this, an autopilot system should be capable of understanding user intents, actively gathering necessary information from various real-world sources, and making wise decisions. Cognitive Kernel adopts a model-centric design. In our implementation, the central policy model (a fine-tuned LLM) initiates interactions with the environment using a combination of atomic actions, such as opening files, clicking buttons, saving intermediate results to memory, or calling the LLM itself. This differs from the widely used environment-centric design, where a task-specific environment with predefined actions is fixed, and the policy model is limited to selecting the correct action from a given set of options. Our design facilitates seamless information flow across various sources and provides greater flexibility. We evaluate our system in three use cases: real-time information management, private information management, and long-term memory management. The results demonstrate that Cognitive Kernel achieves better or comparable performance to other closed-source systems in these scenarios. Cognitive Kernel is fully dockerized, ensuring everyone can deploy it privately and securely. We open-source the system and the backbone model to encourage further research on LLM-driven autopilot systems.

[AI-22] Causal Discovery in Recommender Systems: Example and Discussion RECSYS’24

链接: https://arxiv.org/abs/2409.10271
作者: Emanuele Cavenaghi,Fabio Stella,Markus Zanker
关键词-EN: receiving increasing attention, Causality is receiving, machine learning communities, receiving increasing, increasing attention
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Causality is receiving increasing attention by the artificial intelligence and machine learning communities. This paper gives an example of modelling a recommender system problem using causal graphs. Specifically, we approached the causal discovery task to learn a causal graph by combining observational data from an open-source dataset with prior knowledge. The resulting causal graph shows that only a few variables effectively influence the analysed feedback signals. This contrasts with the recent trend in the machine learning community to include more and more variables in massive models, such as neural networks.

[AI-23] Enhancing Personalized Recipe Recommendation Through Multi-Class Classification

链接: https://arxiv.org/abs/2409.10267
作者: Harish Neelam,Koushik Sai Veerella
关键词-EN: diverse culinary preferences, intends to address, address the challenge, realm of diverse, association analysis
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper intends to address the challenge of personalized recipe recommendation in the realm of diverse culinary preferences. The problem domain involves recipe recommendations, utilizing techniques such as association analysis and classification. Association analysis explores the relationships and connections between different ingredients to enhance the user experience. Meanwhile, the classification aspect involves categorizing recipes based on user-defined ingredients and preferences. A unique aspect of the paper is the consideration of recipes and ingredients belonging to multiple classes, recognizing the complexity of culinary combinations. This necessitates a sophisticated approach to classification and recommendation, ensuring the system accommodates the nature of recipe categorization. The paper seeks not only to recommend recipes but also to explore the process involved in achieving accurate and personalized recommendations.

[AI-24] Hedging Is Not All You Need: A Simple Baseline for Online Learning Under Haphazard Inputs

链接: https://arxiv.org/abs/2409.10242
作者: Himanshu Buckchash,Momojit Biswas,Rohit Agarwal,Dilip K. Prasad
关键词-EN: Handling haphazard streaming, haphazard streaming data, Handling haphazard, edge devices, haphazard streaming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Handling haphazard streaming data, such as data from edge devices, presents a challenging problem. Over time, the incoming data becomes inconsistent, with missing, faulty, or new inputs reappearing. Therefore, it requires models that are reliable. Recent methods to solve this problem depend on a hedging-based solution and require specialized elements like auxiliary dropouts, forked architectures, and intricate network design. We observed that hedging can be reduced to a special case of weighted residual connection; this motivated us to approximate it with plain self-attention. In this work, we propose HapNet, a simple baseline that is scalable, does not require online backpropagation, and is adaptable to varying input types. All present methods are restricted to scaling with a fixed window; however, we introduce a more complex problem of scaling with a variable window where the data becomes positionally uncorrelated, and cannot be addressed by present methods. We demonstrate that a variant of the proposed approach can work even for this complex scenario. We extensively evaluated the proposed approach on five benchmarks and found competitive performance.
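The abstract frames hedging as a special case of a weighted residual connection and approximates it with plain self-attention over whatever inputs happen to arrive. Below is a minimal numpy sketch of that idea, assuming each available feature is embedded as a token and missing inputs (NaNs) are simply dropped before attention; the embedding size, random parameters, and mean pooling are all assumptions, not HapNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # per-feature embedding size (assumption)

# One streaming sample with 5 possible inputs; NaN marks inputs absent at this step.
x = np.array([0.3, np.nan, 1.2, np.nan, -0.7])

# Hypothetical learned parameters: one embedding vector per input position,
# plus shared query/key/value projections.
E = rng.normal(size=(x.size, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

avail = ~np.isnan(x)                      # which inputs are present right now
tokens = x[avail, None] * E[avail]        # embed only the available inputs
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = Q @ K.T / np.sqrt(d)             # plain single-head self-attention
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

out = tokens + attn @ V                   # weighted residual connection
pooled = out.mean(axis=0)                 # fixed-size summary regardless of how
print(pooled.shape)                       # many inputs actually arrived: (8,)
```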

[AI-25] NEUSIS: A Compositional Neuro-Symbolic Framework for Autonomous Perception, Reasoning and Planning in Complex UAV Search Missions

链接: https://arxiv.org/abs/2409.10196
作者: Zhixi Cai,Cristian Rojas Cardenas,Kevin Leo,Chenyuan Zhang,Kal Backman,Hanbing Li,Boying Li,Mahsa Ghorbanali,Stavya Datta,Lizhen Qu,Julian Gutierrez Santiago,Alexey Ignatiev,Yuan-Fang Li,Mor Vered,Peter J Stuckey,Maria Garcia de la Banda,Hamid Rezatofighi
关键词-EN: locate specific Entities, Entities of Interest, specific Entities, time limit, descriptions in large
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of autonomous UAV search missions, where a UAV must locate specific Entities of Interest (EOIs) within a time limit, based on brief descriptions in large, hazard-prone environments with keep-out zones. The UAV must perceive, reason, and make decisions with limited and uncertain information. We propose NEUSIS, a compositional neuro-symbolic system designed for interpretable UAV search and navigation in realistic scenarios. NEUSIS integrates neuro-symbolic visual perception, reasoning, and grounding (GRiD) to process raw sensory inputs, maintains a probabilistic world model for environment representation, and uses a hierarchical planning component (SNaC) for efficient path planning. Experimental results from simulated urban search missions using AirSim and Unreal Engine show that NEUSIS outperforms a state-of-the-art (SOTA) vision-language model and a SOTA search planning model in success rate, search efficiency, and 3D localization. These results demonstrate the effectiveness of our compositional neuro-symbolic approach in handling complex, real-world scenarios, making it a promising solution for autonomous UAV systems in search missions.

[AI-26] Relative Positioning for Aerial Robot Path Planning in GPS Denied Environment

链接: https://arxiv.org/abs/2409.10193
作者: Farzad Sanati
关键词-EN: Unmanned Aerial Vehicles, called Unmanned Aerial, intelligent aerial robots, Aerial Vehicles, Unmanned Aerial
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 12 pages, 4 images

点击查看摘要

Abstract:One of the most useful applications of intelligent aerial robots, sometimes called Unmanned Aerial Vehicles (UAVs), in Australia is bushfire monitoring and prediction operations. A swarm of autonomous drones/UAVs programmed to work in real-time observing the fire parameters using their onboard sensors would be valuable in reducing the life-threatening impact of that fire. However, autonomous UAVs face serious challenges in their positioning and navigation in critical bushfire conditions such as remoteness and severe weather conditions where GPS signals could also be unreliable. This paper tackles one of the most important factors in autonomous UAV navigation, namely Initial Positioning, sometimes called Localisation. The solution provided by this paper will enable a team of autonomous UAVs to establish a relative position to their base of operation to be able to commence a team search and reconnaissance in a bushfire-affected area and find their way back to their base without the help of GPS signals.

[AI-27] Augmenting Automatic Speech Recognition Models with Disfluency Detection

链接: https://arxiv.org/abs/2409.10177
作者: Robin Amann,Zhaolin Li,Barbara Bruno,Jan Niehues
关键词-EN: disfluency commonly occurs, Automatic Speech Recognition, Speech disfluency commonly, disfluency commonly, commonly occurs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted by SLT2024

点击查看摘要

Abstract:Speech disfluency commonly occurs in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to accurately recognize these disfluencies because they are typically trained on fluent transcripts. Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech. Additionally, previous work often requires model fine-tuning and addresses limited types of disfluencies. In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing speech disfluencies. Next, this work proposes a modified Connectionist Temporal Classification (CTC)-based forced alignment algorithm from Kürzinger et al. (2020) to predict word-level timestamps while effectively capturing disfluent speech. Additionally, we develop a model to classify alignment gaps between timestamps as either containing disfluent speech or silence. This model achieves an accuracy of 81.62% and an F1-score of 80.07%. We test the augmentation pipeline of alignment gap detection and classification on a disfluent dataset. Our results show that we captured 74.13% of the words that were initially missed by the transcription, demonstrating the potential of this pipeline for downstream tasks.

[AI-28] jina-embeddings-v3: Multilingual Embeddings With Task LoRA

链接: https://arxiv.org/abs/2409.10173
作者: Saba Sturua,Isabelle Mohr,Mohammad Kalim Akram,Michael Günther,Bo Wang,Markus Krimmel,Feng Wang,Georgios Mastrapas,Andreas Koukounas,Nan Wang,Han Xiao
关键词-EN: supporting context lengths, million parameters, supporting context, Matryoshka Representation Learning, long-context retrieval tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 20 pages, pp11-13 references, pp14-20 appendix and experiment tables

点击查看摘要

Abstract:We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
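The Matryoshka property mentioned above means an embedding can be cut down to its leading dimensions and still behave sensibly. A minimal sketch of how such a truncation is typically used downstream is shown here; the vectors are random stand-ins and this is not the jina-embeddings-v3 API.

```python
import numpy as np

def truncate_embedding(v, dim):
    """Keep the first `dim` dimensions of a Matryoshka-style embedding and
    re-normalise, so cosine similarity stays meaningful at the smaller size."""
    t = np.asarray(v, dtype=float)[:dim]
    return t / np.linalg.norm(t)

rng = np.random.default_rng(0)
full_a, full_b = rng.normal(size=1024), rng.normal(size=1024)  # stand-in embeddings

for dim in (1024, 256, 64):
    a, b = truncate_embedding(full_a, dim), truncate_embedding(full_b, dim)
    print(dim, float(a @ b))  # cosine similarity at each truncation level
```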

[AI-29] Algorithmic Behaviors Across Regions: A Geolocation Audit of YouTube Search for COVID-19 Misinformation between the United States and South Africa

链接: https://arxiv.org/abs/2409.10168
作者: Hayoung Jung,Prerna Juneja,Tanushree Mitra
关键词-EN: health-related information online, finding health-related information, Global South, Global North contexts, Global North
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 28 pages. Under submission

点击查看摘要

Abstract:Despite being an integral tool for finding health-related information online, YouTube has faced criticism for disseminating COVID-19 misinformation globally to its users. Yet, prior audit studies have predominantly investigated YouTube within the Global North contexts, often overlooking the Global South. To address this gap, we conducted a comprehensive 10-day geolocation-based audit on YouTube to compare the prevalence of COVID-19 misinformation in search results between the United States (US) and South Africa (SA), the countries heavily affected by the pandemic in the Global North and the Global South, respectively. For each country, we selected 3 geolocations and placed sock-puppets, or bots emulating “real” users, that collected search results for 48 search queries sorted by 4 search filters for 10 days, yielding a dataset of 915K results. We found that 31.55% of the top-10 search results contained COVID-19 misinformation. Among the top-10 search results, bots in SA faced significantly more misinformative search results than their US counterparts. Overall, our study highlights the contrasting algorithmic behaviors of YouTube search between two countries, underscoring the need for the platform to regulate algorithmic behavior consistently across different regions of the Globe.

[AI-30] Quantile Regression for Distributional Reward Models in RLHF

链接: https://arxiv.org/abs/2409.10164
作者: Nicolai Dorka
关键词-EN: aligning large language, large language models, RLHF, aligning large, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at this https URL.
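Quantile regression of the kind described above is usually trained with the pinball (quantile) loss. The following numpy sketch shows that loss on toy numbers; the chosen quantile levels and reward values are assumptions, and the paper's actual objective may differ in detail.

```python
import numpy as np

def pinball_loss(pred, target, tau):
    """Quantile (pinball) loss: penalises under- and over-estimation
    asymmetrically so that `pred` is pushed toward the tau-quantile of target."""
    diff = target - pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

taus = np.array([0.1, 0.5, 0.9])       # quantile levels to estimate (assumption)
preds = np.array([-0.8, 0.1, 1.2])     # model's predicted reward quantiles
target = 0.3                           # observed preference-derived reward

loss = sum(pinball_loss(p, target, t) for p, t in zip(preds, taus))
print(loss)                            # total pinball loss over the three quantiles
```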

[AI-31] SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting

链接: https://arxiv.org/abs/2409.10161
作者: Mohammad Nomaan Qureshi,Sparsh Garg,Francisco Yandun,David Held,George Kantor,Abhishesh Silwal
关键词-EN: significant domain shift, RGB images, relying on RGB, manipulation policies relying, remains a critical
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sim2Real transfer, particularly for manipulation policies relying on RGB images, remains a critical challenge in robotics due to the significant domain shift between synthetic and real-world visual data. In this paper, we propose SplatSim, a novel framework that leverages Gaussian Splatting as the primary rendering primitive to reduce the Sim2Real gap for RGB-based manipulation policies. By replacing traditional mesh representations with Gaussian Splats in simulators, SplatSim produces highly photorealistic synthetic data while maintaining the scalability and cost-efficiency of simulation. We demonstrate the effectiveness of our framework by training manipulation policies within SplatSim and deploying them in the real world in a zero-shot manner, achieving an average success rate of 86.25%, compared to 97.5% for policies trained on real-world data.

[AI-32] AutoPET Challenge III: Testing the Robustness of Generalized Dice Focal Loss trained 3D Residual UNet for FDG and PSMA Lesion Segmentation from Whole-Body PET/CT Images

链接: https://arxiv.org/abs/2409.10151
作者: Shadab Ahamed
关键词-EN: Automated segmentation, quantitative image analysis, crucial first step, step in quantitative, Generalized Dice Focal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注: 11 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Automated segmentation of cancerous lesions in PET/CT scans is a crucial first step in quantitative image analysis. However, training deep learning models for segmentation with high accuracy is particularly challenging due to the variations in lesion size, shape, and radiotracer uptake. These lesions can appear in different parts of the body, often near healthy organs that also exhibit considerable uptake, making the task even more complex. As a result, creating an effective segmentation model for routine PET/CT image analysis is challenging. In this study, we utilized a 3D Residual UNet model and employed the Generalized Dice Focal Loss function to train the model on the AutoPET Challenge 2024 dataset. We conducted a 5-fold cross-validation and used an average ensembling technique using the models from the five folds. In the preliminary test phase for Task-1, the average ensemble achieved a mean Dice Similarity Coefficient (DSC) of 0.6687, mean false negative volume (FNV) of 10.9522 ml and mean false positive volume (FPV) 2.9684 ml. More details about the algorithm can be found on our GitHub repository: this https URL. The training code has been shared via the repository: this https URL.

[AI-33] LLMs4OL 2024 Overview: The 1st Large Language Models for Ontology Learning Challenge ISWC2024

链接: https://arxiv.org/abs/2409.10146
作者: Hamed Babaei Giglou,Jennifer D’Souza,Sören Auer
关键词-EN: Large Language Models, Ontology Learning Challenge, Large Language, Language Models, Ontology Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages, 1 figure, Will appear in “The 1st LLMs4OL Challenge @ ISWC 2024” proceedings

点击查看摘要

Abstract:This paper outlines the LLMs4OL 2024, the first edition of the Large Language Models for Ontology Learning Challenge. LLMs4OL is a community development initiative collocated with the 23rd International Semantic Web Conference (ISWC) to explore the potential of Large Language Models (LLMs) in Ontology Learning (OL), a vital process for enhancing the web with structured knowledge to improve interoperability. By leveraging LLMs, the challenge aims to advance understanding and innovation in OL, aligning with the goals of the Semantic Web to create a more intelligent and user-friendly web. In this paper, we give an overview of the 2024 edition of the LLMs4OL challenge and summarize the contributions.

[AI-34] Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

链接: https://arxiv.org/abs/2409.10139
作者: Djibril Sarr
关键词-EN: era of big, increasingly crucial, data quality, data, big data
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. Indeed, by leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability, enabling users to trust and comprehend the assessment process. Acknowledging the challenges associated with automating the data quality assessment process, particularly in terms of time efficiency and accuracy, we adopt a pragmatic strategy, employing resource-intensive algorithms only when necessary, while favoring simpler, more efficient solutions whenever possible. Through a practical analysis conducted on a publicly provided dataset, we illustrate the challenges that arise when trying to enhance data quality while keeping explainability. We demonstrate the effectiveness of our approach in detecting and rectifying missing values, duplicates and typographical errors as well as the challenges remaining to be addressed to achieve similar accuracy on statistical outliers and logic errors under the constraints set in our work.
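The framework targets three defect types: absence, redundancy, and incoherence. A minimal pandas sketch of the first two checks on a toy table is shown below; the column names and data are purely illustrative, and the real framework chooses per-column strategies and also handles typos, outliers, and logic errors.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", None],
    "age": [34, 29, 29, 41],
})

# Absence: count missing entries per column.
print(df.isna().sum())

# Redundancy: flag exact duplicate rows.
print(df[df.duplicated(keep="first")])

# A simple, explainable correction for both defects.
cleaned = df.drop_duplicates().dropna(subset=["name"])
print(cleaned)
```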

[AI-35] Advancing Towards a Marine Digital Twin Platform: Modeling the Mar Menor Coastal Lagoon Ecosystem in the South Western Mediterranean

链接: https://arxiv.org/abs/2409.10134
作者: Yu Ye,Aurora González-Vidal,Alejandro Cisterna-García,Angel Pérez-Ruzafa,Miguel A. Zamora Izquierdo,Antonio F. Skarmeta
关键词-EN: necessitating advanced monitoring, face mounting pressures, Mar Menor Coastal, ecosystems face mounting, Menor Coastal Lagoon
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Coastal marine ecosystems face mounting pressures from anthropogenic activities and climate change, necessitating advanced monitoring and modeling approaches for effective management. This paper pioneers the development of a Marine Digital Twin Platform aimed at modeling the Mar Menor Coastal Lagoon Ecosystem in the Region of Murcia. The platform leverages Artificial Intelligence to emulate complex hydrological and ecological models, facilitating the simulation of what-if scenarios to predict ecosystem responses to various stressors. We integrate diverse datasets from public sources to construct a comprehensive digital representation of the lagoon’s dynamics. The platform’s modular design enables real-time stakeholder engagement and informed decision-making in marine management. Our work contributes to the ongoing discourse on advancing marine science through innovative digital twin technologies.

[AI-36] StruEdit: Structured Outputs Enable the Fast and Accurate Knowledge Editing for Large Language Models

链接: https://arxiv.org/abs/2409.10132
作者: Baolong Bi,Shenghua Liu,Yiwei Wang,Lingrui Mei,Hongcheng Gao,Junfeng Fang,Xueqi Cheng
关键词-EN: large language models, question answering, modern tool, tool of choice, choice for question
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As the modern tool of choice for question answering, large language models (LLMs) are expected to deliver answers with up-to-date knowledge. To achieve such ideal question-answering systems, locating and then editing outdated knowledge in the natural language outputs is a general target of popular knowledge editing methods. However, this target is challenging, as both identifying which tokens to edit in the reasoning steps and ensuring the coherence of the revised reasoning chain are difficult tasks. We argue that these challenges stem from the unstructured nature of natural language outputs. To address the above challenges, we propose Structural Editing (StruEdit), an improved baseline for knowledge editing. We first prompt LLMs to produce structured outputs consisting of reasoning triplets. Then, StruEdit removes any potentially outdated knowledge and efficiently refills the structured outputs with up-to-date information in a single step. Experimental results show that StruEdit consistently delivers the highest accuracy with lowest latency compared with other knowledge editing methods.
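To illustrate the refill step on structured outputs, here is a minimal sketch where reasoning is kept as (subject, relation, object) triplets, the stale triplet is dropped, and a fresh one is inserted. The triplets and the exact-match update rule are illustrative; in StruEdit the triplets come from the LLM and the refilled facts from an updated knowledge source.

```python
# Reasoning kept as structured triplets rather than free-form text.
reasoning = [
    ("United Kingdom", "head of government", "Boris Johnson"),   # outdated
    ("United Kingdom", "capital", "London"),
]
# Up-to-date facts keyed by (subject, relation); contents are illustrative.
updates = {("United Kingdom", "head of government"): "Keir Starmer"}

def struct_edit(triplets, updates):
    """Replace the object of any triplet whose (subject, relation) has a newer fact."""
    edited = []
    for s, r, o in triplets:
        o = updates.get((s, r), o)   # refill with up-to-date object if known
        edited.append((s, r, o))
    return edited

print(struct_edit(reasoning, updates))
```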

[AI-37] Industry 6.0: New Generation of Industry driven by Generative AI and Swarm of Heterogeneous Robots

链接: https://arxiv.org/abs/2409.10106
作者: Artem Lykov,Miguel Altamirano Cabrera,Mikhail Konenkov,Valerii Serpiva,Koffivi Fidèle Gbagbe,Ali Alabbas,Aleksey Fedoseev,Luis Moreno,Muhammad Haris Khan,Ziang Guo,Dzmitry Tsetserukou
关键词-EN: natural language descriptions, user-provided natural language, concept of Industry, entire product design, Large Language Models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: submitted to IEEE conf

点击查看摘要

Abstract:This paper presents the concept of Industry 6.0, introducing the world’s first fully automated production system that autonomously handles the entire product design and manufacturing process based on user-provided natural language descriptions. By leveraging generative AI, the system automates critical aspects of production, including product blueprint design, component manufacturing, logistics, and assembly. A heterogeneous swarm of robots, each equipped with individual AI through integration with Large Language Models (LLMs), orchestrates the production process. The robotic system includes manipulator arms, delivery drones, and 3D printers capable of generating assembly blueprints. The system was evaluated using commercial and open-source LLMs, functioning through APIs and local deployment. A user study demonstrated that the system reduces the average production time to 119.10 minutes, significantly outperforming a team of expert human developers, who averaged 528.64 minutes (an improvement factor of 4.4). Furthermore, in the product blueprinting stage, the system surpassed human CAD operators by an unprecedented factor of 47, completing the task in 0.5 minutes compared to 23.5 minutes. This breakthrough represents a major leap towards fully autonomous manufacturing.

[AI-38] Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

链接: https://arxiv.org/abs/2409.10102
作者: Yujia Zhou,Yan Liu,Xiaoxi Li,Jiajie Jin,Hongjin Qian,Zheng Liu,Chaozhuo Li,Zhicheng Dou,Tsung-Yi Ho,Philip S. Yu
关键词-EN: Large Language Models, Large Language, RAG systems, Retrieval-Augmented Generation, development of Large
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). While much of the current research in this field focuses on performance optimization, particularly in terms of accuracy and efficiency, the trustworthiness of RAG systems remains an area still under exploration. From a positive perspective, RAG systems are promising to enhance LLMs by providing them with useful and up-to-date knowledge from vast external databases, thereby mitigating the long-standing problem of hallucination. While from a negative perspective, RAG systems are at the risk of generating undesirable contents if the retrieved information is either inappropriate or poorly utilized. To address these concerns, we propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we thoroughly review the existing literature on each dimension. Additionally, we create the evaluation benchmark regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Finally, we identify the potential challenges for future research based on our investigation results. Through this work, we aim to lay a structured foundation for future investigations and provide practical insights for enhancing the trustworthiness of RAG systems in real-world applications.

[AI-39] A Riemannian Approach to Ground Metric Learning for Optimal Transport

链接: https://arxiv.org/abs/2409.10085
作者: Pratik Jawanpuria,Dai Shi,Bamdev Mishra,Junbin Gao
关键词-EN: signal processing applications, Optimal transport, target data points, theory has attracted, processing applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimal transport (OT) theory has attracted much attention in machine learning and signal processing applications. OT defines a notion of distance between probability distributions of source and target data points. A crucial factor that influences OT-based distances is the ground metric of the embedding space in which the source and target data points lie. In this work, we propose to learn a suitable latent ground metric parameterized by a symmetric positive definite matrix. We use the rich Riemannian geometry of symmetric positive definite matrices to jointly learn the OT distance along with the ground metric. Empirical results illustrate the efficacy of the learned metric in OT-based domain adaptation.
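The abstract's key idea is a ground metric parameterized by a symmetric positive definite (SPD) matrix. A minimal numpy sketch of how such a Mahalanobis-style cost plugs into entropic OT (a few Sinkhorn iterations) is given below; the SPD matrix here is a fixed random stand-in rather than one learned jointly with the transport, and the entropic regulariser value is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))               # source points
Y = rng.normal(size=(5, 3))               # target points

A = rng.normal(size=(3, 3))
M = A @ A.T + 0.1 * np.eye(3)             # stand-in SPD ground-metric matrix

# Mahalanobis-style ground cost c(x, y) = (x - y)^T M (x - y).
diff = X[:, None, :] - Y[None, :, :]
C = np.einsum("ijk,kl,ijl->ij", diff, M, diff)

# Entropic OT via Sinkhorn iterations with uniform marginals.
a, b = np.full(len(X), 1 / len(X)), np.full(len(Y), 1 / len(Y))
K = np.exp(-C / 0.5)                      # regularisation strength is an assumption
u = np.ones_like(a)
for _ in range(200):
    v = b / (K.T @ u)
    u = a / (K @ v)
plan = u[:, None] * K * v[None, :]
print(float((plan * C).sum()))            # OT cost under the SPD-parameterised metric
```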

[AI-40] DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion

链接: https://arxiv.org/abs/2409.10080
作者: Yuchen Guo,Ruoxiang Xu,Rongcheng Li,Zhenghao Wu,Weifeng Su
关键词-EN: integrate complementary data, complementary data information, Multi-modality image fusion, Multi-modality image, image fusion aims
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-modality image fusion aims to integrate complementary data information from different imaging modalities into a single image. Existing methods often generate either blurry fused images that lose fine-grained semantic information or unnatural fused images that appear perceptually cropped from the inputs. In this work, we propose a novel two-phase discriminative autoencoder framework, termed DAE-Fuse, that generates sharp and natural fused images. In the adversarial feature extraction phase, we introduce two discriminative blocks into the encoder-decoder architecture, providing an additional adversarial loss to better guide feature extraction by reconstructing the source images. While the two discriminative blocks are adapted in the attention-guided cross-modality fusion phase to distinguish the structural differences between the fused output and the source inputs, injecting more naturalness into the results. Extensive experiments on public infrared-visible, medical image fusion, and downstream object detection datasets demonstrate our method’s superiority and generalizability in both quantitative and qualitative evaluations.

[AI-41] LLM-DER: A Named Entity Recognition Method Based on Large Language Models for Chinese Coal Chemical Domain

链接: https://arxiv.org/abs/2409.10077
作者: Le Xiao,Yunfei Xu,Jing Zhao
关键词-EN: Named Entity Recognition, Domain-specific Named Entity, domain-specific entity recognition, Entity Recognition, domain knowledge graphs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Domain-specific Named Entity Recognition (NER), whose goal is to recognize domain-specific entities and their categories, provides an important support for constructing domain knowledge graphs. Currently, deep learning-based methods are widely used and effective in NER tasks, but they rely on large-scale labeled data, so the scarcity of labeled data in a specific domain limits their application. Therefore, many studies have introduced few-shot methods and achieved some results. However, the entity structures in specific domains are often complex, and current few-shot methods are difficult to adapt to NER tasks with complex features. Taking the Chinese coal chemical industry domain as an example, there exists a complex structure of multiple entities sharing a single entity, as well as multiple relationships for the same pair of entities, which affects the NER task in the few-shot setting. In this paper, we propose a Large Language Models (LLMs)-based entity recognition framework LLM-DER for the domain-specific entity recognition problem in Chinese, which enriches the entity information by generating a list of relationships containing entity types through LLMs and designs a plausibility and consistency evaluation method to remove misrecognized entities, effectively solving the complex structural entity recognition problem in a specific domain. The experimental results on the Resume dataset and the self-constructed coal chemical dataset Coal show that LLM-DER performs outstandingly in domain-specific entity recognition, not only outperforming the existing GPT-3.5-turbo baseline but also exceeding the fully-supervised baseline, verifying its effectiveness in entity recognition.

[AI-42] Increasing faithfulness in human-human dialog summarization with Spoken Language Understanding tasks

链接: https://arxiv.org/abs/2409.10070
作者: Eunice Akani,Benoit Favre,Frederic Bechet,Romain Gemignani
关键词-EN: aims to provide, provide a concise, concise and coherent, conversations between multiple, multiple speakers
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dialogue summarization aims to provide a concise and coherent summary of conversations between multiple speakers. While recent advancements in language models have enhanced this process, summarizing dialogues accurately and faithfully remains challenging due to the need to understand speaker interactions and capture relevant information. Indeed, abstractive models used for dialog summarization may generate summaries that contain inconsistencies. We suggest using the semantic information proposed for performing Spoken Language Understanding (SLU) in human-machine dialogue systems for goal-oriented human-human dialogues to obtain a more semantically faithful summary regarding the task. This study introduces three key contributions: First, we propose an exploration of how incorporating task-related information can enhance the summarization process, leading to more semantically accurate summaries. Then, we introduce a new evaluation criterion based on task semantics. Finally, we propose a new dataset version with increased annotated data standardized for research on task-oriented dialogue summarization. The study evaluates these methods using the DECODA corpus, a collection of French spoken dialogues from a call center. Results show that integrating models with task-related information improves summary accuracy, even with varying word error rates.

[AI-43] Enhancing Anomaly Detection via Generating Diversified and Hard-to-distinguish Synthetic Anomalies CIKM2024

链接: https://arxiv.org/abs/2409.10069
作者: Hyuntae Kim,Changhee Lee
关键词-EN: identify unseen anomalies, Unsupervised anomaly detection, Unsupervised anomaly, daunting task, normal samples
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at CIKM 2024

点击查看摘要

Abstract:Unsupervised anomaly detection is a daunting task, as it relies solely on normality patterns from the training data to identify unseen anomalies during testing. Recent approaches have focused on leveraging domain-specific transformations or perturbations to generate synthetic anomalies from normal samples. The objective here is to acquire insights into normality patterns by learning to differentiate between normal samples and these crafted anomalies. However, these approaches often encounter limitations when domain-specific transformations are not well-specified such as in tabular data, or when it becomes trivial to distinguish between them. To address these issues, we introduce a novel domain-agnostic method that employs a set of conditional perturbators and a discriminator. The perturbators are trained to generate input-dependent perturbations, which are subsequently utilized to construct synthetic anomalies, and the discriminator is trained to distinguish normal samples from them. We ensure that the generated anomalies are both diverse and hard to distinguish through two key strategies: i) directing perturbations to be orthogonal to each other and ii) constraining perturbations to remain in proximity to normal samples. Throughout experiments on real-world datasets, we demonstrate the superiority of our method over state-of-the-art benchmarks, which is evident not only in image data but also in tabular data, where domain-specific transformation is not readily accessible. Additionally, we empirically confirm the adaptability of our method to semi-supervised settings, demonstrating its capacity to incorporate supervised signals to enhance anomaly detection performance even further.
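The two constraints named above, mutual orthogonality of perturbations and proximity to normal samples, can be written as simple penalty terms. The numpy sketch below shows those two regularisers on toy vectors; the shapes, weights, and the additive way the synthetic anomalies are formed are assumptions, and the actual perturbators and discriminator are neural networks trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)                          # one normal sample (feature vector)
perturbations = rng.normal(size=(3, 16)) * 0.1   # outputs of 3 conditional perturbators

# (i) Diversity: push perturbation directions toward mutual orthogonality.
P = perturbations / np.linalg.norm(perturbations, axis=1, keepdims=True)
gram = P @ P.T
orth_penalty = np.sum((gram - np.eye(len(P))) ** 2)

# (ii) Proximity: keep synthetic anomalies x + delta close to the normal sample.
prox_penalty = np.mean(np.sum(perturbations ** 2, axis=1))

loss_reg = 1.0 * orth_penalty + 0.1 * prox_penalty   # weights are assumptions
synthetic_anomalies = x + perturbations              # fed to the discriminator
print(loss_reg, synthetic_anomalies.shape)
```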

[AI-44] MindGuard: Towards Accessible and Stigma-free Mental Health First Aid via Edge LLM

链接: https://arxiv.org/abs/2409.10064
作者: Sijie Ji,Xinzhe Zheng,Jiawei Sun,Renqi Chen,Wei Gao,Mani Srivastava
关键词-EN: prevalent diseases worldwide, diseases worldwide, prevalent diseases, Mental health disorders, Mental health
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Mental health disorders are among the most prevalent diseases worldwide, affecting nearly one in four people. Despite their widespread impact, the intervention rate remains below 25%, largely due to the significant cooperation required from patients for both diagnosis and intervention. The core issue behind this low treatment rate is stigma, which discourages over half of those affected from seeking help. This paper presents MindGuard, an accessible, stigma-free, and professional mobile mental healthcare system designed to provide mental health first aid. The heart of MindGuard is an innovative edge LLM, equipped with professional mental health knowledge, that seamlessly integrates objective mobile sensor data with subjective Ecological Momentary Assessment records to deliver personalized screening and intervention conversations. We conduct a broad evaluation of MindGuard using open datasets spanning four years and real-world deployment across various mobile devices involving 20 subjects for two weeks. Remarkably, MindGuard achieves results comparable to GPT-4 and outperforms its counterpart with more than 10 times the model size. We believe that MindGuard paves the way for mobile LLM applications, potentially revolutionizing mental healthcare practices by substituting self-reporting and intervention conversations with passive, integrated monitoring within daily life, thus ensuring accessible and stigma-free mental health support.

[AI-45] GlobalMapNet: An Online Framework for Vectorized Global HD Map Construction

链接: https://arxiv.org/abs/2409.10063
作者: Anqi Shi,Yuze Cai,Xiangyu Chen,Jian Pu,Zeyu Fu,Hong Lu
关键词-EN: autonomous driving systems, driving systems, essential for autonomous, autonomous driving, map
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:High-definition (HD) maps are essential for autonomous driving systems. Traditionally, an expensive and labor-intensive pipeline is implemented to construct HD maps, which is limited in scalability. In recent years, crowdsourcing and online mapping have emerged as two alternative methods, but they have limitations respectively. In this paper, we provide a novel methodology, namely global map construction, to perform direct generation of vectorized global maps, combining the benefits of crowdsourcing and online mapping. We introduce GlobalMapNet, the first online framework for vectorized global HD map construction, which updates and utilizes a global map on the ego vehicle. To generate the global map from scratch, we propose GlobalMapBuilder to match and merge local maps continuously. We design a new algorithm, Map NMS, to remove duplicate map elements and produce a clean map. We also propose GlobalMapFusion to aggregate historical map information, improving consistency of prediction. We examine GlobalMapNet on two widely recognized datasets, Argoverse2 and nuScenes, showing that our framework is capable of generating globally consistent results.
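Map NMS, mentioned above, removes duplicate vectorized map elements. Below is a minimal sketch of a greedy NMS over polylines that uses a symmetric Chamfer distance as the overlap measure; the distance threshold and the use of Chamfer distance are assumptions for illustration, not necessarily the paper's exact criterion.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between two polylines (N x 2 and M x 2 arrays)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def map_nms(elements, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring element, drop near-duplicates."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(chamfer(elements[i], elements[j]) > thresh for j in keep):
            keep.append(i)
    return keep

lane_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
lane_a_dup = lane_a + 0.05      # near-duplicate detection of the same lane
lane_b = np.array([[0.0, 3.0], [1.0, 3.0], [2.0, 3.0]])

print(map_nms([lane_a, lane_a_dup, lane_b], scores=np.array([0.9, 0.8, 0.7])))
# -> keeps the first lane and the distinct one, drops the near-duplicate
```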

[AI-46] Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments ICASSP2025

链接: https://arxiv.org/abs/2409.10048
作者: Wessel Ledder,Yuzhen Qin,Kiki van der Heijden
关键词-EN: audio signal processing, deep reinforcement learning, audio-driven DRL, DRL, audio-driven DRL framework
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: submitted to ICASSP 2025

点击查看摘要

Abstract:Although deep reinforcement learning (DRL) approaches in audio signal processing have seen substantial progress in recent years, audio-driven DRL for tasks such as navigation, gaze control and head-orientation control in the context of human-robot interaction have received little attention. Here, we propose an audio-driven DRL framework in which we utilise deep Q-learning to develop an autonomous agent that orients towards a talker in the acoustic environment based on stereo speech recordings. Our results show that the agent learned to perform the task at a near perfect level when trained on speech segments in anechoic environments (that is, without reverberation). The presence of reverberation in naturalistic acoustic environments affected the agent’s performance, although the agent still substantially outperformed a baseline, randomly acting agent. Finally, we quantified the degree of generalization of the proposed DRL approach across naturalistic acoustic environments. Our experiments revealed that policies learned by agents trained on medium or high reverb environments generalized to low reverb environments, but policies learned by agents trained on anechoic or low reverb environments did not generalize to medium or high reverb environments. Taken together, this study demonstrates the potential of audio-driven DRL for tasks such as head-orientation control and highlights the need for training strategies that enable robust generalization across environments for real-world audio-driven DRL applications.

[AI-47] On the Diagram of Thought

链接: https://arxiv.org/abs/2409.10038
作者: Yifan Zhang,Yang Yuan,Andrew Chi-Chih Yao
关键词-EN: directed acyclic graph, acyclic graph, cohesive DAG structure, Diagram of Thought, directed acyclic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Diagram of Thought (DoT), a framework that models iterative reasoning in large language models (LLMs) as the construction of a directed acyclic graph (DAG) within a single model. Unlike traditional approaches that represent reasoning as linear chains or trees, DoT organizes propositions, critiques, refinements, and verifications into a cohesive DAG structure, allowing the model to explore complex reasoning pathways while maintaining logical consistency. Each node in the diagram corresponds to a proposition that has been proposed, critiqued, refined, or verified, enabling the LLM to iteratively improve its reasoning through natural language feedback. By leveraging auto-regressive next-token prediction with role-specific tokens, DoT facilitates seamless transitions between proposing ideas and critically evaluating them, providing richer feedback than binary signals. Furthermore, we formalize the DoT framework using Topos Theory, providing a mathematical foundation that ensures logical consistency and soundness in the reasoning process. This approach enhances both the training and inference processes within a single LLM, eliminating the need for multiple models or external control mechanisms. DoT offers a conceptual framework for designing next-generation reasoning-specialized models, emphasizing training efficiency, robust reasoning capabilities, and theoretical grounding. The code is available at this https URL.
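To make the DAG structure concrete, here is a minimal sketch of propositions, critiques, refinements, and verifications stored as role-tagged nodes with forward edges, walked in topological order. The node texts are invented examples; in DoT the nodes are produced by the LLM itself via role-specific tokens.

```python
# Minimal sketch of a Diagram-of-Thought-style DAG: each node is a statement
# tagged with a role, and edges point to the statements that build on it.

nodes = {
    "p1": ("propose",  "The sum of two odd numbers is odd."),
    "c1": ("critique", "Check 3 + 5 = 8, which is even, so p1 looks wrong."),
    "p2": ("refine",   "The sum of two odd numbers is even."),
    "v1": ("verify",   "(2a+1) + (2b+1) = 2(a+b+1), always even."),
}
edges = {"p1": ["c1"], "c1": ["p2"], "p2": ["v1"], "v1": []}

def topological_order(edges):
    """Return nodes so every edge goes forward; valid because the graph is a DAG."""
    indeg = {n: 0 for n in edges}
    for outs in edges.values():
        for m in outs:
            indeg[m] += 1
    order, frontier = [], [n for n, d in indeg.items() if d == 0]
    while frontier:
        n = frontier.pop()
        order.append(n)
        for m in edges[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                frontier.append(m)
    return order

for n in topological_order(edges):
    role, text = nodes[n]
    print(f"[{role}] {text}")
```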

[AI-48] Can GPT-O1 Kill All Bugs?

链接: https://arxiv.org/abs/2409.10033
作者: Haichuan Hu,Ye Shang,Guolin Xu,Congqing He,Quanjun Zhang
关键词-EN: automatic program repair, ChatGPT, long been proven, effective in automatic, automatic program
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:ChatGPT has long been proven to be effective in automatic program repair (APR). With the continuous iterations and upgrades of the ChatGPT version, its performance in terms of fixes has already reached state-of-the-art levels. However, there are few works comparing the effectiveness and variations of different versions of ChatGPT on APR. In this work, we evaluate the performance of the latest version of ChatGPT (O1-preview and O1-mini), ChatGPT-4o, and historical version of ChatGPT on APR. We study the improvements of the O1 model over traditional ChatGPT in terms of APR from multiple perspectives (repair success rate, repair cost, behavior patterns), and find that O1’s repair capability exceeds that of traditional ChatGPT, successfully fixing all 40 bugs in the benchmark. Our work can serve as a reference for further in-depth exploration of the applications of ChatGPT in APR.

[AI-49] AttnMod: Attention-Based New Art Styles

链接: https://arxiv.org/abs/2409.10028
作者: Shih-Chieh Su
关键词-EN: Imagine a human, hoping to create, create a painting, human artist, generated photo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Imagine a human artist looking at the generated photo of a diffusion model, and hoping to create a painting out of it. There could be some feature of the object in the photo that the artist wants to emphasize, some color to disperse, some silhouette to twist, or some part of the scene to be materialized. These intentions can be viewed as the modification of the cross attention from the text prompt onto UNet, during the denoising diffusion. This work presents AttnMod, which modifies attention to create new, unpromptable art styles out of existing diffusion models. The style-creating behavior is studied across different setups.

[AI-50] E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models

链接: https://arxiv.org/abs/2409.10027
作者: Chan Kim,Keonwoo Kim,Mintaek Oh,Hanbi Baek,Jiyang Lee,Donghwi Jung,Soojin Woo,Younkyung Woo,John Tucker,Roya Firoozi,Seung-Woo Seo,Mac Schwager,Seong-Woo Kim
关键词-EN: Large language models, execute language instructions, shown significant potential, Large language, guiding embodied agents
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 19 pages, 28 figures. Project page: this https URL

点击查看摘要

Abstract:Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent’s own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs’ general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent’s real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments by updating the E2Map based on the agent’s experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance in stochastic environments compared to existing LLM-based approaches. Code and supplementary materials are available at this https URL.

[AI-51] AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing

链接: https://arxiv.org/abs/2409.10016
作者: Huawei Ji,Cheng Deng,Bo Xue,Zhouyang Jin,Jiaxin Ding,Xiaoying Gan,Luoyi Fu,Xinbing Wang,Chenghu Zhou
关键词-EN: improving data quality, data quality, Academic literature, development of data-centric, focus has shifted
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 5 pages, 3 figures, 3 tables

点击查看摘要

Abstract:With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, as one of the crucial types, is predominantly stored in PDF formats and needs to be parsed into texts before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at this https URL.

[AI-52] HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making

链接: https://arxiv.org/abs/2409.10011
作者: Sumera Anjum,Hanzhi Zhang,Wenjun Zhou,Eun Jin Paek,Xiaopeng Zhao,Yunhe Feng
关键词-EN: Large language models, language processing tasks, advanced natural language, natural language processing, significantly advanced natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) have significantly advanced natural language processing tasks, yet they are susceptible to generating inaccurate or unreliable responses, a phenomenon known as hallucination. In critical domains such as health and medicine, these hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medical question-answering (QA) systems by focusing on the detection and mitigation of hallucinations. Our approach generates multiple variations of a given query using LLMs and retrieves relevant information from external open knowledge bases to enrich the context. We utilize maximum marginal relevance scoring to prioritize the retrieved context, which is then provided to LLMs for answer generation, thereby reducing the risk of hallucinations. The integration of LangChain further streamlines this process, resulting in a notable and robust increase in the accuracy of both open-source and commercial LLMs, such as Llama-3.1 (from 44% to 65%) and ChatGPT (from 56% to 70%). This framework underscores the critical importance of addressing hallucinations in medical QA systems, ultimately improving clinical decision-making and patient care. The open-source HALO is available at: this https URL.
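
Maximum marginal relevance is the one concretely named scoring component; the sketch below is a generic MMR ranker over precomputed embeddings, not HALO's full pipeline (query variation with LLMs, LangChain orchestration and answer generation are omitted, and the embeddings are hypothetical).

```python
import numpy as np

def mmr_rank(query_vec, doc_vecs, k=3, lam=0.7):
    """Maximal Marginal Relevance: greedily pick passages that are relevant to the
    query while penalizing redundancy with passages already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates, selected = list(range(len(doc_vecs))), []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical embeddings: 5 retrieved passages, 8-dimensional vectors.
rng = np.random.default_rng(0)
docs, query = rng.normal(size=(5, 8)), rng.normal(size=8)
print(mmr_rank(query, docs, k=3))
```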

[AI-53] SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL

链接: https://arxiv.org/abs/2409.10007
作者: Ke Shen,Mayank Kejriwal
关键词-EN: data management research, formal SQL queries, natural language processing, converting questions posed, automatically converting questions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Text-to-SQL, the problem of automatically converting questions posed in natural language to formal SQL queries, has emerged as an important problem at the intersection of natural language processing and data management research. Large language models (LLMs) have delivered impressive performance when used in an off-the-shelf fashion, but still fall significantly short of expected expert-level performance. Errors are especially probable when a nuanced understanding is needed of database schemas, questions, and SQL clauses to do proper Text-to-SQL conversion. We introduce SelECT-SQL, a novel in-context learning solution that uses an algorithmic combination of chain-of-thought (CoT) prompting, self-correction, and ensemble methods to yield a new state-of-the-art result on challenging Text-to-SQL benchmarks. Specifically, when configured using GPT-3.5-Turbo as the base LLM, SelECT-SQL achieves 84.2% execution accuracy on the Spider leaderboard’s development set, exceeding both the best results of other baseline GPT-3.5-Turbo-based solutions (81.1%), and the peak performance (83.5%) of the GPT-4 result reported on the leaderboard.
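
As a minimal sketch of the ensemble part only: sample several chain-of-thought generations and majority-vote over lightly normalized SQL strings. `generate_candidate` and `fake_llm` are hypothetical placeholders, and the paper's self-correction step is not reproduced.

```python
from collections import Counter

def normalize_sql(sql: str) -> str:
    """Crude canonicalization so trivially different candidates vote together."""
    return " ".join(sql.lower().replace(";", " ").split())

def ensemble_sql(question: str, schema: str, generate_candidate, n: int = 5) -> str:
    """Sample several generations and return the majority-vote query.
    `generate_candidate(question, schema)` is a hypothetical callable wrapping the LLM."""
    votes = Counter(normalize_sql(generate_candidate(question, schema)) for _ in range(n))
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for an LLM call, for demonstration only.
def fake_llm(question, schema):
    return "SELECT name FROM singer WHERE age > 30;"

print(ensemble_sql("Which singers are older than 30?", "singer(name, age)", fake_llm))
```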

[AI-54] FreeMark: A Non-Invasive White-Box Watermarking for Deep Neural Networks

链接: https://arxiv.org/abs/2409.09996
作者: Yuzhang Chen,Jiangnan Zhu,Yujie Gu,Minoru Kuribayashi,Kouichi Sakurai
关键词-EN: Deep neural networks, achieved significant success, Deep neural, neural networks, real-world applications
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have achieved significant success in real-world applications. However, safeguarding their intellectual property (IP) remains extremely challenging. Existing DNN watermarking for IP protection often requires modifying DNN models, which reduces model performance and limits their practicality. This paper introduces FreeMark, a novel DNN watermarking framework that leverages cryptographic principles without altering the original host DNN model, thereby avoiding any reduction in model performance. Unlike traditional DNN watermarking methods, FreeMark innovatively generates secret keys from a pre-generated watermark vector and the host model using gradient descent. These secret keys, used to extract the watermark from the model’s activation values, are securely stored with a trusted third party, enabling reliable watermark extraction from suspect models. Extensive experiments demonstrate that FreeMark effectively resists various watermark removal attacks while maintaining high watermark capacity.
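
The abstract only outlines the mechanism, so the sketch below illustrates just the verification side under assumed shapes: a secret key matrix projects activations to sign bits that are compared against the registered watermark. The paper's key generation via gradient descent is not reproduced.

```python
import numpy as np

def extract_watermark(activations, secret_key):
    """Project hidden activations with a secret key matrix and read off sign bits."""
    return (activations @ secret_key > 0).astype(int)

def bit_error_rate(extracted, watermark):
    """Fraction of mismatched bits; low values support an ownership claim."""
    return float(np.mean(extracted != watermark))

rng = np.random.default_rng(0)
acts = rng.normal(size=256)           # activations from a chosen layer (assumed size)
key = rng.normal(size=(256, 64))      # secret key held by a trusted third party
wm = extract_watermark(acts, key)     # bits registered at watermarking time
print(bit_error_rate(extract_watermark(acts, key), wm))  # 0.0 for the unmodified model
```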

[AI-55] Comprehensive Study on Sentiment Analysis: From Rule-based to modern LLM based system

链接: https://arxiv.org/abs/2409.09989
作者: Shailja Gupta,Rajesh Ranjan,Surya Narayan Singh
关键词-EN: large language models, sentiment analysis, artificial intelligence, large language, deep learning models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: 2 Images

点击查看摘要

Abstract:This paper provides a comprehensive survey of sentiment analysis within the context of artificial intelligence (AI) and large language models (LLMs). Sentiment analysis, a critical aspect of natural language processing (NLP), has evolved significantly from traditional rule-based methods to advanced deep learning techniques. This study examines the historical development of sentiment analysis, highlighting the transition from lexicon-based and pattern-based approaches to more sophisticated machine learning and deep learning models. Key challenges are discussed, including handling bilingual texts, detecting sarcasm, and addressing biases. The paper reviews state-of-the-art approaches, identifies emerging trends, and outlines future research directions to advance the field. By synthesizing current methodologies and exploring future opportunities, this survey aims to understand sentiment analysis in the AI and LLM context thoroughly.

[AI-56] Artificial Intelligence-Based Opportunistic Coronary Calcium Screening in the Veterans Affairs National Healthcare System

链接: https://arxiv.org/abs/2409.09968
作者: Raffi Hagopian,Timothy Strebel,Simon Bernatz,Gregory A Myers,Erik Offerman,Eric Zuniga,Cy Y Kim,Angie T Ng,James A Iwaz,Sunny P Singh,Evan P Carey,Michael J Kim,R Spencer Schaefer,Jeannie Yu,Amilcare Gentili,Hugo JWL Aerts
关键词-EN: Coronary artery calcium, Coronary artery, CAC, artery calcium, cardiovascular events
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Coronary artery calcium (CAC) is highly predictive of cardiovascular events. While millions of chest CT scans are performed annually in the United States, CAC is not routinely quantified from scans done for non-cardiac purposes. A deep learning algorithm was developed using 446 expert segmentations to automatically quantify CAC on non-contrast, non-gated CT scans (AI-CAC). Our study differs from prior works as we leverage imaging data across the Veterans Affairs national healthcare system, from 98 medical centers, capturing extensive heterogeneity in imaging protocols, scanners, and patients. AI-CAC performance on non-gated scans was compared against clinical standard ECG-gated CAC scoring. Non-gated AI-CAC differentiated zero vs. non-zero and less than 100 vs. 100 or greater Agatston scores with accuracies of 89.4% (F1 0.93) and 87.3% (F1 0.89), respectively, in 795 patients with paired gated scans within a year of a non-gated CT scan. Non-gated AI-CAC was predictive of 10-year all-cause mortality (CAC 0 vs. >400 group: 25.4% vs. 60.2%, Cox HR 3.49, p<0.005), and composite first-time stroke, MI, or death (CAC 0 vs. >400 group: 33.5% vs. 63.8%, Cox HR 3.00, p<0.005). In a screening dataset of 8,052 patients with low-dose lung cancer-screening CTs (LDCT), 3,091/8,052 (38.4%) individuals had AI-CAC >400. Four cardiologists qualitatively reviewed LDCT images from a random sample of 400 AI-CAC patients and verified that 527/531 (99.2%) would benefit from lipid-lowering therapy. To the best of our knowledge, this is the first non-gated CT CAC algorithm developed across a national healthcare system, on multiple imaging protocols, without filtering intra-cardiac hardware, and compared against a strong gated CT reference. We report superior performance relative to previous CAC algorithms evaluated against paired gated scans that included patients with intra-cardiac hardware.

[AI-57] An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning

链接: https://arxiv.org/abs/2409.09958
作者: Qian Lin,Zongkai Liu,Danying Mo,Chao Yu
关键词-EN: multi-objective reinforcement learning, balance multiple objectives, recent years, significant progress, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, significant progress has been made in multi-objective reinforcement learning (RL) research, which aims to balance multiple objectives by incorporating preferences for each objective. In most existing studies, specific preferences must be provided during deployment to indicate the desired policies explicitly. However, designing these preferences depends heavily on human prior knowledge, which is typically obtained through extensive observation of high-performing demonstrations with expected behaviors. In this work, we propose a simple yet effective offline adaptation framework for multi-objective RL problems without assuming handcrafted target preferences, but only given several demonstrations to implicitly indicate the preferences of expected policies. Additionally, we demonstrate that our framework can naturally be extended to meet constraints on safety-critical objectives by utilizing safe demonstrations, even when the safety thresholds are unknown. Empirical results on offline multi-objective and safe tasks demonstrate the capability of our framework to infer policies that align with real preferences while meeting the constraints implied by the provided demonstrations.

[AI-58] Deep Graph Anomaly Detection: A Survey and New Perspectives

链接: https://arxiv.org/abs/2409.09957
作者: Hezhe Qiao,Hanghang Tong,Bo An,Irwin King,Charu Aggarwal,Guansong Pang
关键词-EN: attracted increasing attention, recent years due, unusual graph instances, identify unusual graph, GAD
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages, 6 figures, and 7 tables

点击查看摘要

Abstract:Graph anomaly detection (GAD), which aims to identify unusual graph instances (nodes, edges, subgraphs, or graphs), has attracted increasing attention in recent years due to its significance in a wide range of applications. Deep learning approaches, graph neural networks (GNNs) in particular, have been emerging as a promising paradigm for GAD, owing to its strong capability in capturing complex structure and/or node attributes in graph data. Considering the large number of methods proposed for GNN-based GAD, it is of paramount importance to summarize the methodologies and findings in the existing GAD studies, so that we can pinpoint effective model designs for tackling open GAD problems. To this end, in this work we aim to present a comprehensive review of deep learning approaches for GAD. Existing GAD surveys are focused on task-specific discussions, making it difficult to understand the technical insights of existing methods and their limitations in addressing some unique challenges in GAD. To fill this gap, we first discuss the problem complexities and their resulting challenges in GAD, and then provide a systematic review of current deep GAD methods from three novel perspectives of methodology, including GNN backbone design, proxy task design for GAD, and graph anomaly measures. To deepen the discussions, we further propose a taxonomy of 13 fine-grained method categories under these three perspectives to provide more in-depth insights into the model designs and their capabilities. To facilitate the experiments and validation, we also summarize a collection of widely-used GAD datasets and empirical comparison. We further discuss multiple open problems to inspire more future high-quality research. A continuously updated repository for datasets, links to the codes of algorithms, and empirical comparison is available at this https URL.

[AI-59] Fault Analysis And Predictive Maintenance Of Induction Motor Using Machine Learning

链接: https://arxiv.org/abs/2409.09944
作者: Kavana Venkatesh,Neethi M
关键词-EN: crucial electrical equipment, range of applications, induction motor, wide range, induction motor faults
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Presented at ICEECCOT-2018, Published in IEEE Xplore, 6 pages, 3 figures

点击查看摘要

Abstract:Induction motors are among the most crucial pieces of electrical equipment and are extensively used in industries in a wide range of applications. This paper presents a machine learning model for the fault detection and classification of induction motor faults by using three phase voltages and currents as inputs. The aim of this work is to protect vital electrical components and to prevent abnormal event progression through early detection and diagnosis. This work presents a feed-forward artificial neural network model to detect some of the commonly occurring electrical faults like overvoltage, undervoltage, single phasing, unbalanced voltage, overload, and ground fault. A separate model-free monitoring system wherein the motor itself acts like a sensor is presented, and the only monitored signals are the inputs given to the motor. Limits for current and voltage values are set for the faulty and healthy conditions, which is done by a classifier. Real time data from a 0.33 HP induction motor is used to train and test the neural network. The model so developed analyses the voltage and current values given at a particular instant and classifies the data into no fault or the specific fault. The model is then interfaced with a real motor to accurately detect and classify the faults so that further necessary action can be taken.
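
A minimal stand-in for the described setup, assuming synthetic data: a small feed-forward classifier over the six monitored signals (three phase voltages, three phase currents). The toy labeling rule and network size are illustrative only, not the paper's configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: 6 inputs (three phase voltages + three phase currents) and a
# binary label standing in for {healthy, overvoltage}; real labels come from the motor.
rng = np.random.default_rng(0)
X = rng.normal(loc=[230, 230, 230, 5, 5, 5], scale=[20, 20, 20, 2, 2, 2], size=(1000, 6))
y = (X[:, :3].mean(axis=1) > 250).astype(int)  # toy rule marking overvoltage samples

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```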

[AI-60] Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies and Oracle Challenges

链接: https://arxiv.org/abs/2409.09927
作者: Vinay Samuel,Yue Zhou,Henry Peng Zou
关键词-EN: increasingly impressive results, large language models, language models achieve, models achieve increasingly, achieve increasingly impressive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 12 pages, 1 figure

点击查看摘要

Abstract:As large language models achieve increasingly impressive results, questions arise about whether such performance is from generalizability or mere data memorization. Thus, numerous data contamination detection methods have been proposed. However, these approaches are often validated with traditional benchmarks and early-stage LLMs, leaving uncertainty about their effectiveness when evaluating state-of-the-art LLMs on the contamination of more challenging benchmarks. To address this gap and provide a dual investigation of SOTA LLM contamination status and detection method robustness, we evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets often used in modern LLM evaluation. Our analysis reveals that (1) Current methods have non-trivial limitations in their assumptions and practical applications; (2) Notable difficulties exist in detecting contamination introduced during instruction fine-tuning with answer augmentation; and (3) Limited consistencies between SOTA contamination detection techniques. These findings highlight the complexity of contamination detection in advanced LLMs and the urgent need for further research on robust and generalizable contamination evaluation. Our code is available at this https URL.
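
None of the five evaluated detectors is spelled out in the abstract; as a flavor of the simplest family of such checks, the sketch below scores verbatim n-gram overlap between a benchmark example and a (hypothetical) training corpus.

```python
def ngrams(text: str, n: int = 8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_score(benchmark_example: str, training_corpus: str, n: int = 8) -> float:
    """Fraction of the example's n-grams that also appear in the corpus;
    high values hint at verbatim contamination."""
    ex = ngrams(benchmark_example, n)
    if not ex:
        return 0.0
    return len(ex & ngrams(training_corpus, n)) / len(ex)

print(overlap_score("the quick brown fox jumps over the lazy dog today",
                    "we saw that the quick brown fox jumps over the lazy dog today again",
                    n=5))
```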

[AI-61] SFR-RAG: Towards Contextually Faithful LLMs

链接: https://arxiv.org/abs/2409.09916
作者: Xuan-Phi Nguyen,Shrey Pandit,Senthil Purushwalkam,Austin Xu,Hailin Chen,Yifei Ming,Zixuan Ke,Silvio Savarese,Caiming Xiong,Shafiq Joty
关键词-EN: Retrieval Augmented Generation, Retrieval Augmented, enhance factual accuracy, integrates external contextual, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Technical report

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG), a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance, has emerged as a pivotal area in generative AI. The LLMs used in RAG applications are required to faithfully and completely comprehend the provided context and users’ questions, avoid hallucination, handle unanswerable, counterfactual or otherwise low-quality and irrelevant contexts, perform complex multi-hop reasoning and produce reliable citations. In this paper, we introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization. We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks, such as HotpotQA and TriviaQA, with consistent RAG settings to ensure reproducibility and consistency in model assessments. Experimental results demonstrate that our SFR-RAG-9B model outperforms leading baselines such as Command-R+ (104B) and GPT-4o, achieving state-of-the-art results in 3 out of 7 benchmarks in ContextualBench with significantly fewer parameters. The model is also shown to be resilient to alteration in the contextual information and behave appropriately when relevant context is removed. Additionally, the SFR-RAG model maintains competitive performance in general instruction-following tasks and function-calling capabilities.

[AI-62] REG: Refined Generalized Focal Loss for Road Asset Detection on Thai Highways Using Vision-Based Detection and Segmentation Models

链接: https://arxiv.org/abs/2409.09877
作者: Teerapong Panboonyuen
关键词-EN: Refined Generalized Focal, Generalized Focal Loss, advanced Refined Generalized, Refined Generalized, Generalized Focal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:This paper introduces a novel framework for detecting and segmenting critical road assets on Thai highways using an advanced Refined Generalized Focal Loss (REG) formulation. Integrated into state-of-the-art vision-based detection and segmentation models, the proposed method effectively addresses class imbalance and the challenges of localizing small, underrepresented road elements, including pavilions, pedestrian bridges, information signs, single-arm poles, bus stops, warning signs, and concrete guardrails. To improve both detection and segmentation accuracy, a multi-task learning strategy is adopted, optimizing REG across multiple tasks. REG is further enhanced by incorporating a spatial-contextual adjustment term, which accounts for the spatial distribution of road assets, and a probabilistic refinement that captures prediction uncertainty in complex environments, such as varying lighting conditions and cluttered backgrounds. Our rigorous mathematical formulation demonstrates that REG minimizes localization and classification errors by applying adaptive weighting to hard-to-detect instances while down-weighting easier examples. Experimental results show a substantial performance improvement, achieving a mAP50 of 80.34 and an F1-score of 77.87, significantly outperforming conventional methods. This research underscores the capability of advanced loss function refinements to enhance the robustness and accuracy of road asset detection and segmentation, thereby contributing to improved road safety and infrastructure management. For an in-depth discussion of the mathematical background and related methods, please refer to previous work available at this https URL.
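
The refined loss itself is not given in the abstract; the standard sigmoid focal loss below shows the base mechanism REG builds on, namely down-weighting easy examples so training concentrates on hard, underrepresented road assets. The spatial-contextual and probabilistic refinement terms are not reproduced here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard sigmoid focal loss: well-classified examples are down-weighted by
    (1 - p_t)^gamma so the gradient concentrates on hard instances."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.0, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```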

[AI-63] Critic as Lyapunov function (CALF): a model-free stability-ensuring agent

链接: https://arxiv.org/abs/2409.09869
作者: Pavel Osinenko,Grigory Yaremenko,Roman Zashchitin,Anton Bolychev,Sinan Ibrahim,Dmitrii Dobriborsci
关键词-EN: Lyapunov Function, Critic As Lyapunov, dynamical system stabilization, ensures online environment, agent called Critic
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: IEEE Conference on Decision and Control. Accepted for publication in proceedings of the conference

点击查看摘要

Abstract:This work presents and showcases a novel reinforcement learning agent called Critic As Lyapunov Function (CALF) which is model-free and ensures online stabilization of the environment, in other words, of the dynamical system. Online means that in each learning episode, the said environment is stabilized. This, as demonstrated in a case study with a mobile robot simulator, greatly improves the overall learning performance. The base actor-critic scheme of CALF is analogous to SARSA. The latter did not show any success in reaching the target in our studies. However, a modified version thereof, called SARSA-m here, did succeed in some learning scenarios. Still, CALF greatly outperformed the said approach. CALF was also demonstrated to improve a nominal stabilizer provided to it. In summary, the presented agent may be considered a viable approach to fusing classical control with reinforcement learning. Concurrent approaches are mostly either offline or model-based, like, for instance, those that fuse model-predictive control into the agent.

[AI-64] Towards Kinetic Manipulation of the Latent Space

链接: https://arxiv.org/abs/2409.09867
作者: Diego Porres
关键词-EN: Graphical User Interfaces, valleys and mountains, Convolutional Neural Networks, generative models, models are rich
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The latent spaces of many generative models are rich in unexplored valleys and mountains. The majority of tools used for exploring them are so far limited to Graphical User Interfaces (GUIs). While specialized hardware can be used for this task, we show that a simple feature extraction of pre-trained Convolutional Neural Networks (CNNs) from a live RGB camera feed does a very good job at manipulating the latent space with simple changes in the scene, with vast room for improvement. We name this new paradigm Visual-reactive Interpolation, and the full code can be found at this https URL.

[AI-65] Constructing a Singing Style Caption Dataset

链接: https://arxiv.org/abs/2409.09866
作者: Hyunjong Ok,Jaeho Lee
关键词-EN: Singing voice synthesis, voice generation, synthesis and conversion, conversion have emerged, emerged as significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Preprint

点击查看摘要

Abstract:Singing voice synthesis and conversion have emerged as significant subdomains of voice generation, leading to much demand for prompt-conditioned generation. Unlike common voice data, generating a singing voice requires an understanding of various associated vocal and musical characteristics, such as the vocal tone of the singer or emotional expressions. However, existing open-source audio-text datasets for voice generation tend to capture only a very limited range of attributes, often missing musical characteristics of the audio. To fill this gap, we introduce S2Cap, an audio-text pair dataset with a diverse set of attributes. S2Cap consists of pairs of textual prompts and music audio samples with a wide range of vocal and musical attributes, including pitch, volume, tempo, mood, singer’s gender and age, and musical genre and emotional expression. Utilizing S2Cap, we suggest an effective novel baseline algorithm for singing style captioning. Singing style captioning, which we propose here for the first time, is a task related to voice generation that generates text descriptions of vocal characteristics. First, to mitigate the misalignment between the audio encoder and the text decoder, we present a novel mechanism called CRESCENDO, which utilizes positive-pair similarity learning to synchronize the embedding spaces of a pretrained audio encoder to get similar embeddings with a text encoder. We additionally supervise the model using the singer’s voice, which is demixed from the accompaniment. This supervision allows the model to more accurately capture vocal characteristics, leading to improved singing style captions that better reflect the style of the singer. The dataset and the codes are available at this https URL.
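
CRESCENDO's positive-pair similarity learning is described only at a high level; one plausible reading, sketched below, is a CLIP-style symmetric contrastive loss that pulls paired audio and text embeddings together (the embedding size and temperature are assumptions).

```python
import torch
import torch.nn.functional as F

def positive_pair_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: the i-th audio embedding should be most
    similar to the i-th text embedding, and vice versa."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    labels = torch.arange(a.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

audio = torch.randn(8, 512)   # assumed embedding size
text = torch.randn(8, 512)
print(positive_pair_alignment_loss(audio, text))
```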

[AI-66] A Survey of Out-of-distribution Generalization for Graph Machine Learning from a Causal View

链接: https://arxiv.org/abs/2409.09858
作者: Jing Ma
关键词-EN: range of tasks, successfully applied, wide range, GML, Graph machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 2 figures, 1 table

点击查看摘要

Abstract:Graph machine learning (GML) has been successfully applied across a wide range of tasks. Nonetheless, GML faces significant challenges in generalizing over out-of-distribution (OOD) data, which raises concerns about its wider applicability. Recent advancements have underscored the crucial role of causality-driven approaches in overcoming these generalization challenges. Distinct from traditional GML methods that primarily rely on statistical dependencies, causality-focused strategies delve into the underlying causal mechanisms of data generation and model prediction, thus significantly improving the generalization of GML across different environments. This paper offers a thorough review of recent progress in causality-involved GML generalization. We elucidate the fundamental concepts of employing causality to enhance graph model generalization and categorize the various approaches, providing detailed descriptions of their methodologies and the connections among them. Furthermore, we explore the incorporation of causality in other related important areas of trustworthy GML, such as explanation, fairness, and robustness. Concluding with a discussion on potential future research directions, this review seeks to articulate the continuing development and future potential of causality in enhancing the trustworthiness of graph machine learning.

[AI-67] Latent Diffusion Models for Controllable RNA Sequence Generation

链接: https://arxiv.org/abs/2409.09828
作者: Kaixuan Huang,Yukang Yang,Kaidi Fu,Yanyi Chu,Le Cong,Mengdi Wang
关键词-EN: optimizing discrete RNA, RNA, paper presents RNAdiffusion, RNA sequences, discrete RNA sequences
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:This paper presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences. RNA is a particularly dynamic and versatile molecule in biological processes. RNA sequences exhibit high variability and diversity, characterized by their variable lengths, flexible three-dimensional structures, and diverse functions. We utilize pretrained BERT-type models to encode raw RNAs into token-level biologically meaningful representations. A Q-Former is employed to compress these representations into a fixed-length set of latent vectors, with an autoregressive decoder trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we train reward networks to estimate functional properties of RNA from the latent variables. We employ gradient-based guidance during the backward diffusion process, aiming to generate RNA sequences that are optimized for higher rewards. Empirical experiments confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological indicators. We fine-tuned the diffusion model on untranslated regions (UTRs) of mRNA and optimize sample sequences for protein translation efficiencies. Our guided diffusion model effectively generates diverse UTR sequences with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), surpassing baselines. These results hold promise for studies on RNA sequence-function relationships, protein synthesis, and enhancing therapeutic RNA design.

[AI-68] On the Effect of Robot Errors on Human Teaching Dynamics

链接: https://arxiv.org/abs/2409.09827
作者: Jindan Huang,Isaac Sheidlower,Reuben M. Aronson,Elaine Schaertl Short
关键词-EN: leverages human knowledge, facilitate agent learning, gaining popularity, field of robotics, knowledge about real-world
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Accepted to 2024 International Conference on Human-Agent Interaction (HAI)

点击查看摘要

Abstract:Human-in-the-loop learning is gaining popularity, particularly in the field of robotics, because it leverages human knowledge about real-world tasks to facilitate agent learning. When people instruct robots, they naturally adapt their teaching behavior in response to changes in robot performance. While current research predominantly focuses on integrating human teaching dynamics from an algorithmic perspective, understanding these dynamics from a human-centered standpoint is an under-explored, yet fundamental problem. Addressing this issue will enhance both robot learning and user experience. Therefore, this paper explores one potential factor contributing to the dynamic nature of human teaching: robot errors. We conducted a user study to investigate how the presence and severity of robot errors affect three dimensions of human teaching dynamics: feedback granularity, feedback richness, and teaching time, in both forced-choice and open-ended teaching contexts. The results show that people tend to spend more time teaching robots with errors, provide more detailed feedback over specific segments of a robot’s trajectory, and that robot error can influence a teacher’s choice of feedback modality. Our findings offer valuable insights for designing effective interfaces for interactive learning and optimizing algorithms to better understand human intentions.

[AI-69] GP-GPT: Large Language Model for Gene-Phenotype Mapping

链接: https://arxiv.org/abs/2409.09825
作者: Yanjun Lyu,Zihao Wu,Lu Zhang,Jing Zhang,Yiwei Li,Wei Ruan,Zhengliang Liu,Xiaowei Yu,Chao Cao,Tong Chen,Minheng Chen,Yan Zhuang,Xiang Li,Rongjie Liu,Chao Huang,Wentao Li,Tianming Liu,Dajiang Zhu
关键词-EN: Pre-trained large language, attracted increasing attention, natural language processing, biomedical domains due, Pre-trained large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pre-trained large language models(LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-sources genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical field. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3 and GPT-4. These results highlight GP-GPT’s potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation demonstrated the subtle changes of bio-factor entities’ representations in the GP-GPT, which suggested the opportunities for the application of LLMs to advancing gene-phenotype research.

[AI-70] Causal Inference with Large Language Model: A Survey

链接: https://arxiv.org/abs/2409.09822
作者: Jing Ma
关键词-EN: data mining capabilities, mathematical reasoning, medicine and economics, demanding a complicated, human knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Causal inference has been a pivotal challenge across diverse domains such as medicine and economics, demanding a complicated integration of human knowledge, mathematical reasoning, and data mining capabilities. Recent advancements in natural language processing (NLP), particularly with the advent of large language models (LLMs), have introduced promising opportunities for traditional causal inference tasks. This paper reviews recent progress in applying LLMs to causal inference, encompassing various tasks spanning different levels of causation. We summarize the main causal problems and approaches, and present a comparison of their evaluation results in different causal scenarios. Furthermore, we discuss key findings and outline directions for future research, underscoring the potential implications of integrating LLMs in advancing causal inference methodologies.

[AI-71] Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion ECCV2024

链接: https://arxiv.org/abs/2409.09808
作者: Hui Shen,Zhongwei Wan,Xin Wang,Mi Zhang
关键词-EN: Transformer architecture, introduces Fast Mamba, Vision Mamba, based on Transformer, Vim models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Camera ready version of ECCV 2024 The Fourth Workshop on Computational Aspects of Deep Learning

点击查看摘要

Abstract:Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to methods based on the Transformer architecture. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique to enhance the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers based on a suite of cross-layer strategies, instead of simply applying token fusion uniformly across all the layers as existing works propose. We evaluate the performance of Famba-V on CIFAR-100. Our results show that Famba-V is able to enhance the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. These results altogether demonstrate Famba-V as a promising efficiency enhancement technique for Vim models.
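
The cross-layer strategies are not detailed in the abstract; the sketch below shows only the core token-fusion step, merging the most similar neighbouring tokens by averaging, applied within a single layer for illustration.

```python
import torch
import torch.nn.functional as F

def fuse_similar_tokens(x, r=2):
    """Merge the r most similar neighbouring token pairs by averaging them.
    x: (n_tokens, dim); returns a shorter sequence (a rough stand-in for
    token fusion, applied here inside a single layer)."""
    sims = F.cosine_similarity(x[:-1], x[1:], dim=-1)   # similarity of pairs (i, i+1)
    used, pair_starts = set(), set()
    for i in torch.argsort(sims, descending=True).tolist():
        if i in used or i + 1 in used:
            continue
        pair_starts.add(i)
        used.update({i, i + 1})
        if len(pair_starts) == r:
            break
    out, i = [], 0
    while i < x.shape[0]:
        if i in pair_starts:
            out.append((x[i] + x[i + 1]) / 2)  # fuse the pair into one token
            i += 2
        else:
            out.append(x[i])
            i += 1
    return torch.stack(out)

tokens = torch.randn(10, 16)
print(fuse_similar_tokens(tokens, r=2).shape)  # torch.Size([8, 16])
```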

[AI-72] Abnormal Event Detection In Videos Using Deep Embedding

链接: https://arxiv.org/abs/2409.09804
作者: Darshan Venkatrayappa
关键词-EN: Abnormal event detection, anomaly detection, Abnormal event, video anomaly detection, anomaly detection requires
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Abnormal event detection or anomaly detection in surveillance videos is currently a challenge because of the diversity of possible events. Due to the lack of anomalous events at training time, anomaly detection requires the design of learning methods without supervision. In this work we propose an unsupervised approach for video anomaly detection with the aim to jointly optimize the objectives of the deep neural network and the anomaly detection task using a hybrid architecture. Initially, a convolutional autoencoder is pre-trained in an unsupervised manner with a fusion of depth, motion and appearance features. In the second step, we utilize the encoder part of the pre-trained autoencoder and extract the embeddings of the fused input. We then jointly train/fine-tune the encoder to map the embeddings to a hypercenter. Thus, embeddings of normal data fall near the hypercenter, whereas embeddings of anomalous data fall far away from the hypercenter.
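
A minimal sketch of the hypercenter idea (in the spirit of Deep SVDD): embeddings of normal clips are pulled toward a fixed center, and the squared distance to that center serves as the anomaly score at test time. The embedding size and the zero center are assumptions.

```python
import torch

def hypercenter_loss(embeddings, center):
    """Training objective: pull embeddings of normal clips towards the hypercenter."""
    return ((embeddings - center) ** 2).sum(dim=1).mean()

def anomaly_score(embedding, center):
    """Test-time score: squared distance from the hypercenter."""
    return ((embedding - center) ** 2).sum().item()

center = torch.zeros(128)                   # assumed embedding size and center
normal_batch = 0.1 * torch.randn(32, 128)   # embeddings of normal clips
odd_clip = 3.0 * torch.randn(128)           # a far-away embedding
print(hypercenter_loss(normal_batch, center), anomaly_score(odd_clip, center))
```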

[AI-73] Multiple Rotation Averaging with Constrained Reweighting Deep Matrix Factorization

链接: https://arxiv.org/abs/2409.09790
作者: Shiqi Li,Jihua Zhu,Yifan Xie,Naiwen Hu,Mingchen Zhu,Zhongyu Li,Di Wang
关键词-EN: Multiple rotation averaging, rotation averaging plays, rotation averaging, robotics domains, plays a crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Multiple rotation averaging plays a crucial role in computer vision and robotics domains. The conventional optimization-based methods optimize a nonlinear cost function based on certain noise assumptions, while most previous learning-based methods require ground truth labels in the supervised training process. Recognizing the handcrafted noise assumption may not be reasonable in all real-world scenarios, this paper proposes an effective rotation averaging method for mining data patterns in a learning manner while avoiding the requirement of labels. Specifically, we apply deep matrix factorization to directly solve the multiple rotation averaging problem in unconstrained linear space. For deep matrix factorization, we design a neural network model, which is explicitly low-rank and symmetric to better suit the background of multiple rotation averaging. Meanwhile, we utilize a spanning tree-based edge filtering to suppress the influence of rotation outliers. What’s more, we also adopt a reweighting scheme and dynamic depth selection strategy to further improve the robustness. Our method synthesizes the merit of both optimization-based and learning-based methods. Experimental results on various datasets validate the effectiveness of our proposed method.

[AI-74] BEnDEM: A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching

链接: https://arxiv.org/abs/2409.09787
作者: RuiKang OuYang,Bo Qiang,José Miguel Hernández-Lobato
关键词-EN: Developing an efficient, efficient sampler capable, Boltzmann distribution, molecular dynamics, identically distributed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 20 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g. molecular dynamics. In this work, we intend to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY MATCHING, which theoretically has lower variance and more complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to EnDEM to balance between bias and variance. We evaluate EnDEM and BEnDEM on a 2-dimensional 40-mode Gaussian Mixture Model (GMM) and a 4-particle double-well potential (DW-4). The experimental results demonstrate that BEnDEM can achieve state-of-the-art performance while being more robust.

[AI-75] Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging and Emotion Recognition

链接: https://arxiv.org/abs/2409.09785
作者: Chao-Han Huck Yang,Taejin Park,Yuan Gong,Yuanchao Li,Zhehuai Chen,Yen-Ting Lin,Chen Chen,Yuchen Hu,Kunal Dhawan,Piotr Żelasko,Chao Zhang,Yun-Nung Chen,Yu Tsao,Jagadeesh Balam,Boris Ginsburg,Sabato Marco Siniscalchi,Eng Siong Chng,Peter Bell,Catherine Lai,Shinji Watanabe,Andreas Stolcke
关键词-EN: text decoding results, enhance acoustic modeling, automatic speech recognition, ASR, recent advances
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: IEEE SLT 2024. The initial draft version was completed in December 2023. Post-ASR Text Processing and Understanding Community: this https URL

点击查看摘要

Abstract:Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.

[AI-76] Enhancing Lesion Segmentation in PET/CT Imaging with Deep Learning and Advanced Data Preprocessing Techniques

链接: https://arxiv.org/abs/2409.09784
作者: Jiayi Liu,Qiaoyi Xue,Youdan Feng,Tianming Xu,Kaixin Shen,Chuyun Shen,Yuhang Shi
关键词-EN: escalating global cancer, global cancer burden, cancer burden underscores, precise diagnostic tools, tools in oncology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The escalating global cancer burden underscores the critical need for precise diagnostic tools in oncology. This research employs deep learning to enhance lesion segmentation in PET/CT imaging, utilizing a dataset of 900 whole-body FDG-PET/CT and 600 PSMA-PET/CT studies from the AutoPET challenge III. Our methodical approach includes robust preprocessing and data augmentation techniques to ensure model robustness and generalizability. We investigate the influence of non-zero normalization and modifications to the data augmentation pipeline, such as the introduction of RandGaussianSharpen and adjustments to the Gamma transform parameter. This study aims to contribute to the standardization of preprocessing and augmentation strategies in PET/CT imaging, potentially improving the diagnostic accuracy and the personalized management of cancer patients. Our code will be open-sourced and available at this https URL.

[AI-77] Automated Lesion Segmentation in Whole-Body PET/CT in a multitracer setting

链接: https://arxiv.org/abs/2409.09766
作者: Qiaoyi Xue,Youdan Feng,Jiayi Liu,Tianming Xu,Kaixin Shen,Chuyun Shen,Yuhang Shi
关键词-EN: FDG and PSMA, PSMA PET, FDG, PSMA, PSMA images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study explores a workflow for automated segmentation of lesions in FDG and PSMA PET/CT images. Due to the substantial differences in image characteristics between FDG and PSMA, specialized preprocessing steps are required. Utilizing YOLOv8 for data classification, the FDG and PSMA images are preprocessed separately before feeding them into the segmentation models, aiming to improve lesion segmentation accuracy. The study focuses on evaluating the performance of automated segmentation workflow for multitracer PET images. The findings are expected to provide critical insights for enhancing diagnostic workflows and patient-specific treatment plans. Our code will be open-sourced and available at this https URL.

[AI-78] ELMI: Interactive and Intelligent Sign Language Translation of Lyrics for Song Signing

链接: https://arxiv.org/abs/2409.09760
作者: Suhyeon Yoo,Khai N. Truong,Young-Ho Kim
关键词-EN: Deaf and hearing, language remains cumbersome, sign language remains, video-sharing platforms, cumbersome and inaccessible
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 18 pages excluding reference and appendix

点击查看摘要

Abstract:d/Deaf and hearing song-signers are becoming prevalent on video-sharing platforms, but translating songs into sign language remains cumbersome and inaccessible. Our formative study revealed the challenges song-signers face, including semantic, syntactic, expressive, and rhythmic considerations in translations. We present ELMI, an accessible song-signing tool that assists in translating lyrics into sign language. ELMI enables users to edit glosses line-by-line, with real-time synced lyric highlighting and music video snippets. Users can also chat with a large language model-driven AI to discuss meaning, glossing, emoting, and timing. Through an exploratory study with 13 song-signers, we examined how ELMI facilitates their workflows and how song-signers leverage and receive an LLM-driven chat for translation. Participants successfully adopted ELMI to song-signing, with active discussions on the fly. They also reported improved confidence and independence in their translations, finding ELMI encouraging, constructive, and informative. We discuss design implications for leveraging LLMs in culturally sensitive song-signing translations.

[AI-79] Explore the Hallucination on Low-level Perception for MLLMs

链接: https://arxiv.org/abs/2409.09748
作者: Yinan Sun,Zicheng Zhang,Haoning Wu,Xiaohong Liu,Weisi Lin,Guangtao Zhai,Xiongkuo Min
关键词-EN: Multi-modality Large Language, Large Language Models, Multi-modality Large, Large Language, development of Multi-modality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid development of Multi-modality Large Language Models (MLLMs) has significantly influenced various aspects of industry and daily life, showcasing impressive capabilities in visual perception and understanding. However, these models also exhibit hallucinations, which limit their reliability as AI systems, especially in tasks involving low-level visual perception and understanding. We believe that hallucinations stem from a lack of explicit self-awareness in these models, which directly impacts their overall performance. In this paper, we aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. To this end, we present QL-Bench, a benchmark setting to simulate human responses to low-level vision, investigating self-awareness in low-level visual perception through visual question answering related to low-level attributes such as clarity and lighting. Specifically, we construct the LLSAVisionQA dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features. Through the evaluation of 15 MLLMs, we demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped. Notably, for the same model, simpler questions are often answered more accurately than complex ones. However, self-awareness appears to improve when addressing more challenging questions. We hope that our benchmark will motivate further research, particularly focused on enhancing the self-awareness of MLLMs in tasks involving low-level visual perception and understanding.

[AI-80] Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data

链接: https://arxiv.org/abs/2409.09741
作者: Bastián González-Bustamante
关键词-EN: political content, Nous Hermes, article benchmarked, benchmarked the ability, ability of OpenAI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Paper prepared for delivery at the 8th Monash-Warwick-Zurich Text-as-Data Workshop, September 16-17, 2024: 11 pages, 3 tables, 3 figures

点击查看摘要

Abstract:This article benchmarked the ability of OpenAI’s GPTs and a number of open-source LLMs to perform annotation tasks on political content. We used a novel protest event dataset comprising more than three million digital interactions and created a gold standard that includes ground-truth labels annotated by human coders about toxicity and incivility on social media. We included in our benchmark Google’s Perspective algorithm, which, along with GPTs, was employed throughout their respective APIs while the open-source LLMs were deployed locally. The findings show that Perspective API using a laxer threshold, GPT-4o, and Nous Hermes 2 Mixtral outperform the zero-shot classification annotations of the other LLMs. In addition, Nous Hermes 2 and Mistral OpenOrca, with a smaller number of parameters, are able to perform the task with high performance, being attractive options that could offer good trade-offs between performance, implementation costs and computing time. Ancillary findings from experiments setting different temperature levels show that although GPTs tend to show not only excellent computing time but also overall good levels of reliability, only open-source LLMs ensure full reproducibility in the annotation.

[AI-81] From Challenges and Pitfalls to Recommendations and Opportunities: Implementing Federated Learning in Healthcare

链接: https://arxiv.org/abs/2409.09727
作者: Ming Li,Pengcheng Xu,Junjie Hu,Zeyu Tang,Guang Yang
关键词-EN: holds great potential, learning holds great, enabling large-scale healthcare, large-scale healthcare research, Federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning holds great potential for enabling large-scale healthcare research and collaboration across multiple centres while ensuring data privacy and security are not compromised. Although numerous recent studies suggest or utilize federated learning based methods in healthcare, it remains unclear which ones have potential clinical utility. This review paper considers and analyzes the most recent studies up to May 2024 that describe federated learning based methods in healthcare. After a thorough review, we find that the vast majority are not appropriate for clinical use due to their methodological flaws and/or underlying biases which include but are not limited to privacy concerns, generalization issues, and communication costs. As a result, the effectiveness of federated learning in healthcare is significantly compromised. To overcome these challenges, we provide recommendations and promising opportunities that might be implemented to resolve these problems and improve the quality of model development in federated learning with healthcare.
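
Most of the reviewed systems build on weighted model averaging; the sketch below is the plain FedAvg aggregation step over client state dicts, shown only to ground the discussion. It does not implement any of the privacy or communication-cost mitigations the review calls for.

```python
import torch

def fedavg(state_dicts, num_examples):
    """Weighted average of client model weights (FedAvg), the aggregation step
    most federated healthcare studies build on."""
    total = float(sum(num_examples))
    return {
        key: sum(sd[key] * (n / total) for sd, n in zip(state_dicts, num_examples))
        for key in state_dicts[0]
    }

# Hypothetical: three hospitals share the weights of the same tiny model.
clients = [{"w": torch.ones(2, 2) * i} for i in (1.0, 2.0, 3.0)]
print(fedavg(clients, num_examples=[100, 200, 700])["w"])  # weighted mean = 2.6
```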

[AI-82] Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents

链接: https://arxiv.org/abs/2409.09717
作者: Justas Andriuškevičius,Junzi Sun
关键词-EN: air traffic control, Recent developments, air traffic, traffic control studies, language models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent developments in language models have created new opportunities in air traffic control studies. The current focus is primarily on text and language-based use cases. However, these language models may offer a higher potential impact in the air traffic control domain, thanks to their ability to interact with air traffic environments in an embodied agent form. They also provide a language-like reasoning capability to explain their decisions, which has been a significant roadblock for the implementation of automatic air traffic control. This paper investigates the application of a language model-based agent with function-calling and learning capabilities to resolve air traffic conflicts without human intervention. The main components of this research are foundational large language models, tools that allow the agent to interact with the simulator, and a new concept, the experience library. An innovative part of this research, the experience library, is a vector database that stores synthesized knowledge that agents have learned from interactions with the simulations and language models. To evaluate the performance of our language model-based agent, both open-source and closed-source models were tested. The results of our study reveal significant differences in performance across various configurations of the language model-based agents. The best-performing configuration was able to solve all but one of the 120 imminent conflict scenarios, including up to four aircraft at the same time. Most importantly, the agents are able to provide human-level text explanations on traffic situations and conflict resolution strategies.
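
The experience library is described as a vector database of synthesized lessons; a minimal sketch of that idea is given below, with a toy bag-of-characters embedding standing in for a real embedding model. The class and method names are hypothetical.

```python
import numpy as np

class ExperienceLibrary:
    """Minimal vector store: synthesized lessons are embedded, and the most
    similar past experiences are retrieved for a new traffic situation."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # hypothetical embedding function
        self.items, self.vecs = [], []

    def add(self, lesson: str):
        self.items.append(lesson)
        self.vecs.append(self.embed_fn(lesson))

    def retrieve(self, situation: str, k: int = 2):
        q = self.embed_fn(situation)
        vecs = np.stack(self.vecs)
        sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
        return [self.items[i] for i in np.argsort(-sims)[:k]]

# Toy embedding: bag of characters, just to make the example self-contained.
def toy_embed(text):
    v = np.zeros(26)
    for c in text.lower():
        if c.isalpha():
            v[ord(c) - 97] += 1
    return v

lib = ExperienceLibrary(toy_embed)
lib.add("Heading change of 20 degrees resolved a converging conflict.")
lib.add("Climb clearance worked when aircraft approached head-on.")
print(lib.retrieve("Two aircraft converging at the same flight level", k=1))
```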

[AI-83] Exploring Utility in a Real-World Warehouse Optimization Problem: Formulation Based on Quantum Annealers and Preliminary Results

链接: https://arxiv.org/abs/2409.09706
作者: Eneko Osaba,Esther Villar-Rodriguez,Antón Asla
关键词-EN: major challenges faced, D-Wave Quantum Annealer, Warehouse Optimization Problem, current NISQ-era, major challenges
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
*备注: 2 pages, 2 figures. Paper presented at the 5th IEEE International Conference on Quantum Computing and Engineering (IEEE QCE 2024)

点击查看摘要

Abstract:In the current NISQ-era, one of the major challenges faced by researchers and practitioners lies in figuring out how to combine quantum and classical computing in the most efficient and innovative way. In this paper, we present a mechanism coined as Quantum Initialization for Warehouse Optimization Problem that resorts to D-Wave’s Quantum Annealer. The module has been specifically designed to be embedded into already existing classical software dedicated to the optimization of a real-world industrial problem. We preliminarily tested the implemented mechanism through a two-phase experiment against the classical version of the software.

[AI-84] GFlowNet Pretraining with Inexpensive Rewards

链接: https://arxiv.org/abs/2409.09702
作者: Mohit Pandey,Gopeshh Subbaraj,Emmanuel Bengio
关键词-EN: Generative Flow Networks, Flow Networks, unnormalized reward distributions, Generative Flow, high-quality molecular structures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets), a class of generative models have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from unnormalized reward distributions. Previous works in this direction often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using offline drug-like molecule datasets, which conditions A-GFNs on inexpensive yet informative molecular descriptors such as drug-likeliness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further our method by implementing a goal-conditioned fine-tuning process, which adapts A-GFNs to optimize for specific target properties. In this work, we pretrain A-GFN on the ZINC15 offline dataset and employ robust evaluation metrics to show the effectiveness of our approach when compared to other relevant baseline methods in drug design.

[AI-85] Training Safe Neural Networks with Global SDP Bounds

链接: https://arxiv.org/abs/2409.09687
作者: Roman Soletskyi,David “davidad” Dalrymple
关键词-EN: formal safety guarantees, semidefinite programming, SDP, paper presents, guarantees using semidefinite
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach to training neural networks with formal safety guarantees using semidefinite programming (SDP) for verification. Our method focuses on verifying safety over large, high-dimensional input regions, addressing limitations of existing techniques that focus on adversarial robustness bounds. We introduce an ADMM-based training scheme for an accurate neural network classifier on the Adversarial Spheres dataset, achieving provably perfect recall with input dimensions up to d=40 . This work advances the development of reliable neural network verification methods for high-dimensional systems, with potential applications in safe RL policies.

[AI-86] ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models

链接: https://arxiv.org/abs/2409.09662
作者: Inhwa Song,SoHyun Park,Sachin R. Pendse,Jessica Lee Schleider,Munmun De Choudhury,Young-Ho Kim
关键词-EN: Expressing stressful experiences, Expressing stressful, physical health, thoughts and emotions, stressful experiences
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 17 pages excluding reference and appendix

点击查看摘要

Abstract:Expressing stressful experiences in words is proven to improve mental and physical health, but individuals often disengage with writing interventions as they struggle to organize their thoughts and emotions. Reflective prompts have been used to provide direction, and large language models (LLMs) have demonstrated the potential to provide tailored guidance. Current systems often limit users’ flexibility to direct their reflections. We thus present ExploreSelf, an LLM-driven application designed to empower users to control their reflective journey. ExploreSelf allows users to receive adaptive support through dynamically generated questions. Through an exploratory study with 19 participants, we examine how participants explore and reflect on personal challenges using ExploreSelf. Our findings demonstrate that participants valued the balance between guided support and freedom to control their reflective journey, leading to deeper engagement and insight. Building on our findings, we discuss implications for designing LLM-driven tools that promote user empowerment through effective reflective practices.

[AI-87] KAN v.s. MLP for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2409.09653
作者: Haihong Guo,Fengxin Li,Jiao Li,Hongyan Liu
关键词-EN: emerging neural network, neural network architecture, emerging neural, KAN, Kolmogorov-Arnold Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages,2 figures

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KAN) is an emerging neural network architecture in machine learning. It has greatly interested the research community about whether KAN can be a promising alternative of the commonly used Multi-Layer Perceptions (MLP). Experiments in various fields demonstrated that KAN-based machine learning can achieve comparable if not better performance than MLP-based methods, but with much smaller parameter scales and are more explainable. In this paper, we explore the incorporation of KAN into the actor and critic networks for offline reinforcement learning (RL). We evaluated the performance, parameter scales, and training efficiency of various KAN and MLP based conservative Q-learning (CQL) on the the classical D4RL benchmark for offline RL. Our study demonstrates that KAN can achieve performance close to the commonly used MLP with significantly fewer parameters. This provides us an option to choose the base networks according to the requirements of the offline RL tasks.

[AI-88] Self-supervised Learning for Acoustic Few-Shot Classification

链接: https://arxiv.org/abs/2409.09647
作者: Jingyong Liang,Bernd Meyer,Issac Ning Lee,Thanh-Toan Do
关键词-EN: important approaches, reducing labelling requirements, reducing labelling, data, actual task data
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Labelled data are limited and self-supervised learning is one of the most important approaches for reducing labelling requirements. While it has been extensively explored in the image domain, it has so far not received the same amount of attention in the acoustic domain. Yet, reducing labelling is a key requirement for many acoustic applications. Specifically in bioacoustic, there are rarely sufficient labels for fully supervised learning available. This has led to the widespread use of acoustic recognisers that have been pre-trained on unrelated data for bioacoustic tasks. We posit that training on the actual task data and combining self-supervised pre-training with few-shot classification is a superior approach that has the ability to deliver high accuracy even when only a few labels are available. To this end, we introduce and evaluate a new architecture that combines CNN-based preprocessing with feature extraction based on state space models (SSMs). This combination is motivated by the fact that CNN-based networks alone struggle to capture temporal information effectively, which is crucial for classifying acoustic signals. SSMs, specifically S4 and Mamba, on the other hand, have been shown to have an excellent ability to capture long-range dependencies in sequence data. We pre-train this architecture using contrastive learning on the actual task data and subsequent fine-tuning with an extremely small amount of labelled data. We evaluate the performance of this proposed architecture for (n-shot, n-class) classification on standard benchmarks as well as real-world data. Our evaluation shows that it outperforms state-of-the-art architectures on the few-shot classification problem.

[AI-89] COSCO: A Sharpness-Aware Training Framework for Few-shot Multivariate Time Series Classification CIKM’24

链接: https://arxiv.org/abs/2409.09645
作者: Jesus Barreda,Ashley Gomez,Ruben Puga,Kaixiong Zhou,Li Zhang
关键词-EN: time series classification, Multivariate time series, time series, series classification, domains of applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 5 pages, 5 figures, CIKM '24 Short Paper Track

点击查看摘要

Abstract:Multivariate time series classification is an important task with widespread domains of applications. Recently, deep neural networks (DNN) have achieved state-of-the-art performance in time series classification. However, they often require large expert-labeled training datasets which can be infeasible in practice. In few-shot settings, i.e. only a limited number of samples per class are available in training data, DNNs show a significant drop in testing accuracy and poor generalization ability. In this paper, we propose to address these problems from an optimization and a loss function perspective. Specifically, we propose a new learning framework named COSCO consisting of a sharpness-aware minimization (SAM) optimization and a Prototypical loss function to improve the generalization ability of DNN for multivariate time series classification problems under few-shot setting. Our experiments demonstrate our proposed method outperforms the existing baseline methods. Our source code is available at: this https URL.

[AI-90] AACessTalk: Fostering Communication between Minimally Verbal Autistic Children and Parents with Contextual Guidance and Card Recommendation

链接: https://arxiv.org/abs/2409.09641
作者: Dasom Choi,SoHyun Park,Kyungah Lee,Hwajung Hong,Young-Ho Kim
关键词-EN: minimally verbal autistic, express subtle emotions, verbal autistic, nonverbal cues, nuanced signals
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 19 pages excluding reference and appendix

点击查看摘要

Abstract:As minimally verbal autistic (MVA) children communicate with parents through few words and nonverbal cues, parents often struggle to encourage their children to express subtle emotions and needs and to grasp their nuanced signals. We present AACessTalk, a tablet-based, AI-mediated communication system that facilitates meaningful exchanges between an MVA child and a parent. AACessTalk provides real-time guides to the parent to engage the child in conversation and, in turn, recommends contextual vocabulary cards to the child. Through a two-week deployment study with 11 MVA child-parent dyads, we examine how AACessTalk fosters everyday conversation practice and mutual engagement. Our findings show high engagement from all dyads, leading to increased frequency of conversation and turn-taking. AACessTalk also encouraged parents to explore their own interaction strategies and empowered the children to have more agency in communication. We discuss the implications of designing technologies for balanced communication dynamics in parent-MVA child interaction.

[AI-91] A Novel Framework For Text Detection From Natural Scene Images With Complex Background

链接: https://arxiv.org/abs/2409.09635
作者: Basavaraj Kaladagi,Jagadeesh Pujari
关键词-EN: Recognizing texts, hard problem, varied and complicated, Wavelet Transforms, Recognizing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recognizing texts from camera images is a known hard problem because of the difficulties in text detection from the varied and complicated background. In this paper we propose a novel and efficient method to detect text regions from images with complex background using Wavelet Transforms. The framework uses Wavelet Transformation of the original image in its grayscale form followed by Sub-band filtering. Then Region clustering technique is applied using centroids of the regions, further Bounding box is fitted to each region thus identifying the text regions. This method is more sophisticated and efficient than the previous methods as it doesn’t stick to a particular font size of the text, thus making it generalized. The sample set used for experimental purpose consists of 50 images with varying backgrounds. Images with edge prominence are considered. Furthermore, our method can be easily customized for applications with different scopes.

[AI-92] Confidence Estimation for LLM-Based Dialogue State Tracking

链接: https://arxiv.org/abs/2409.09629
作者: Yi-Jyun Sun,Suvodip Dey,Dilek Hakkani-Tur,Gokhan Tur
关键词-EN: critical for Conversational, large language models, preventing over-reliance, outputs is critical, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Estimation of a model’s confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
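A minimal sketch of one of the confidence variants the abstract lists (a softmax/token-score based score), assuming the DST model exposes per-token log-probabilities; the toy numbers and correctness labels below are placeholders used only to exercise the AUC calculation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sequence_confidence(token_logprobs):
    """One simple variant: geometric-mean token probability of the generated value."""
    return float(np.exp(np.mean(token_logprobs)))

# In practice these come from the LLM's output scores and the DST ground truth.
token_logprobs_per_example = [[-0.1, -0.2], [-1.5, -2.0], [-0.05, -0.3]]
is_correct = [1, 0, 1]

scores = [sequence_confidence(lp) for lp in token_logprobs_per_example]
print("calibration AUC:", roc_auc_score(is_correct, scores))
```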

[AI-93] Can Large Language Models Grasp Event Signals? Exploring Pure Zero-Shot Event-based Recognition

链接: https://arxiv.org/abs/2409.09628
作者: Zongyou Yu,Qiang Qu,Xiaoming Chen,Chen Wang
关键词-EN: Recent advancements, demonstrated promising results, demonstrated promising, event-based, Recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in event-based zero-shot object recognition have demonstrated promising results. However, these methods heavily depend on extensive training and are inherently constrained by the characteristics of CLIP. To the best of our knowledge, this research is the first study to explore the understanding capabilities of large language models (LLMs) for event-based visual content. We demonstrate that LLMs can achieve event-based object recognition without additional training or fine-tuning in conjunction with CLIP, effectively enabling pure zero-shot event-based recognition. Particularly, we evaluate the ability of GPT-4o / 4turbo and two other open-source LLMs to directly recognize event-based visual content. Extensive experiments are conducted across three benchmark datasets, systematically assessing the recognition accuracy of these models. The results show that LLMs, especially when enhanced with well-designed prompts, significantly improve event-based zero-shot recognition performance. Notably, GPT-4o outperforms the compared models and exceeds the recognition accuracy of state-of-the-art event-based zero-shot methods on N-ImageNet by five orders of magnitude. The implementation of this paper is available at this https URL.

[AI-94] Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics

链接: https://arxiv.org/abs/2409.09626
作者: Yi Ren,Danica J. Sutherland
关键词-EN: Obtaining compositional mappings, Obtaining compositional, Obtaining, generalize well compositionally, compositional mappings
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 4 pages

点击查看摘要

Abstract:Obtaining compositional mappings is important for the model to generalize well compositionally. To better understand when and how to encourage the model to learn such mappings, we study their uniqueness through different perspectives. Specifically, we first show that the compositional mappings are the simplest bijections through the lens of coding length (i.e., an upper bound of their Kolmogorov complexity). This property explains why models having such mappings can generalize well. We further show that the simplicity bias is usually an intrinsic property of neural network training via gradient descent. That partially explains why some models spontaneously generalize well when they are trained appropriately.

[AI-95] Enhancing Text Annotation through Rationale-Driven Collaborative Few-Shot Prompting

链接: https://arxiv.org/abs/2409.09615
作者: Jianfei Wu,Xubin Wang,Weijia Jia
关键词-EN: human bias, data annotation process, susceptible to human, complicates the management, management of increasingly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The traditional data annotation process is often labor-intensive, time-consuming, and susceptible to human bias, which complicates the management of increasingly complex datasets. This study explores the potential of large language models (LLMs) as automated data annotators to improve efficiency and consistency in annotation tasks. By employing rationale-driven collaborative few-shot prompting techniques, we aim to improve the performance of LLMs in text annotation. We conduct a rigorous evaluation of six LLMs across four benchmark datasets, comparing seven distinct methodologies. Our results demonstrate that collaborative methods consistently outperform traditional few-shot techniques and other baseline approaches, particularly in complex annotation tasks. Our work provides valuable insights and a robust framework for leveraging collaborative learning methods to tackle challenging text annotation tasks.
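A minimal sketch of the rationale-then-label prompting idea under assumptions: `call_llm` is a placeholder stub standing in for whatever LLM client is used, and the few-shot examples, label set, and majority-vote aggregation are illustrative rather than the paper's exact collaborative protocol:

```python
from collections import Counter

def build_prompt(examples, text):
    """Few-shot prompt that asks for a rationale before the label."""
    shots = "\n\n".join(f"Text: {t}\nRationale: {r}\nLabel: {l}" for t, r, l in examples)
    return f"{shots}\n\nText: {text}\nRationale:"

def call_llm(prompt: str) -> str:
    # Placeholder: a real client (OpenAI, a local model, ...) would go here.
    return "mentions a refund request. Label: complaint"

def annotate(examples, text, n_runs=3):
    prompt = build_prompt(examples, text)
    labels = [call_llm(prompt).rsplit("Label:", 1)[-1].strip() for _ in range(n_runs)]
    return Counter(labels).most_common(1)[0][0]   # majority vote across runs

few_shot = [
    ("The product broke after a day.", "reports a defect", "complaint"),
    ("Great service, thank you!", "expresses satisfaction", "praise"),
]
print(annotate(few_shot, "I want my money back, this never worked."))
```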

[AI-96] Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora

链接: https://arxiv.org/abs/2409.09613
作者: Yungi Kim,Hyunsoo Ha,Sukyung Lee,Jihoo Kim,Seonghoon Yang,Chanjun Park
关键词-EN: efficiently filtering large, filtering large web, large web corpora, train large language, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal computational overhead for resource-constrained environments.
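A minimal sketch of the Good/Bad KenLM contrast, assuming the `kenlm` Python bindings and two pre-trained ARPA models; the file names, length normalization, and threshold are placeholders rather than the paper's configuration:

```python
import kenlm

good_lm = kenlm.Model("good_kenlm.arpa")  # trained on high-quality text (assumed path)
bad_lm = kenlm.Model("bad_kenlm.arpa")    # trained on low-quality text (assumed path)

def keep_document(text: str, threshold: float = 0.0) -> bool:
    """Keep a document the good LM likes noticeably more than the bad LM."""
    n_words = max(len(text.split()), 1)
    good = good_lm.score(text, bos=True, eos=True) / n_words  # per-word log10 prob
    bad = bad_lm.score(text, bos=True, eos=True) / n_words
    return (good - bad) > threshold

print(keep_document("A well formed English sentence about a scientific topic."))
```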

[AI-97] Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition

链接: https://arxiv.org/abs/2409.09611
作者: Cagri Gungor,Adriana Kovashka
关键词-EN: First-person activity recognition, rapidly growing due, First-person activity, background scenes, rapidly growing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.

[AI-98] Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison

链接: https://arxiv.org/abs/2409.09603
作者: Judy Hanwen Shen,Archit Sharma,Jun Qin
关键词-EN: aligning language models, goal of aligning, aligning language, preferences requires data, human preferences requires
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Working Paper

点击查看摘要

Abstract:The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money can be spent carefully collecting and tailoring bespoke preference data to each downstream application. However, in practice, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no existing efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.

[AI-99] A Survey of Foundation Models for Music Understanding

链接: https://arxiv.org/abs/2409.09601
作者: Wenjun Li,Ying Cai,Ziyang Wu,Wenyi Zhang,Yifan Chen,Rundong Qi,Mengqi Dong,Peigen Chen,Xiao Dong,Fenghao Shi,Lei Guo,Junwei Han,Bao Ge,Tianming Liu,Lin Gan,Tuo Zhang
关键词-EN: daily life, connecting us personally, essential in daily, Music, fulfilling emotional
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 20 pages, 2 figures

点击查看摘要

Abstract:Music is essential in daily life, fulfilling emotional and entertainment needs, and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide related services. While the traditional models focused on audio features and simple tasks, the recent development of large language models (LLMs) and foundation models (FMs), which excel in various fields by integrating semantic information and demonstrating strong reasoning abilities, could capture complex musical features and patterns, integrate music with language and incorporate rich musical, emotional and psychological knowledge. Therefore, they have the potential in handling complex music understanding tasks from a semantic perspective, producing outputs closer to human perception. This work, to our best knowledge, is one of the early reviews of the intersection of AI techniques and music understanding. We investigated, analyzed, and tested recent large-scale music foundation models in respect of their music comprehension abilities. We also discussed their limitations and proposed possible future directions, offering insights for researchers in this field.

[AI-100] Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

链接: https://arxiv.org/abs/2409.09598
作者: Brian Thompson,Nitika Mathur,Daniel Deutsch,Huda Khayrallah
关键词-EN: emulates human judgments, Soft Pairwise Accuracy, human judgments, Pairwise Accuracy, automatic metric judgments
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Selecting an automatic metric that best emulates human judgments is often non-trivial, because there is no clear definition of “best emulates.” A meta-metric is required to compare the human judgments to the automatic metric judgments, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric judgments. SPA allows for more fine-grained comparisons between systems than a simplistic binary win/loss, and addresses a number of shortcomings with PA: it is more stable with respect to both the number of systems and segments used for evaluation, it mitigates the issue of metric ties due to quantization, and it produces more statistically significant results. SPA was selected as the official system-level metric for the 2024 WMT metric shared task.
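For reference, a minimal sketch of plain Pairwise Accuracy, the meta-metric SPA builds on; per the abstract, SPA replaces this hard win/loss agreement with significance-aware soft scores, which is not reproduced here, and the system scores below are toy numbers:

```python
from itertools import combinations

def pairwise_accuracy(human_scores, metric_scores):
    """Fraction of system pairs ranked the same way by humans and by the metric."""
    pairs = list(combinations(range(len(human_scores)), 2))
    agree = sum(
        (human_scores[i] - human_scores[j]) * (metric_scores[i] - metric_scores[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)

human = [0.71, 0.65, 0.80]   # toy system-level human scores
metric = [0.32, 0.30, 0.41]  # toy automatic-metric scores
print(pairwise_accuracy(human, metric))
```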

[AI-101] Open-World Test-Time Training: Self-Training with Contrast Learning

链接: https://arxiv.org/abs/2409.09591
作者: Houcheng Su,Mengzhu Wang,Jiao Li,Bingli Wang,Daixian Liu,Zeheng Wang
关键词-EN: Traditional test-time training, consistent class set, real-world scenarios characterized, addressing domain shifts, Traditional test-time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Traditional test-time training (TTT) methods, while addressing domain shifts, often assume a consistent class set, limiting their applicability in real-world scenarios characterized by infinite variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (OOD) data. Existing TTT methods often struggle to maintain performance when confronted with strong OOD data. In OWTTT, the focus has predominantly been on distinguishing between overall strong and weak OOD data. However, during the early stages of TTT, initial feature extraction is hampered by interference from strong OOD and corruptions, resulting in diminished contrast and premature classification of certain classes as strong OOD. To address this, we introduce Open World Dynamic Contrastive Learning (OWDCL), an innovative approach that utilizes contrastive learning to augment positive sample pairs. This strategy not only bolsters contrast in the early stages but also significantly enhances model robustness in subsequent stages. In comparison datasets, our OWDCL model has produced the most advanced performance.

[AI-102] ValueCompass: A Framework of Fundamental Values for Human-AI Alignment

链接: https://arxiv.org/abs/2409.09586
作者: Hua Shen,Tiffany Knearem,Reshmi Ghosh,Yu-Ju Yang,Tanushree Mitra,Yun Huang
关键词-EN: increasingly critical, diverse range, range of individuals, alignment, Choose Own Goals
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:As AI systems become more advanced, ensuring their alignment with a diverse range of individuals and societal values becomes increasingly critical. But how can we capture fundamental human values and assess the degree to which AI systems align with them? We introduce ValueCompass, a framework of fundamental values, grounded in psychological theory and a systematic review, to identify and evaluate human-AI alignment. We apply ValueCompass to measure the value alignment of humans and language models (LMs) across four real-world vignettes: collaborative writing, education, public sectors, and healthcare. Our findings uncover risky misalignment between humans and LMs, such as LMs agreeing with values like “Choose Own Goals”, which are largely disagreed by humans. We also observe values vary across vignettes, underscoring the necessity for context-aware AI alignment strategies. This work provides insights into the design space of human-AI alignment, offering foundations for developing AI that responsibly reflects societal values and ethics.

[AI-103] MindScape Study: Integrating LLM and Behavioral Sensing for Personalized AI-Driven Journaling Experiences

链接: https://arxiv.org/abs/2409.09570
作者: Subigya Nepal,Arvind Pillai,William Campbell,Talie Massachi,Michael V. Heinz,Ashmita Kunwar,Eunsol Soul Choi,Orson Xu,Joanna Kuc,Jeremy Huckins,Jason Holden,Sarah M. Preum,Colin Depp,Nicholas Jacobson,Mary Czerwinski,Eric Granholm,Andrew T. Campbell
关键词-EN: Large Language Models, Mental health concerns, concerns are prevalent, effective interventions, interventions that promote
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2404.00487

点击查看摘要

Abstract:Mental health concerns are prevalent among college students, highlighting the need for effective interventions that promote self-awareness and holistic well-being. MindScape pioneers a novel approach to AI-powered journaling by integrating passively collected behavioral patterns such as conversational engagement, sleep, and location with Large Language Models (LLMs). This integration creates a highly personalized and context-aware journaling experience, enhancing self-awareness and well-being by embedding behavioral intelligence into AI. We present an 8-week exploratory study with 20 college students, demonstrating the MindScape app’s efficacy in enhancing positive affect (7%), reducing negative affect (11%), loneliness (6%), and anxiety and depression, with a significant week-over-week decrease in PHQ-4 scores (-0.25 coefficient), alongside improvements in mindfulness (7%) and self-reflection (6%). The study highlights the advantages of contextual AI journaling, with participants particularly appreciating the tailored prompts and insights provided by the MindScape app. Our analysis also includes a comparison of responses to AI-driven contextual versus generic prompts, participant feedback insights, and proposed strategies for leveraging contextual AI journaling to improve well-being on college campuses. By showcasing the potential of contextual AI journaling to support mental health, we provide a foundation for further investigation into the effects of contextual AI journaling on mental health and well-being.

[AI-104] TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

链接: https://arxiv.org/abs/2409.09564
作者: Dawei Yan,Pengcheng Li,Yang Li,Hao Chen,Qingguo Chen,Weihua Luo,Wei Dong,Qingsen Yan,Haokui Zhang,Chunhua Shen
关键词-EN: achieved promising results, vision encoder, success of vision-language, increasing number, number of researchers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in generating better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our proposed method can bring more benefits to the baseline (LLaVA-1.5) compared with other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.

[AI-105] Evaluating authenticity and quality of image captions via sentiment and semantic analyses

链接: https://arxiv.org/abs/2409.09560
作者: Aleksei Krotov,Alison Tebo,Dylan K. Picart,Aaron Dean Algave
关键词-EN: natural language processing, relies heavily, growth of deep, heavily on huge, huge amounts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growth of deep learning (DL) relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision. Specifically, in image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions. Additionally, learning may be affected by the variety and diversity of the provided captions. While labelling large datasets has largely relied on crowd-sourcing or data-worker pools, evaluating the quality of such training data is crucial. This study proposes an evaluation method focused on sentiment and semantic richness. That method was applied to the COCO-MS dataset, comprising approximately 150K images with segmented objects and corresponding crowd-sourced captions. We employed pre-trained models (Twitter-RoBERTa-base and BERT-base) to extract sentiment scores and variability of semantic embeddings from captions. The relation of the sentiment score and semantic variability with object categories was examined using multiple linear regression. Results indicate that while most captions were neutral, about 6% of the captions exhibited strong sentiment influenced by specific object categories. Semantic variability of within-image captions remained low and uncorrelated with object categories. Model-generated captions showed less than 1.5% of strong sentiment which was not influenced by object categories and did not correlate with the sentiment of the respective human-generated captions. This research demonstrates an approach to assess the quality of crowd- or worker-sourced captions informed by image content.
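A minimal sketch of caption sentiment scoring in the spirit of the abstract, using the public CardiffNLP Twitter-RoBERTa checkpoint via the transformers pipeline; the exact checkpoint and any preprocessing the authors applied are assumptions here:

```python
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

captions = [
    "A dog happily catching a frisbee in the park.",
    "A man standing next to a parked car.",
]
for caption, result in zip(captions, sentiment(captions)):
    print(f"{caption} -> {result['label']} ({result['score']:.3f})")
```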

[AI-106] Enhancing Printed Circuit Board Defect Detection through Ensemble Learning

链接: https://arxiv.org/abs/2409.09555
作者: Ka Nam Canaan Law,Mingshuo Yu,Lianglei Zhang,Yiyi Zhang,Peng Xu,Jerry Gao,Jun Liu
关键词-EN: printed circuit boards, electronic device technology, advancing electronic device, circuit boards, device technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The quality control of printed circuit boards (PCBs) is paramount in advancing electronic device technology. While numerous machine learning methodologies have been utilized to augment defect detection efficiency and accuracy, previous studies have predominantly focused on optimizing individual models for specific defect types, often overlooking the potential synergies between different approaches. This paper introduces a comprehensive inspection framework leveraging an ensemble learning strategy to address this gap. Initially, we utilize four distinct PCB defect detection models utilizing state-of-the-art methods: EfficientDet, MobileNet SSDv2, Faster RCNN, and YOLOv5. Each method is capable of identifying PCB defects independently. Subsequently, we integrate these models into an ensemble learning framework to enhance detection performance. A comparative analysis reveals that our ensemble learning framework significantly outperforms individual methods, achieving a 95% accuracy in detecting diverse PCB defects. These findings underscore the efficacy of our proposed ensemble learning framework in enhancing PCB quality control processes.

[AI-107] COMFORT: A Continual Fine-Tuning Framework for Foundation Models Targeted at Consumer Healthcare

链接: https://arxiv.org/abs/2409.09549
作者: Chia-Hao Li,Niraj K. Jha
关键词-EN: Wearable medical sensors, Wearable medical, revolutionizing smart healthcare, medical sensors, enabling continuous
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 25 pages, 10 figures. This work has been submitted to the ACM for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Wearable medical sensors (WMSs) are revolutionizing smart healthcare by enabling continuous, real-time monitoring of user physiological signals, especially in the field of consumer healthcare. The integration of WMSs and modern machine learning (ML) enables unprecedented solutions to efficient early-stage disease detection. Despite the success of Transformers in various fields, their application to sensitive domains, such as smart healthcare, remains underexplored due to limited data accessibility and privacy concerns. To bridge the gap between Transformer-based foundation models and WMS-based disease detection, we propose COMFORT, a continual fine-tuning framework for foundation models targeted at consumer healthcare. COMFORT introduces a novel approach for pre-training a Transformer-based foundation model on a large dataset of physiological signals exclusively collected from healthy individuals with commercially available WMSs. We adopt a masked data modeling (MDM) objective to pre-train this health foundation model. We then fine-tune the model using various parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA) and its variants, to adapt it to various downstream disease detection tasks that rely on WMS data. In addition, COMFORT continually stores the low-rank decomposition matrices obtained from the PEFT algorithms to construct a library for multi-disease detection. The COMFORT library enables scalable and memory-efficient disease detection on edge devices. Our experimental results demonstrate that COMFORT achieves highly competitive performance while reducing memory overhead by up to 52% relative to conventional methods. Thus, COMFORT paves the way for personalized and proactive solutions to efficient and effective early-stage disease detection for consumer healthcare.
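A minimal sketch of the LoRA-style parameter-efficient fine-tuning step mentioned in the abstract, using the PEFT library on a placeholder text backbone; the backbone, target module names, and LoRA hyperparameters are assumptions, not the COMFORT configuration (which fine-tunes a foundation model pre-trained on wearable-sensor signals):

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2   # placeholder backbone and task
)
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_lin", "v_lin"],        # DistilBERT attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()            # only the adapters (and head) are trainable
```

The low-rank matrices produced by such a step are what a library-style system like the one described could store per downstream disease-detection task.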

[AI-108] Autonomous Goal Detection and Cessation in Reinforcement Learning: A Case Study on Source Term Estimation

链接: https://arxiv.org/abs/2409.09541
作者: Yiwei Shi,Muning Wen,Qi Zhang,Weinan Zhang,Cunjia Liu,Weiru Liu
关键词-EN: Reinforcement Learning, revolutionized decision-making processes, clear feedback signals, Source Term Estimation, Learning has revolutionized
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning has revolutionized decision-making processes in dynamic environments, yet it often struggles with autonomously detecting and achieving goals without clear feedback signals. For example, in a Source Term Estimation problem, the lack of precise environmental information makes it challenging to provide clear feedback signals and to define and evaluate how the source’s location is determined. To address this challenge, the Autonomous Goal Detection and Cessation (AGDC) module was developed, enhancing various RL algorithms by incorporating a self-feedback mechanism for autonomous goal detection and cessation upon task completion. Our method effectively identifies and ceases undefined goals by approximating the agent’s belief, significantly enhancing the capabilities of RL algorithms in environments with limited feedback. To validate effectiveness of our approach, we integrated AGDC with deep Q-Network, proximal policy optimization, and deep deterministic policy gradient algorithms, and evaluated its performance on the Source Term Estimation problem. The experimental results showed that AGDC-enhanced RL algorithms significantly outperformed traditional statistical methods such as infotaxis, entrotaxis, and dual control for exploitation and exploration, as well as a non-statistical random action selection method. These improvements were evident in terms of success rate, mean traveled distance, and search time, highlighting AGDC’s effectiveness and efficiency in complex, real-world scenarios.

[AI-109] VernaCopter: Disambiguated Natural-Language-Driven Robot via Formal Specifications

链接: https://arxiv.org/abs/2409.09536
作者: Teun van de Laar,Zengjie Zhang,Shuhao Qi,Sofie Haesaert,Zhiyong Sun
关键词-EN: natural language, large language models, complex task, language models, planner
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:It has been an ambition of many to control a robot for a complex task using natural language (NL). The rise of large language models (LLMs) makes it closer to coming true. However, an LLM-powered system still suffers from the ambiguity inherent in an NL and the uncertainty brought up by LLMs. This paper proposes a novel LLM-based robot motion planner, named VernaCopter, with signal temporal logic (STL) specifications serving as a bridge between NL commands and specific task objectives. The rigorous and abstract nature of formal specifications allows the planner to generate high-quality and highly consistent paths to guide the motion control of a robot. Compared to a conventional NL-prompting-based planner, the proposed VernaCopter planner is more stable and reliable due to less ambiguous uncertainty. Its efficacy and advantage have been validated by two small but challenging experimental scenarios, implying its potential in designing NL-driven robots.

[AI-110] Enhancing Skin Disease Diagnosis: Interpretable Visual Concept Discovery with SAM Empowerment

链接: https://arxiv.org/abs/2409.09520
作者: Xin Hu,Janet Wang,Jihun Hamm,Rie R Yotsu,Zhengming Ding
关键词-EN: deep learning architectures, Current AI-assisted skin, Current AI-assisted, achieved dermatologist-level performance, classifying skin cancer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current AI-assisted skin image diagnosis has achieved dermatologist-level performance in classifying skin cancer, driven by rapid advancements in deep learning architectures. However, unlike traditional vision tasks, skin images in general present unique challenges due to the limited availability of well-annotated datasets, complex variations in conditions, and the necessity for detailed interpretations to ensure patient safety. Previous segmentation methods have sought to reduce image noise and enhance diagnostic performance, but these techniques require fine-grained, pixel-level ground truth masks for training. In contrast, with the rise of foundation models, the Segment Anything Model (SAM) has been introduced to facilitate promptable segmentation, enabling the automation of the segmentation process with simple yet effective prompts. Efforts applying SAM predominantly focus on dermatoscopy images, which present more easily identifiable lesion boundaries than clinical photos taken with smartphones. This limitation constrains the practicality of these approaches to real-world applications. To overcome the challenges posed by noisy clinical photos acquired via non-standardized protocols and to improve diagnostic accessibility, we propose a novel Cross-Attentive Fusion framework for interpretable skin lesion diagnosis. Our method leverages SAM to generate visual concepts for skin diseases using prompts, integrating local visual concepts with global image features to enhance model performance. Extensive evaluation on two skin disease datasets demonstrates our proposed method’s effectiveness on lesion diagnosis and interpretability.

[AI-111] Deep Learning Under Siege: Identifying Security Vulnerabilities and Risk Mitigation Strategies

链接: https://arxiv.org/abs/2409.09517
作者: Jamal Al-Karaki,Muhammad Al-Zafar Khan,Mostafa Mohamad,Dababrata Chowdhury
关键词-EN: Deep Learning, adoption of Deep, aspects of society, wholesale adoption, unique set
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 10 pages, 1 table, 6 equations/metrics

点击查看摘要

Abstract:With the rise in the wholesale adoption of Deep Learning (DL) models in nearly all aspects of society, a unique set of challenges is imposed. Primarily centered around the architectures of these models, these risks pose a significant challenge, and addressing these challenges is key to their successful implementation and usage in the future. In this research, we present the security challenges associated with the current DL models deployed into production, as well as anticipate the challenges of future DL technologies based on the advancements in computing, AI, and hardware technologies. In addition, we propose risk mitigation techniques to inhibit these challenges and provide metrical evaluations to measure the effectiveness of these metrics.

[AI-112] Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens AAAI

链接: https://arxiv.org/abs/2409.09513
作者: Joseph Clinton,Robert Lieck
关键词-EN: Supervised learning approaches, Decision Transformer, offline reinforcement learning, Supervised learning, utilizing the Decision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 5 figures, Submitted to AAAI

点击查看摘要

Abstract:Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent’s future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model’s policy through the interpretable plan visualisations and attention map.

[AI-113] Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features

链接: https://arxiv.org/abs/2409.09511
作者: Satvik Dixit,Daniel M. Low,Gasser Elbanna,Fabio Catania,Satrajit S. Ghosh
关键词-EN: consistently shown superior, shown superior performance, Pre-trained deep learning, acoustic features, Pre-trained deep
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Pre-trained deep learning embeddings have consistently shown superior performance over handcrafted acoustic features in speech emotion recognition (SER). However, unlike acoustic features with clear physical meaning, these embeddings lack clear interpretability. Explaining these embeddings is crucial for building trust in healthcare and security applications and advancing the scientific understanding of the acoustic information that is encoded in them. This paper proposes a modified probing approach to explain deep learning embeddings in the SER space. We predict interpretable acoustic features (e.g., f0, loudness) from (i) the complete set of embeddings and (ii) a subset of the embedding dimensions identified as most important for predicting each emotion. If the subset of the most important dimensions better predicts a given emotion than all dimensions and also predicts specific acoustic features more accurately, we infer those acoustic features are important for the embedding model for the given task. We conducted experiments using the WavLM embeddings and eGeMAPS acoustic features as audio representations, applying our method to the RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we demonstrate that Energy, Frequency, Spectral, and Temporal categories of acoustic features provide diminishing information to SER in that order, demonstrating the utility of the probing classifier method to relate embeddings to interpretable acoustic features.
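A minimal sketch of the probing idea on synthetic data: fit a linear probe from embeddings to an acoustic feature once with all dimensions and once with a chosen subset of "important" dimensions, then compare fit quality. The data, target, and subset below are stand-ins, not WavLM embeddings or eGeMAPS features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))          # stand-in for WavLM embeddings
loudness = embeddings[:, :10].sum(axis=1) + rng.normal(scale=0.1, size=200)  # toy target
important_dims = list(range(10))                  # stand-in for emotion-relevant dims

full = cross_val_score(Ridge(), embeddings, loudness, cv=5, scoring="r2").mean()
subset = cross_val_score(Ridge(), embeddings[:, important_dims], loudness, cv=5,
                         scoring="r2").mean()
print(f"R^2 with all dims: {full:.3f} | with important dims only: {subset:.3f}")
```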

[AI-114] ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

链接: https://arxiv.org/abs/2409.09506
作者: Masao Someki,Kwanghee Choi,Siddhant Arora,William Chen,Samuele Cornell,Jionghao Han,Yifan Peng,Jiatong Shi,Vaibhav Srivastav,Shinji Watanabe
关键词-EN: processing toolkit ESPnet, open-source speech processing, speech processing toolkit, Hugging Face transformers, aimed at quick
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted to SLT 2024

点击查看摘要

Abstract:We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ, compared to ESPnet, reduces the number of newly written code by 2.7x and the amount of dependent code by 6.7x while dramatically reducing the Bash script dependencies. The codebase of ESPnet-EZ is publicly available.

[AI-115] Synthetic4Health: Generating Annotated Synthetic Clinical Letters

链接: https://arxiv.org/abs/2409.09501
作者: Libo Ren,Samuel Belkadi,Lifeng Han,Warren Del-Pinto,Goran Nenadic
关键词-EN: medical research, clinical-related datasets, widely applied, clinical, Named Entity Recognition
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: ongoing work, 48 pages

点击查看摘要

Abstract:Since clinical letters contain sensitive information, clinical-related datasets can not be widely applied in model training, medical research, and teaching. This work aims to generate reliable, various, and de-identified synthetic clinical letters. To achieve this goal, we explored different pre-trained language models (PLMs) for masking and generating text. After that, we worked on Bio_ClinicalBERT, a high-performing model, and experimented with different masking strategies. Both qualitative and quantitative methods were used for evaluation. Additionally, a downstream task, Named Entity Recognition (NER), was also implemented to assess the usability of these synthetic letters. The results indicate that 1) encoder-only models outperform encoder-decoder models. 2) Among encoder-only models, those trained on general corpora perform comparably to those trained on clinical data when clinical information is preserved. 3) Additionally, preserving clinical entities and document structure better aligns with our objectives than simply fine-tuning the model. 4) Furthermore, different masking strategies can impact the quality of synthetic clinical letters. Masking stopwords has a positive impact, while masking nouns or verbs has a negative effect. 5) For evaluation, BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references. 6) Contextual information does not significantly impact the models’ understanding, so the synthetic clinical letters have the potential to replace the original ones in downstream tasks.
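A minimal sketch of the stopword-masking strategy the abstract reports as helpful, assuming the public Bio_ClinicalBERT checkpoint and a toy sentence; the stopword list and the one-token-at-a-time refill loop are simplifications, not the authors' pipeline:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")

sentence = "The patient was admitted with chest pain and shortness of breath."
stopwords = {"the", "was", "with", "and", "of"}
tokens = sentence.split()

for i, tok in enumerate(tokens):
    if tok.lower().strip(".,") in stopwords:
        masked = tokens.copy()
        masked[i] = fill.tokenizer.mask_token      # e.g. [MASK]
        best = fill(" ".join(masked))[0]           # top-1 replacement for this position
        tokens[i] = best["token_str"].strip()

print(" ".join(tokens))                            # a lightly "re-synthesized" sentence
```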

[AI-116] Multi-Scale Grouped Prototypes for Interpretable Semantic Segmentation

链接: https://arxiv.org/abs/2409.09497
作者: Hugo Porta,Emanuele Dalsasso,Diego Marcos,Devis Tuia
关键词-EN: Prototypical part learning, making semantic segmentation, promising approach, approach for making, Prototypical part
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity between parts of the test image and the prototypes. This improves interpretability since the user can inspect the link between the predicted output and the patterns learned by the model in terms of prototypical information. In this paper, we propose a method for interpretable semantic segmentation that leverages multi-scale image representation for prototypical part learning. First, we introduce a prototype layer that explicitly learns diverse prototypical parts at several scales, leading to multi-scale representations in the prototype activation output. Then, we propose a sparse grouping mechanism that produces multi-scale sparse groups of these scale-specific prototypical parts. This provides a deeper understanding of the interactions between multi-scale object representations while enhancing the interpretability of the segmentation model. The experiments conducted on Pascal VOC, Cityscapes, and ADE20K demonstrate that the proposed method increases model sparsity, improves interpretability over existing prototype-based methods, and narrows the performance gap with the non-interpretable counterpart models. Code is available at this http URL.

[AI-117] Hacking The Lazy Way: LLM Augmented Pentesting

链接: https://arxiv.org/abs/2409.09493
作者: Dhruva Goyal,Sitaraman Subramanian,Aditya Peela
关键词-EN: rapidly evolving cybersecurity, continually challenged, stay current, current with rapidly, rapidly evolving
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 9 pages, 7 figures

点击查看摘要

Abstract:Security researchers are continually challenged by the need to stay current with rapidly evolving cybersecurity research, tools, and techniques. This constant cycle of learning, unlearning, and relearning, combined with the repetitive tasks of sifting through documentation and analyzing data, often hinders productivity and innovation. This has led to a disparity where only organizations with substantial resources can access top-tier security experts, while others rely on firms with less skilled researchers who focus primarily on compliance rather than actual security. We introduce “LLM Augmented Pentesting,” demonstrated through a tool named “Pentest Copilot,” to address this gap. This approach integrates Large Language Models into penetration testing workflows. Our research includes a “chain of thought” mechanism to streamline token usage and boost performance, as well as unique Retrieval Augmented Generation implementation to minimize hallucinations and keep models aligned with the latest techniques. Additionally, we propose a novel file analysis approach, enabling LLMs to understand files. Furthermore, we highlight a unique infrastructure system that, if implemented, can support in-browser assisted penetration testing, offering a robust platform for cybersecurity professionals. These advancements mark a significant step toward bridging the gap between automated tools and human expertise, offering a powerful solution to the challenges faced by modern cybersecurity teams.

[AI-118] Enumerating Minimal Unsatisfiable Cores of LTLf formulas

链接: https://arxiv.org/abs/2409.09485
作者: Antonio Ielo,Giuseppe Mazzotta,Rafael Peñaloza,Francesco Ricca
关键词-EN: Linear Temporal Logic, Linear Temporal, Temporal Logic, LTL, Logic over finite
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Linear Temporal Logic over finite traces (LTLf) is a widely used formalism with applications in AI, process mining, model checking, and more. The primary reasoning task for LTLf is satisfiability checking; yet, the recent focus on explainable AI has increased interest in analyzing inconsistent formulas, making the enumeration of minimal explanations for infeasibility a relevant task also for LTLf. This paper introduces a novel technique for enumerating minimal unsatisfiable cores (MUCs) of an LTLf specification. The main idea is to encode an LTLf formula into an Answer Set Programming (ASP) specification, such that the minimal unsatisfiable subsets (MUSes) of the ASP program directly correspond to the MUCs of the original LTLf specification. Leveraging recent advancements in ASP solving yields a MUC enumerator achieving good performance in experiments conducted on established benchmarks from the literature.
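A conceptual sketch only: brute-force enumeration of minimal unsatisfiable cores given an arbitrary satisfiability oracle, shown on a toy interval-constraint problem. The paper's actual contribution is the LTLf-to-ASP encoding whose MUSes map to MUCs; none of that encoding is reproduced here:

```python
from itertools import combinations

def enumerate_mucs(constraints, is_satisfiable):
    """All minimal unsatisfiable subsets, found by increasing subset size."""
    mucs = []
    for size in range(1, len(constraints) + 1):
        for subset in combinations(constraints, size):
            if not is_satisfiable(subset) and not any(set(m) <= set(subset) for m in mucs):
                mucs.append(subset)
    return mucs

# Toy oracle: constraints are intervals [lo, hi]; a set is satisfiable iff they overlap.
def overlap(subset):
    return max(c[0] for c in subset) <= min(c[1] for c in subset)

constraints = [(0, 5), (3, 10), (6, 8)]   # (0, 5) and (6, 8) jointly conflict
print(enumerate_mucs(constraints, overlap))
```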

[AI-119] TX-Gen: Multi-Objective Optimization for Sparse Counterfactual Explanations for Time-Series Classification

链接: https://arxiv.org/abs/2409.09461
作者: Qi Huang,Sofoklis Kitharidis,Thomas Bäck,Niki van Stein
关键词-EN: understanding model decisions, healthcare and finance, decisions is crucial, application in high-stakes, high-stakes domains
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: Preprint, under review

点击查看摘要

Abstract:In time-series classification, understanding model decisions is crucial for their application in high-stakes domains such as healthcare and finance. Counterfactual explanations, which provide insights by presenting alternative inputs that change model predictions, offer a promising solution. However, existing methods for generating counterfactual explanations for time-series data often struggle with balancing key objectives like proximity, sparsity, and validity. In this paper, we introduce TX-Gen, a novel algorithm for generating counterfactual explanations based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II). TX-Gen leverages evolutionary multi-objective optimization to find a diverse set of counterfactuals that are both sparse and valid, while maintaining minimal dissimilarity to the original time series. By incorporating a flexible reference-guided mechanism, our method improves the plausibility and interpretability of the counterfactuals without relying on predefined assumptions. Extensive experiments on benchmark datasets demonstrate that TX-Gen outperforms existing methods in generating high-quality counterfactuals, making time-series models more transparent and interpretable.
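A minimal sketch of the objective side of the method under assumptions: score candidate counterfactuals on proximity, sparsity, and validity and keep the non-dominated ones. The real method evolves candidates with NSGA-II and a reference-guided mechanism; here candidates are just random perturbations and the classifier is a toy threshold:

```python
import numpy as np

def objectives(x, cf, predict, target_class):
    proximity = float(np.linalg.norm(cf - x))               # lower is better
    sparsity = float(np.count_nonzero(cf != x))             # fewer changed steps is better
    validity = 0.0 if predict(cf) == target_class else 1.0  # 0 means prediction flipped
    return (proximity, sparsity, validity)

def pareto_front(scored):
    """Indices of non-dominated score tuples (minimization)."""
    front = []
    for i, si in enumerate(scored):
        dominated = any(all(a <= b for a, b in zip(sj, si)) and sj != si
                        for j, sj in enumerate(scored) if j != i)
        if not dominated:
            front.append(i)
    return front

rng = np.random.default_rng(0)
x = rng.normal(size=32)                                # toy univariate time series
predict = lambda ts: int(ts.mean() > 0)                # toy "classifier"
candidates = [x + rng.normal(scale=0.3, size=32) * (rng.random(32) < 0.2) for _ in range(50)]
scores = [objectives(x, cf, predict, target_class=1 - predict(x)) for cf in candidates]
print("Pareto-optimal candidate indices:", pareto_front(scores))
```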

[AI-120] MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction

链接: https://arxiv.org/abs/2409.09446
作者: Yan Feng,Alexander Carballo,Keisuke Fujii,Robin Karlsson,Ming Ding,Kazuya Takeda
关键词-EN: autonomous driving, great significance, Pedestrian action prediction, concepts, Pedestrian action
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pedestrian action prediction is of great significance for many applications such as autonomous driving. However, state-of-the-art methods lack explainability to make trustworthy predictions. In this paper, a novel framework called MulCPred is proposed that explains its predictions based on multi-modal concepts represented by training samples. Previous concept-based methods have limitations including: 1) they cannot directly apply to multi-modal cases; 2) they lack locality to attend to details in the inputs; 3) they suffer from mode collapse. These limitations are tackled accordingly through the following approaches: 1) a linear aggregator to integrate the activation results of the concepts into predictions, which associates concepts of different modalities and provides ante-hoc explanations of the relevance between the concepts and the predictions; 2) a channel-wise recalibration module that attends to local spatiotemporal regions, which enables the concepts with locality; 3) a feature regularization loss that encourages the concepts to learn diverse patterns. MulCPred is evaluated on multiple datasets and tasks. Both qualitative and quantitative results demonstrate that MulCPred is promising in improving the explainability of pedestrian action prediction without obvious performance degradation. Furthermore, by removing unrecognizable concepts from MulCPred, the cross-dataset prediction performance is improved, indicating the feasibility of further generalizability of MulCPred.

[AI-121] NBBOX: Noisy Bounding Box Improves Remote Sensing Object Detection

链接: https://arxiv.org/abs/2409.09424
作者: Yechan Kim,SooYeon Kim,Moongu Jeon
关键词-EN: insufficient data, bounding box, significant advancements, advancements in computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data augmentation has seen significant advancements in computer vision to improve model performance over the years, particularly in scenarios with limited and insufficient data. Currently, most studies focus on adjusting the image or its features to expand the size, quality, and variety of samples during training in various tasks including object detection. However, we argue that it is necessary to investigate bounding box transformations as a model regularization technique rather than image-level transformations, especially in aerial imagery due to potentially inconsistent bounding box annotations. Hence, this letter presents a thorough investigation of bounding box transformation in terms of scaling, rotation, and translation for remote sensing object detection. We call this augmentation strategy NBBOX (Noise Injection into Bounding Box). We conduct extensive experiments on DOTA and DIOR-R, both well-known datasets that include a variety of rotated generic objects in aerial images. Experimental results show that our approach significantly improves remote sensing object detection without bells and whistles, and it is more time-efficient than other state-of-the-art augmentation strategies.

[AI-122] Distributed Clustering based on Distributional Kernel

链接: https://arxiv.org/abs/2409.09418
作者: Hang Zhang,Yang Xu,Lei Gong,Ye Zhu,Kai Ming Ting
关键词-EN: final clusters based, Distributed Clustering based, clustering, produces the final, similarity with respect
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces Distributed Clustering based on Distributional Kernel (KDC), a new framework for clustering in a distributed network that produces the final clusters based on the similarity with respect to the distributions of the initial clusters, as measured by a distributional kernel K. It is the only framework that satisfies all three of the following properties. First, KDC guarantees that the combined clustering outcome from all sites is equivalent to the clustering outcome of its centralized counterpart from the combined dataset from all sites. Second, the maximum runtime cost of any site in distributed mode is smaller than the runtime cost in centralized mode. Third, it is designed to discover clusters of arbitrary shapes, sizes and densities. To the best of our knowledge, this is the first distributed clustering framework that employs a distributional kernel. The distribution-based clustering leads directly to significantly better clustering outcomes than existing methods of distributed clustering. In addition, we introduce a new clustering algorithm called Kernel Bounded Cluster Cores, which is the best-performing clustering algorithm to use within KDC among existing clustering algorithms. We also show that KDC is a generic framework that enables a quadratic time clustering algorithm to deal with large datasets that would otherwise be impossible.
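
As a loose illustration of distribution-level similarity (using an RBF kernel mean embedding as a stand-in for the paper's distributional kernel K), initial clusters from different sites can be compared and merged according to how similar their distributions are, rather than by centroid distance:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def distribution_similarity(cluster_a, cluster_b, gamma=1.0):
    """Mean pairwise kernel value between two point sets, i.e. the inner
    product of their kernel mean embeddings (an illustrative stand-in
    for a distributional kernel K(P_a, P_b))."""
    return rbf_kernel(cluster_a, cluster_b, gamma).mean()

# Initial clusters produced locally at different sites
rng = np.random.default_rng(1)
site1_cluster = rng.normal(0.0, 0.5, size=(50, 2))
site2_cluster = rng.normal(0.1, 0.5, size=(60, 2))
far_cluster   = rng.normal(5.0, 0.5, size=(40, 2))
print(distribution_similarity(site1_cluster, site2_cluster))  # high -> merge
print(distribution_similarity(site1_cluster, far_cluster))    # low  -> keep separate
```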

[AI-123] Enhancing LLM Problem Solving with REAP: Reflection Explicit Problem Deconstruction and Advanced Prompting

链接: https://arxiv.org/abs/2409.09415
作者: Ryan Lingo,Martin Arroyo,Rajeev Chhajer
关键词-EN: Large Language Models, natural language processing, transformed natural language, Large Language, Explicit Problem Deconstruction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 524 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed natural language processing, yet improving their problem-solving capabilities, particularly for complex, reasoning-intensive tasks, remains a persistent challenge. This paper introduces the REAP (Reflection, Explicit Problem Deconstruction, and Advanced Prompting) method, an innovative approach within the dynamic context generation framework. REAP guides LLMs through reflection on the query, deconstructing it into manageable components, and generating relevant context to enhance the solution process. We evaluated REAP using a dataset designed to expose LLM limitations, comparing zero-shot prompting with REAP-enhanced prompts across six state-of-the-art models: OpenAI’s o1-preview, o1-mini, GPT-4o, GPT-4o-mini, Google’s Gemini 1.5 Pro, and Claude 3.5 Sonnet. The results demonstrate notable performance gains, with o1-mini improving by 40.97%, GPT-4o by 66.26%, and GPT-4o-mini by 112.93%. Despite the already strong baseline performance of OpenAI’s o1-preview, modest gains were observed. Beyond performance improvements, REAP offers a cost-effective solution; for example, GPT-4o-mini, which is approximately 100 times cheaper than o1-preview, delivered competitive results. REAP also improves the clarity of model outputs, making it easier for humans to understand the reasoning behind the results and simplifying the process of identifying and addressing any issues. These findings demonstrate REAP’s potential to greatly improve the capabilities of LLMs, providing both better performance and increased cost-efficiency across a wide range of applications.

[AI-124] Constructive Approach to Bidirectional Causation between Qualia Structure and Language Emergence

链接: https://arxiv.org/abs/2409.09413
作者: Tadahiro Taniguchi,Masafumi Oizumi,Noburo Saji,Takato Horii,Naotsugu Tsuchiya
关键词-EN: termed qualia structure, language emergence, language, internal representations, relational structure
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 20 pages, 4 Figures

点击查看摘要

Abstract:This paper presents a novel perspective on the bidirectional causation between language emergence and relational structure of subjective experiences, termed qualia structure, and lays out the constructive approach to the intricate dependency between the two. We hypothesize that languages with distributional semantics, e.g., syntactic-semantic structures, may have emerged through the process of aligning internal representations among individuals, and such alignment of internal representations facilitates more structured language. This mutual dependency is suggested by the recent advancements in AI and symbol emergence robotics, and collective predictive coding (CPC) hypothesis, in particular. Computational studies show that neural network-based language models form systematically structured internal representations, and multimodal language models can share representations between language and perceptual information. This perspective suggests that language emergence serves not only as a mechanism creating a communication tool but also as a mechanism for allowing people to realize shared understanding of qualitative experiences. The paper discusses the implications of this bidirectional causation in the context of consciousness studies, linguistics, and cognitive science, and outlines future constructive research directions to further explore this dynamic relationship between language emergence and qualia structure.

[AI-125] Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations

链接: https://arxiv.org/abs/2409.09412
作者: David Tschirschwitz,Volker Rodehorst
关键词-EN: machine learning models, training of machine, machine learning, label convergence, Label
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Annotation errors are a challenge not only during training of machine learning models, but also during their evaluation. Label variations and inaccuracies in datasets often manifest as contradictory examples that deviate from established labeling conventions. Such inconsistencies, when significant, prevent models from achieving optimal performance on metrics such as mean Average Precision (mAP). We introduce the notion of “label convergence” to describe the highest achievable performance under the constraint of contradictory test annotations, essentially defining an upper bound on model accuracy. Recognizing that noise is an inherent characteristic of all data, our study analyzes five real-world datasets, including the LVIS dataset, to investigate the phenomenon of label convergence. We approximate that label convergence is between 62.63-67.52 mAP@[0.5:0.95:0.05] for LVIS with 95% confidence, attributing these bounds to the presence of real annotation errors. With current state-of-the-art (SOTA) models at the upper end of the label convergence interval for the well-studied LVIS dataset, we conclude that model capacity is sufficient to solve current object detection problems. Therefore, future efforts should focus on three key aspects: (1) updating the problem specification and adjusting evaluation practices to account for unavoidable label noise, (2) creating cleaner data, especially test data, and (3) including multi-annotated data to investigate annotation variation and make these issues visible from the outset.

[AI-126] Real-world Adversarial Defense against Patch Attacks based on Diffusion Model

链接: https://arxiv.org/abs/2409.09406
作者: Xingxing Wei,Caixin Kang,Yinpeng Dong,Zhengyi Wang,Shouwei Ruan,Yubo Chen,Hang Su
关键词-EN: deep learning models, diffusion model, Adversarial patches present, present significant challenges, Adversarial Anomaly Perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial patches present significant challenges to the robustness of deep learning models, making the development of effective defenses critical for real-world applications. This paper introduces DIFFender, a novel DIFfusion-based DeFender framework that leverages the power of a text-guided diffusion model to counter adversarial patch attacks. At the core of our approach is the discovery of the Adversarial Anomaly Perception (AAP) phenomenon, which enables the diffusion model to accurately detect and locate adversarial patches by analyzing distributional anomalies. DIFFender seamlessly integrates the tasks of patch localization and restoration within a unified diffusion model framework, enhancing defense efficacy through their close interaction. Additionally, DIFFender employs an efficient few-shot prompt-tuning algorithm, facilitating the adaptation of the pre-trained diffusion model to defense tasks without the need for extensive retraining. Our comprehensive evaluation, covering image classification and face recognition tasks, as well as real-world scenarios, demonstrates DIFFender’s robust performance against adversarial attacks. The framework’s versatility and generalizability across various settings, classifiers, and attack methodologies mark a significant advancement in adversarial patch defense strategies. Beyond the popular visible domain, we have identified another advantage of DIFFender: its capability to easily extend to the infrared domain. Consequently, we demonstrate the flexibility of DIFFender, which can defend against both infrared and visible adversarial patch attacks within a single, universal defense framework.

[AI-127] AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction

链接: https://arxiv.org/abs/2409.09403
作者: Tianlong Xu,Yi-Fan Zhang,Zhendong Chu,Shen Wang,Qingsong Wen
关键词-EN: solving mathematical problems, frequently make mistakes, Students frequently make, textbf, mathematical problems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Students frequently make mistakes while solving mathematical problems, and traditional error correction methods are both time-consuming and labor-intensive. This paper introduces an innovative Virtual AI Teacher system designed to autonomously analyze and correct student Errors (VATE). Leveraging advanced large language models (LLMs), the system uses student drafts as a primary source for error analysis, which enhances understanding of the student’s learning process. It incorporates sophisticated prompt engineering and maintains an error pool to reduce computational overhead. The AI-driven system also features a real-time dialogue component for efficient student interaction. Our approach demonstrates significant advantages over traditional and machine learning-based error correction methods, including reduced educational costs, high scalability, and superior generalizability. The system has been deployed on the Squirrel AI learning platform for elementary mathematics education, where it achieves 78.3% accuracy in error analysis and shows a marked improvement in student learning efficiency. Satisfaction surveys indicate a strong positive reception, highlighting the system’s potential to transform educational practices.

[AI-128] AMBER – Advanced SegFormer for Multi-Band Image Segmentation: an application to Hyperspectral Imaging

链接: https://arxiv.org/abs/2409.09386
作者: Andrea Dosi,Massimo Brescia,Stefano Cavuoti,Mariarca D’Aniello,Michele Delli Veneri,Carlo Donadio,Adriano Ettari,Giuseppe Longo,Alvi Rownok,Luca Sannino,Maria Zampella
关键词-EN: Deep learning, enabling the extraction, learning has revolutionized, revolutionized the field, extraction of complex
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: submitted to Neural Computing Applications (Springer). Currently under review

点击查看摘要

Abstract:Deep learning has revolutionized the field of hyperspectral image (HSI) analysis, enabling the extraction of complex and hierarchical features. While convolutional neural networks (CNNs) have been the backbone of HSI classification, their limitations in capturing global contextual features have led to the exploration of Vision Transformers (ViTs). This paper introduces AMBER, an advanced SegFormer specifically designed for multi-band image segmentation. AMBER enhances the original SegFormer by incorporating three-dimensional convolutions to handle hyperspectral data. Our experiments, conducted on the Indian Pines, Pavia University, and PRISMA datasets, show that AMBER outperforms traditional CNN-based methods in terms of Overall Accuracy, Kappa coefficient, and Average Accuracy on the first two datasets, and achieves state-of-the-art performance on the PRISMA dataset.

[AI-129] LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach

链接: https://arxiv.org/abs/2409.09383
作者: Kunlong Chen,Junjun Wang,Zhaoqun Chen,Kunjin Chen,Yitian Chen
关键词-EN: KDD CUP, source tracing competition, paper source tracing, tracing competition, source tracing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We participated in the KDD CUP 2024 paper source tracing competition and achieved the 3rd place. This competition tasked participants with identifying the reference sources (i.e., ref-sources, as referred to by the organizers of the competition) of given academic papers. Unlike most teams that addressed this challenge by fine-tuning pre-trained neural language models such as BERT or ChatGLM, our primary approach utilized closed-source large language models (LLMs). With recent advancements in LLM technology, closed-source LLMs have demonstrated the capability to tackle complex reasoning tasks in zero-shot or few-shot scenarios. Consequently, in the absence of GPUs, we employed closed-source LLMs to directly generate predicted reference sources from the provided papers. We further refined these predictions through ensemble learning. Notably, our method was the only one among the award-winning approaches that did not require the use of GPUs for model training. Code available at this https URL.
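
The GPU-free recipe described above boils down to prompting one or more closed-source LLMs per paper and aggregating their answers. Below is a hypothetical sketch of that ensemble step; the `ask_llm` call and the prompt format are placeholders, not the competition code.

```python
from collections import Counter

def ask_llm(model_name, prompt):
    # Placeholder for a closed-source LLM API call; replace with a real client.
    raise NotImplementedError

def predict_ref_sources(paper_text, candidate_refs, models):
    """Ask each model to pick the reference sources, then majority-vote."""
    votes = Counter()
    prompt = (
        "Given the paper below, list the indices of the references that are "
        f"its true sources.\n\nPAPER:\n{paper_text}\n\nREFERENCES:\n"
        + "\n".join(f"[{i}] {r}" for i, r in enumerate(candidate_refs))
    )
    for m in models:
        answer = ask_llm(m, prompt)                  # e.g. "0, 3"
        for idx in answer.replace(",", " ").split():
            if idx.isdigit():
                votes[int(idx)] += 1
    threshold = len(models) / 2                      # keep refs with a strict majority
    return sorted(i for i, v in votes.items() if v > threshold)
```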

[AI-130] Prevailing Research Areas for Music AI in the Era of Foundation Models

链接: https://arxiv.org/abs/2409.09378
作者: Megan Wei,Mateusz Modrzejewski,Aswin Sivaraman,Dorien Herremans
关键词-EN: generative models, past few years, recent advancements, generative, music
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In tandem with the recent advancements in foundation model research, there has been a surge of generative music AI applications within the past few years. As the idea of AI-generated or AI-augmented music becomes more mainstream, many researchers in the music AI community may be wondering what avenues of research are left. With regards to music generative models, we outline the current areas of research with significant room for exploration. Firstly, we pose the question of foundational representation of these generative models and investigate approaches towards explainability. Next, we discuss the current state of music datasets and their limitations. We then overview different generative models, forms of evaluating these models, and their computational constraints/limitations. Subsequently, we highlight applications of these generative models towards extensions to multiple modalities and integration with artists’ workflow as well as music education systems. Finally, we survey the potential copyright implications of generative music and discuss strategies for protecting the rights of musicians. While it is not meant to be exhaustive, our survey calls to attention a variety of research directions enabled by music foundation models.

[AI-131] LACOSTE: Exploiting stereo and temporal contexts for surgical instrument segmentation

链接: https://arxiv.org/abs/2409.09360
作者: Qiyuan Wang,Shang Zhao,Zikang Xu,S Kevin Zhou
关键词-EN: minimally invasive surgeries, Surgical instrument segmentation, related applications, instrumental to minimally, minimally invasive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Preprint submitted to Medical Image Analysis

点击查看摘要

Abstract:Surgical instrument segmentation is instrumental to minimally invasive surgeries and related applications. Most previous methods formulate this task as single-frame-based instance segmentation while ignoring the natural temporal and stereo attributes of a surgical video. As a result, these methods are less robust against the appearance variation through temporal motion and view change. In this work, we propose a novel LACOSTE model that exploits Location-Agnostic COntexts in Stereo and TEmporal images for improved surgical instrument segmentation. Leveraging a query-based segmentation model as core, we design three performance-enhancing modules. Firstly, we design a disparity-guided feature propagation module to enhance depth-aware features explicitly. To generalize well for even only a monocular video, we apply a pseudo stereo scheme to generate complementary right images. Secondly, we propose a stereo-temporal set classifier, which aggregates stereo-temporal contexts in a universal way for making a consolidated prediction and mitigates transient failures. Finally, we propose a location-agnostic classifier to decouple the location bias from mask prediction and enhance the feature semantics. We extensively validate our approach on three public surgical video datasets, including two benchmarks from EndoVis Challenges and one real radical prostatectomy surgery dataset GraSP. Experimental results demonstrate the promising performances of our method, which consistently achieves comparable or favorable results with previous state-of-the-art approaches.

[AI-132] Symbolic Regression with a Learned Concept Library

链接: https://arxiv.org/abs/2409.09359
作者: Arya Grayeli,Atharva Sehgal,Omar Costilla-Reyes,Miles Cranmer,Swarat Chaudhuri
关键词-EN: compact programmatic hypotheses, symbolic regression, explain a dataset, searching for compact, compact programmatic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
*备注: preprint version; 10 pages

点击查看摘要

Abstract:We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a novel and powerful scaling law for LLMs.

[AI-133] Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

链接: https://arxiv.org/abs/2409.09357
作者: Xiaoyu Liu,Xu Li,Joan Serrà,Santiago Pascual
关键词-EN: Speech restoration aims, restoring full-band speech, set of distortions, restoration aims, aims at restoring
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Demo link this https URL

点击查看摘要

Abstract:Speech restoration aims at restoring full-band speech with high quality and intelligibility, considering a diverse set of distortions. MaskSR is a recently proposed generative model for this task. As other models of its kind, MaskSR attains high quality but, as we show, intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low level spectral details of the target speech. We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves competitive word error rate among other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.

[AI-134] PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM

链接: https://arxiv.org/abs/2409.09354
作者: Kelin Fu,Yang Tian,Kaigui Bian
关键词-EN: mobile app operation, Large Language Model, daily learning, modern life, significantly enhanced
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Smartphones have significantly enhanced our daily learning, communication, and entertainment, becoming an essential component of modern life. However, certain populations, including the elderly and individuals with disabilities, encounter challenges in utilizing smartphones, thus necessitating mobile app operation assistants, a.k.a. mobile app agent. With considerations for privacy, permissions, and cross-platform compatibility issues, we endeavor to devise and develop PeriGuru in this work, a peripheral robotic mobile app operation assistant based on GUI image understanding and prompting with Large Language Model (LLM). PeriGuru leverages a suite of computer vision techniques to analyze GUI screenshot images and employs LLM to inform action decisions, which are then executed by robotic arms. PeriGuru achieves a success rate of 81.94% on the test task set, which surpasses by more than double the method without PeriGuru’s GUI image interpreting and prompting design. Our code is available on this https URL.

[AI-135] Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

链接: https://arxiv.org/abs/2409.09345
作者: Yuanzhao Zhai,Tingkai Yang,Kele Xu,Feng Dawei,Cheng Yang,Bo Ding,Huaimin Wang
关键词-EN: standalone Large Language, Large Language Models, Large Language, standalone Large, Language Models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Policy Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.
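
A minimal sketch of the inference-time selection described above: score each candidate action with a separately trained step-level Q-value model and act greedily. The `q_value_model` and `propose_actions` interfaces are assumptions for illustration, not the authors' code.

```python
def choose_action(task, history, q_value_model, propose_actions, k=5):
    """Greedy step-level selection: the LLM agent proposes k candidate actions,
    the Q-value model scores each (task, history, action) triple, and the
    highest-scoring action is executed."""
    candidates = propose_actions(task, history, k)           # e.g. sampled from the LLM agent
    scores = [q_value_model(task, history, a) for a in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```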

[AI-136] Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling

链接: https://arxiv.org/abs/2409.09340
作者: Tiantian Feng,Anfeng Xu,Xuan Shi,Somer Bishop,Shrikanth Narayanan
关键词-EN: Autism spectrum disorder, neurodevelopmental condition characterized, Autism spectrum, spectrum disorder, social communication
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: pre-print under review

点击查看摘要

Abstract:Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by challenges in social communication, repetitive behavior, and sensory processing. One important research area in ASD is evaluating children’s behavioral changes over time during treatment. The standard protocol with this objective is BOSCC, which involves dyadic interactions between a child and clinicians performing a pre-defined set of activities. A fundamental aspect of understanding children’s behavior in these interactions is automatic speech understanding, particularly identifying who speaks and when. Conventional approaches in this area heavily rely on speech samples recorded from a spectator perspective, and there is limited research on egocentric speech modeling. In this study, we design an experiment to perform speech sampling in BOSCC interviews from an egocentric perspective using wearable sensors and explore pre-training Ego4D speech samples to enhance child-adult speaker classification in dyadic interactions. Our findings highlight the potential of egocentric speech collection and pre-training to improve speaker classification accuracy.

[AI-137] Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation

链接: https://arxiv.org/abs/2409.09324
作者: Hui Yi Leong,Yi Fan Gao,Ji Shuai,Uktu Pamuksuz
关键词-EN: electronic health records, Scientific research, direct patient care, health records, desk work
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 4 pages, 3 Figures, 3 Tables, This is a preprint version of the article. The final version will be published in the proceedings of the IEEE conference

点击查看摘要

Abstract:Scientific research indicates that for every hour spent in direct patient care, physicians spend nearly two additional hours on administrative tasks, particularly on electronic health records (EHRs) and desk work. This excessive administrative burden not only reduces the time available for patient care but also contributes to physician burnout and inefficiencies in healthcare delivery. To address these challenges, this study introduces MediGen, a fine-tuned large language model (LLM) designed to automate the generation of medical reports from medical dialogues. By leveraging state-of-the-art methodologies for fine-tuning open-source pretrained models, including LLaMA3-8B, MediGen achieves high accuracy in transcribing and summarizing clinical interactions. The fine-tuned LLaMA3-8B model demonstrated promising results, achieving a ROUGE score of 58% and a BERTScore-F1 of 72%, indicating its effectiveness in generating accurate and clinically relevant medical reports. These findings suggest that MediGen has the potential to significantly reduce the administrative workload on physicians, improving both healthcare efficiency and physician well-being.

[AI-138] The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech

链接: https://arxiv.org/abs/2409.09305
作者: Kaito Baba,Wataru Nakata,Yuki Saito,Hiroshi Saruwatari
关键词-EN: VoiceMOS Challenge, VMC, Challenge, system, synthetic speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted by IEEE SLT 2024. Our MOS prediction system (UTMOSv2) is available in this https URL

点击查看摘要

Abstract:We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024. Our system was designed for the VMC 2024 Track 1, which focused on the accurate prediction of naturalness mean opinion score (MOS) for high-quality synthetic speech. In addition to a pretrained self-supervised learning (SSL)-based speech feature extractor, our system incorporates a pretrained image feature extractor to capture the difference of synthetic speech observed in speech spectrograms. We first separately train two MOS predictors that use either of an SSL-based or spectrogram-based feature. Then, we fine-tune the two predictors for better MOS prediction using the fusion of two extracted features. In the VMC 2024 Track 1, our T05 system achieved first place in 7 out of 16 evaluation metrics and second place in the remaining 9 metrics, with a significant difference compared to those ranked third and below. We also report the results of our ablation study to investigate essential factors of our system.

[AI-139] Matrix Profile for Anomaly Detection on Multidimensional Time Series

链接: https://arxiv.org/abs/2409.09298
作者: Chin-Chia Michael Yeh,Audrey Der,Uday Singh Saini,Vivian Lai,Yan Zheng,Junpeng Wang,Xin Dai,Zhongfang Zhuang,Yujie Fan,Huiyuan Chen,Prince Osei Aboagye,Liang Wang,Wei Zhang,Eamonn Keogh
关键词-EN: time series, multidimensional time series, anomaly detection, series data mining, series anomaly detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:The Matrix Profile (MP), a versatile tool for time series data mining, has been shown effective in time series anomaly detection (TSAD). This paper delves into the problem of anomaly detection in multidimensional time series, a common occurrence in real-world applications. For instance, in a manufacturing factory, multiple sensors installed across the site collect time-varying data for analysis. The Matrix Profile, named for its role in profiling the matrix storing pairwise distance between subsequences of univariate time series, becomes complex in multidimensional scenarios. If the input univariate time series has n subsequences, the pairwise distance matrix is an n x n matrix. In a multidimensional time series with d dimensions, the pairwise distance information must be stored in an n x n x d tensor. In this paper, we first analyze different strategies for condensing this tensor into a profile vector. We then investigate the potential of extending the MP to efficiently find k-nearest neighbors for anomaly detection. Finally, we benchmark the multidimensional MP against 19 baseline methods on 119 multidimensional TSAD datasets. The experiments cover three learning setups: unsupervised, supervised, and semi-supervised. MP is the only method that consistently delivers high performance across all setups.
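
A brute-force sketch of the condensation idea: compute per-dimension z-normalized subsequence distances, fill the n x n x d tensor, average it over dimensions, and take the column-wise minimum to obtain a profile vector. This is only an illustrative baseline, with averaging as one arbitrary condensation choice, not the optimized MP algorithms benchmarked in the paper.

```python
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / (s if s > 0 else 1.0)

def multidim_matrix_profile(ts, m):
    """Brute-force profile for a (length, d) multivariate time series with
    subsequence length m: average distances across dimensions, then take the
    minimum over non-trivial matches for each subsequence."""
    length, d = ts.shape
    n = length - m + 1
    dist = np.zeros((n, n, d))
    for k in range(d):
        subs = np.stack([znorm(ts[i:i + m, k]) for i in range(n)])
        for i in range(n):
            dist[i, :, k] = np.linalg.norm(subs - subs[i], axis=1)
    condensed = dist.mean(axis=2)          # one way to condense the n x n x d tensor
    np.fill_diagonal(condensed, np.inf)    # exclude trivial self-matches
    return condensed.min(axis=1)           # anomalies ~ large profile values

rng = np.random.default_rng(0)
t = np.arange(300)
base = np.sin(2 * np.pi * t / 20.0)
data = np.stack([base, base, base], axis=1) + 0.05 * rng.normal(size=(300, 3))
data[150:160] = 0.0                        # break the repeating pattern in all dimensions
profile = multidim_matrix_profile(data, m=20)
print(profile.argmax())                    # falls among subsequences overlapping the broken segment
```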

[AI-140] Language Models “Grok” to Copy

链接: https://arxiv.org/abs/2409.09281
作者: Ang Lv,Ruobing Xie,Xingwu Sun,Zhanhui Kang,Rui Yan
关键词-EN: LLM applications, including in-context learning, Transformer-based language models, retrieval-augmented generation, copy text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 7 figures

点击查看摘要

Abstract:We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context–a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on test set long after the model fit to the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, similarly to how grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrated that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.

[AI-141] An empirical evaluation of using ChatGPT to summarize disputes for recommending similar labor and employment cases in Chinese

链接: https://arxiv.org/abs/2409.09280
作者: Po-Hsien Wu,Chao-Lin Liu,Wei-Jie Li
关键词-EN: recommending similar cases, disputes, employment litigations, mechanism for recommending, recommending similar
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 figures, 2 tables, the 18th Int’l Workshop on Juris-Informatics (JURISIN 2024), associated with the 16th JSAI International Symposium on AI (JSAI-isAI 2024)

点击查看摘要

Abstract:We present a hybrid mechanism for recommending similar cases of labor and employment litigations. The classifier determines the similarity based on the itemized disputes of the two cases, which the courts prepared. We cluster the disputes, compute the cosine similarity between the disputes, and use the results as the features for the classification tasks. Experimental results indicate that this hybrid approach outperformed our previous system, which considered only the information about the clusters of the disputes. We replaced the disputes that were prepared by the courts with the itemized disputes that were generated by GPT-3.5 and GPT-4, and repeated the same experiments. Using the disputes generated by GPT-4 led to better results. Although our classifier did not perform as well when using the disputes that ChatGPT generated, the results were satisfactory. Hence, we hope that future large language models will become practically useful.
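
A hedged sketch of the feature-construction step described above: vectorize the itemized disputes of two cases and aggregate pairwise cosine similarities into classifier features. The clustering step and the exact aggregation follow the paper, not this toy, and the English toy disputes are illustrative stand-ins for the Chinese court text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dispute_similarity_features(disputes_a, disputes_b):
    """Simple aggregate cosine-similarity features between the itemized
    disputes of two cases (a stand-in for the paper's cluster-based features)."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(disputes_a + disputes_b)
    sims = cosine_similarity(X[:len(disputes_a)], X[len(disputes_a):])
    return [sims.max(), sims.mean(), sims.max(axis=1).mean()]

case_a = ["whether overtime pay was owed", "validity of the dismissal"]
case_b = ["whether the dismissal was lawful", "calculation of severance pay"]
print(dispute_similarity_features(case_a, case_b))
```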

[AI-142] LabellessFace: Fair Metric Learning for Face Recognition without Attribute Labels

链接: https://arxiv.org/abs/2409.09274
作者: Tetsushi Ohki,Yuya Sato,Masakatsu Nishigaki,Koichi Ito
关键词-EN: major challenges, Demographic, Demographic bias, face recognition, recognition systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Demographic bias is one of the major challenges for face recognition systems. The majority of existing studies on demographic biases are heavily dependent on specific demographic groups or a demographic classifier, making it difficult to address performance for unrecognised groups. This paper introduces “LabellessFace”, a novel framework that mitigates demographic bias in face recognition without requiring the demographic group labeling typically required for fairness considerations. We propose a novel fairness enhancement metric called the class favoritism level, which assesses the extent of favoritism towards specific classes across the dataset. Leveraging this metric, we introduce the fair class margin penalty, an extension of existing margin-based metric learning. This method dynamically adjusts learning parameters based on class favoritism levels, promoting fairness across all attributes. By treating each class as an individual in facial recognition systems, we facilitate learning that minimizes biases in authentication accuracy among individuals. Comprehensive experiments have demonstrated that our proposed method is effective for enhancing fairness while maintaining authentication accuracy.
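
As an illustrative (not official) rendering of a fair class margin penalty, one can scale a per-class additive angular margin by an estimated class favoritism level on top of a standard margin-based softmax such as ArcFace. The scaling rule and the assumption that favoritism levels are already computed and normalized are placeholders for the paper's formulation.

```python
import numpy as np

def fair_margin_logits(cos_theta, labels, favoritism, base_margin=0.5):
    """cos_theta: (batch, num_classes) cosine similarities to class prototypes.
    favoritism: (num_classes,) in [0, 1]; more-favored classes get a larger
    margin so that less-favored classes are not penalized as hard."""
    margins = base_margin * (0.5 + favoritism)          # per-class margin, illustrative rule
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    adjusted = cos_theta.copy()
    rows = np.arange(len(labels))
    adjusted[rows, labels] = np.cos(theta[rows, labels] + margins[labels])
    return adjusted                                      # feed into scaled softmax cross-entropy

# Toy check: two samples, three classes
cos = np.array([[0.8, 0.1, 0.0], [0.2, 0.7, 0.1]])
print(fair_margin_logits(cos, labels=np.array([0, 1]), favoritism=np.array([0.9, 0.2, 0.5])))
```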

[AI-143] SafeEar: Content Privacy-Preserving Audio Deepfake Detection (CCS 2024)

链接: https://arxiv.org/abs/2409.09272
作者: Xinfeng Li,Kai Li,Yifan Zheng,Chen Yan,Xiaoyu Ji,Wenyuan Xu
关键词-EN: Voice Conversion, exhibited remarkable performance, exhibited remarkable, remarkable performance, performance in generating
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted by ACM CCS 2024. Please cite this paper as “Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu. SafeEar: Content Privacy-Preserving Audio Deepfake Detection. In Proceedings of ACM Conference on Computer and Communications Security (CCS), 2024.”

点击查看摘要

Abstract:Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfake poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech based on complete original audio recordings, which however often contain private content. This oversight may refrain deepfake detection from many applications, particularly in scenarios involving sensitive information like business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within. Our key idea is to devise a neural audio codec into a novel decoupling model that well separates the semantic and acoustic information from audio samples, and only use the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content will be exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar’s effectiveness in detecting various deepfake techniques with an equal error rate (EER) down to 2.02%. Simultaneously, it shields five-language speech content from being deciphered by both machine and human auditory analysis, demonstrated by word error rates (WERs) all above 93.93% and our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content recovery evaluation helps provide a basis for future research in the realms of audio privacy preservation and deepfake detection.

[AI-144] Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks Domains and Knowledge Types

链接: https://arxiv.org/abs/2409.09269
作者: Neelabh Sinha,Vinija Jain,Aman Chadha
关键词-EN: aid user experience, Visual Question-Answering, achieving good results, user experience, zero-shot inference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 8 pages + references + 6 pages of Appendix

点击查看摘要

Abstract:Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieved good results in zero-shot inference. But evaluating different VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts, while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.

[AI-145] Operational Wind Speed Forecasts for Chiles Electric Power Sector Using a Hybrid ML Model

链接: https://arxiv.org/abs/2409.09263
作者: Dhruv Suri,Praneet Dutta,Flora Xue,Ines Azevedo,Ravi Jain
关键词-EN: managing grid operations, electric power sector, power sector advances, Chile electric power, renewable energy sources
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:As Chile’s electric power sector advances toward a future powered by renewable energy, accurate forecasting of renewable generation is essential for managing grid operations. The integration of renewable energy sources is particularly challenging due to the operational difficulties of managing their power generation, which is highly variable compared to fossil fuel sources, delaying the availability of clean energy. To mitigate this, we quantify the impact of increasing intermittent generation from wind and solar on thermal power plants in Chile and introduce a hybrid wind speed forecasting methodology which combines two custom ML models for Chile. The first model is based on TiDE, an MLP-based ML model for short-term forecasts, and the second is based on a graph neural network, GraphCast, for medium-term forecasts up to 10 days. Our hybrid approach outperforms the most accurate operational deterministic systems by 4-21% for short-term forecasts and 5-23% for medium-term forecasts and can directly lower the impact of wind generation on thermal ramping, curtailment, and system-level emissions in Chile.

[AI-146] What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing

链接: https://arxiv.org/abs/2409.09261
作者: Chenyang Yang,Yining Hong,Grace A. Lewis,Tongshuang Wu,Christian Kästner
关键词-EN: Machine learning models, models make mistakes, Machine learning, learning models make, make mistakes
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify relevant examples. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
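
A rough sketch of the semantic slicing loop described above, with a placeholder `ask_llm` standing in for the LLM annotation call; the actual SemSlicer prompting strategy and cost-accuracy trade-offs are described in the paper.

```python
def ask_llm(prompt):
    # Placeholder for an LLM call that answers "yes" or "no".
    raise NotImplementedError

def semantic_slice(examples, criterion):
    """Return the subset of examples matching a user-defined, free-text
    slicing criterion, as judged by an LLM annotator."""
    slice_members = []
    for ex in examples:
        prompt = (f"Criterion: {criterion}\nExample: {ex['text']}\n"
                  "Does the example match the criterion? Answer yes or no.")
        if ask_llm(prompt).strip().lower().startswith("yes"):
            slice_members.append(ex)
    return slice_members

def slice_accuracy(slice_members):
    # Assumes each example dict carries a model 'prediction' and a gold 'label';
    # compare against overall accuracy to spot under-performing slices.
    correct = sum(ex["prediction"] == ex["label"] for ex in slice_members)
    return correct / max(len(slice_members), 1)
```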

[AI-147] Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator

链接: https://arxiv.org/abs/2409.09253
作者: Jun Yin,Zhengxin Zeng,Mingzheng Li,Hao Yan,Chaozhuo Li,Weihao Han,Jianjin Zhang,Ruochen Liu,Allen Sun,Denvy Deng,Feng Sun,Qi Zhang,Shirui Pan,Senzhang Wang
关键词-EN: large language models, pre-trained large language, shown fantastic potential, next-generation recommender systems, semantic index
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Owing to the unprecedented capability in semantic understanding and logical reasoning, the pre-trained large language models (LLMs) have shown fantastic potential in developing the next-generation recommender systems (RSs). However, the static index paradigm adopted by current methods greatly restricts the utilization of LLMs capacity for recommendation, leading to not only the insufficient alignment between semantic and collaborative knowledge, but also the neglect of high-order user-item interaction patterns. In this paper, we propose Twin-Tower Dynamic Semantic Recommender (TTDS), the first generative RS which adopts dynamic semantic index paradigm, targeting at resolving the above problems simultaneously. To be more specific, we for the first time contrive a dynamic knowledge fusion framework which integrates a twin-tower semantic token generator into the LLM-based recommender, hierarchically allocating meaningful semantic index for items and users, and accordingly predicting the semantic index of target item. Furthermore, a dual-modality variational auto-encoder is proposed to facilitate multi-grained alignment between semantic and collaborative knowledge. Eventually, a series of novel tuning tasks specially customized for capturing high-order user-item interaction patterns are proposed to take advantages of user historical behavior. Extensive experiments across three public datasets demonstrate the superiority of the proposed methodology in developing LLM-based generative RSs. The proposed TTDS recommender achieves an average improvement of 19.41% in Hit-Rate and 20.84% in NDCG metric, compared with the leading baseline methods.

[AI-148] ETAGE: Enhanced Test Time Adaptation with Integrated Entropy and Gradient Norms for Robust Model Performance

链接: https://arxiv.org/abs/2409.09251
作者: Afshar Shamsi,Rejisa Becirovic,Ahmadreza Argha,Ehsan Abbasnejad,Hamid Alinejad-Rokny,Arash Mohammadi
关键词-EN: equips deep learning, unseen test data, handle unseen test, Label Probability Difference, deep learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Test time adaptation (TTA) equips deep learning models to handle unseen test data that deviates from the training distribution, even when source data is inaccessible. While traditional TTA methods often rely on entropy as a confidence metric, its effectiveness can be limited, particularly in biased scenarios. Extending existing approaches like the Pseudo Label Probability Difference (PLPD), we introduce ETAGE, a refined TTA method that integrates entropy minimization with gradient norms and PLPD, to enhance sample selection and adaptation. Our method prioritizes samples that are less likely to cause instability by excluding from adaptation samples that combine high entropy with high gradient norms, thus avoiding the overfitting to noise often observed in previous methods. Extensive experiments on CIFAR-10-C and CIFAR-100-C datasets demonstrate that our approach outperforms existing TTA techniques, particularly in challenging and biased scenarios, leading to more robust and consistent model performance across diverse test scenarios. The codebase for ETAGE is available on this https URL.
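
A minimal sketch of the sample-selection rule suggested by the abstract: adapt only on test samples whose entropy, gradient norm, and PLPD statistics indicate they are unlikely to destabilize adaptation. The thresholds and the exact combination rule here are placeholders, not the paper's.

```python
import numpy as np

def select_for_adaptation(entropy, grad_norm, plpd,
                          e_max=0.4, g_max=1.0, p_min=0.2):
    """entropy, grad_norm, plpd: per-sample statistics (1-D arrays).
    Exclude samples where high entropy coincides with a high gradient norm
    (likely to destabilize adaptation) and additionally require a
    sufficiently high PLPD."""
    unstable = (entropy > e_max) & (grad_norm > g_max)
    keep = ~unstable & (plpd > p_min)
    return np.flatnonzero(keep)

stats = dict(entropy=np.array([0.1, 0.9, 0.3]),
             grad_norm=np.array([0.5, 0.2, 2.0]),
             plpd=np.array([0.5, 0.6, 0.1]))
print(select_for_adaptation(**stats))   # -> [0 1]
```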

[AI-149] Robust Training of Neural Networks at Arbitrary Precision and Sparsity

链接: https://arxiv.org/abs/2409.09245
作者: Chengxi Ye,Grace Chu,Yanfeng Liu,Yichi Zhang,Lukasz Lew,Andrew Howard
关键词-EN: sparsification introduce obstacles, discontinuous operations inherent, obstacles to backpropagation, discontinuous operations, introduce obstacles
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.
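
A sketch of the ridge-regression intuition: treat quantization as an additive perturbation and fit an affine map that, applied to the quantized values, best reconstructs the clean ones. The closed-form ridge solution below is an illustration under these assumptions, not the paper's exact denoising affine transform.

```python
import numpy as np

def denoising_affine(clean, perturbed, lam=1e-2):
    """Fit y ≈ a * x + b by ridge regression, where x are perturbed
    (e.g. quantized) values and y the clean ones."""
    x = np.stack([perturbed, np.ones_like(perturbed)], axis=1)    # design matrix [x, 1]
    w = np.linalg.solve(x.T @ x + lam * np.eye(2), x.T @ clean)   # closed-form ridge solution
    return w                                                      # (a, b)

rng = np.random.default_rng(0)
w_clean = rng.normal(size=1000)
w_quant = np.round(w_clean * 4) / 4        # coarse uniform quantization as the perturbation
a, b = denoising_affine(w_clean, w_quant)
print(a, b)                                # close to identity here; a biased perturbation would be corrected
```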

[AI-150] Cross-Entropy Optimization for Hyperparameter Optimization in Stochastic Gradient-based Approaches to Train Deep Neural Networks

链接: https://arxiv.org/abs/2409.09240
作者: Kevin Li,Fulu Li
关键词-EN: stochastic gradient-based approaches, deep neural networks, train deep neural, neural networks, gradient-based approaches
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:In this paper, we present a cross-entropy optimization method for hyperparameter optimization in stochastic gradient-based approaches to train deep neural networks. The value of a hyperparameter of a learning algorithm often has great impact on the performance of a model such as the convergence speed, the generalization performance metrics, etc. While in some cases the hyperparameters of a learning algorithm can be part of learning parameters, in other scenarios the hyperparameters of a stochastic optimization algorithm such as Adam [5] and its variants are either fixed as a constant or are kept changing in a monotonic way over time. We give an in-depth analysis of the presented method in the framework of expectation maximization (EM). The presented algorithm of cross-entropy optimization for hyperparameter optimization of a learning algorithm (CEHPO) can be equally applicable to other areas of optimization problems in deep learning. We hope that the presented methods can provide different perspectives and offer some insights for optimization problems in different areas of machine learning and beyond.
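
A compact sketch of cross-entropy optimization over a single hyperparameter (here, a learning rate searched in log-space): sample candidates from a Gaussian, keep the elite fraction by validation score, and refit the sampling distribution. The toy objective stands in for actual model training and is not from the paper.

```python
import numpy as np

def cross_entropy_search(score_fn, mu=-3.0, sigma=1.0,
                         iters=10, pop=20, elite_frac=0.2, seed=0):
    """Search log10(learning rate) with the cross-entropy method."""
    rng = np.random.default_rng(seed)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = rng.normal(mu, sigma, size=pop)
        scores = np.array([score_fn(10.0 ** c) for c in candidates])
        elites = candidates[np.argsort(scores)[-n_elite:]]    # keep the best candidates
        mu, sigma = elites.mean(), elites.std() + 1e-6         # refit the sampling distribution
    return 10.0 ** mu

# Toy "validation score" peaking at lr = 1e-2
toy_score = lambda lr: -abs(np.log10(lr) + 2.0)
print(cross_entropy_search(toy_score))   # converges near 1e-2
```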

[AI-151] Autoregressive Chain of Thought (CoT) ≃ Recurrent: Recurrence’s Role in Language Models and a Revisit of Recurrent Transformer

链接: https://arxiv.org/abs/2409.09239
作者: Xiang Zhang,Muhammad Abdul-Mageed,Laks V.S. Lakshmanan
关键词-EN: RNN and LSTM, Transformer architecture excels, outperforming traditional neural, outperforming traditional, Transformer
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Transformer architecture excels in a variety of language modeling tasks, outperforming traditional neural architectures such as RNN and LSTM. This is partially due to its elimination of recurrent connections, which allows for parallel training and a smoother flow of gradients. However, this move away from recurrent structures places the Transformer model at the lower end of Chomsky’s computational hierarchy, imposing limitations on its computational abilities. Consequently, even advanced Transformer-based models face considerable difficulties in tasks like counting, string reversal, bracket pairing, and multiplication. These tasks, though seemingly elementary, require a level of computational complexity that exceeds the capabilities of the Transformer architecture. Concurrently, the emergence of “Chain of Thought” (CoT) prompting has enabled Transformer-based language models to tackle tasks that were previously impossible or poorly executed. Despite some previous research primarily interpreting CoT from a psychological perspective, a comprehensive understanding of why CoT proves so effective in the reasoning process remains elusive. In this work, we thoroughly investigate the influence of recurrent structures in language models on their reasoning abilities, shedding light on how the CoT approach can mimic recurrent computation and act as a bridge between autoregression and recurrence. It is this approximated recurrence that notably improves the model's performance and computational capacity. Moreover, we revisit recent recurrent-based Transformer model designs, focusing on their computational abilities through our proposed concept of “recurrence-completeness” and identify key theoretical limitations in models like Linear Transformer and RWKV. Through this, we aim to provide insight into the neural model architectures and prompt better model design.

[AI-152] Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases

链接: https://arxiv.org/abs/2409.09201
作者: Mercy Asiedu,Nenad Tomasev,Chintan Ghate,Tiya Tiyasirichokchai,Awa Dieng,Oluwatosin Akande,Geoffrey Siwo,Steve Adudans,Sylvanus Aitkins,Odianosen Ehiakhamen,Katherine Heller
关键词-EN: large language models, limited work focused, infectious disease-specific exploration, medical question answering, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific exploration. We build on an open-source tropical and infectious diseases (TRINDs) dataset, expanding it to include demographic and semantic clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM performance on these, comparing generalist and medical LLMs, as well as LLM outcomes to human experts. We demonstrate, through systematic experimentation, the benefit of contextual information such as demographics, location, gender, and risk factors for optimal LLM response. Finally, we develop a prototype of TRINDs-LM, a research tool that provides a playground to navigate how context impacts LLM outputs for health.
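
A minimal sketch of the kind of contextual augmentation the abstract describes, where demographic and risk-factor fields are templated into a base clinical query; the field names, values, and template below are hypothetical and not the actual TRINDs schema.

```python
base_case = ("A patient presents with fever, joint pain, and a rash "
             "after a recent mosquito bite. What is the most likely diagnosis?")

contexts = [
    {"age": 34, "gender": "female", "location": "Lagos, Nigeria",
     "risk_factors": "recent travel to a rural area"},
    {"age": 7, "gender": "male", "location": "Nairobi, Kenya",
     "risk_factors": "no bed net use"},
]

def augment(case, ctx):
    # prepend contextual information (demographics, location, risk factors)
    return (f"Patient context: {ctx['age']}-year-old {ctx['gender']}, "
            f"located in {ctx['location']}, risk factors: {ctx['risk_factors']}.\n"
            f"{case}")

prompts = [augment(base_case, c) for c in contexts]
print(prompts[0])
```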

[AI-153] Hierarchical Hypercomplex Network for Multimodal Emotion Recognition

链接: https://arxiv.org/abs/2409.09194
作者: Eleonora Lopez,Aurelio Uncini,Danilo Comminiello
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper has been accepted at MLSP 2024

点击查看摘要

[AI-154] ProcessTBench: An LLM Plan Generation Dataset for Process Mining MICRO

链接: https://arxiv.org/abs/2409.09191
作者: Andrei Cosmin Redis,Mohammadreza Fani Sani,Bahram Zarrin,Andrea Burattin
关键词-EN: Large Language Models, shown significant promise, Large Language, Language Models, plan generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 6 pages, 4 figures, dataset available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant promise in plan generation. Yet, existing datasets often lack the complexity needed for advanced tool use scenarios - such as handling paraphrased query statements, supporting multiple languages, and managing actions that can be done in parallel. These scenarios are crucial for evaluating the evolving capabilities of LLMs in real-world applications. Moreover, current datasets don’t enable the study of LLMs from a process perspective, particularly in scenarios where understanding typical behaviors and challenges in executing the same process under different conditions or formulations is crucial. To address these gaps, we present the ProcessTBench dataset, an extension of the TaskBench dataset specifically designed to evaluate LLMs within a process mining framework.

[AI-155] Incorporation of Verifier Functionality in the Software for Operations and Network Attack Results Review and the Autonomous Penetration Testing System

链接: https://arxiv.org/abs/2409.09174
作者: Jordan Milbrath,Jeremy Straub
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: The U.S. federal sponsor has requested that we not include funding acknowledgement for this publication

点击查看摘要

[AI-156] The Challenges of Effective AGM Belief Contraction

链接: https://arxiv.org/abs/2409.09171
作者: Dominik Klumpp,Jandson S. Ribeiro
关键词-EN:
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: 20 pages, 4 figures

点击查看摘要

[AI-157] Curricula for Learning Robust Policies over Factored State Representations in Changing Environments

链接: https://arxiv.org/abs/2409.09169
作者: Panayiotis Panayiotou,Özgür Şimşek
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17th European Workshop on Reinforcement Learning (EWRL 2024)

点击查看摘要

[AI-158] Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation

链接: https://arxiv.org/abs/2409.09135
作者: Cheng Charles Ma,Kevin Hyekang Joo,Alexandria K. Vail,Sunreeta Bhattacharya,Álvaro Fernández García,Kailana Baker-Matsuoka,Sheryl Mathew,Lori L. Holt,Fernando De la Torre
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 22 pages, first three authors equal contribution

点击查看摘要

[AI-159] Neural Message Passing Induced by Energy-Constrained Diffusion ICLR2023

链接: https://arxiv.org/abs/2409.09111
作者: Qitian Wu,David Wipf,Junchi Yan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Extended version from DIFFormer paper in ICLR2023

点击查看摘要

[AI-160] Proactive and Reactive Constraint Programming for Stochastic Project Scheduling with Maximal Time-Lags

链接: https://arxiv.org/abs/2409.09107
作者: Kim van den Houten,Léon Planken,Esteban Freydell,David M.J. Tax,Mathijs de Weerdt
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-161] Recent Trends in Modelling the Continuous Time Series using Deep Learning: A Survey

链接: https://arxiv.org/abs/2409.09106
作者: Mansura Habiba,Barak A. Pearlmutter,Mehrdad Maleki
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-162] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

链接: https://arxiv.org/abs/2409.09086
作者: Zhenyu Ning,Jieru Zhao,Qihao Jin,Wenchao Ding,Minyi Guo
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注:

点击查看摘要

[AI-163] Shadowed AHP for multi-criteria supplier selection

链接: https://arxiv.org/abs/2409.09082
作者: Mohamed Abdel Hameed El-Hawy
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

[AI-164] D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks VLDB’24

链接: https://arxiv.org/abs/2409.09079
作者: Rustam Guliyev,Aparajita Haldar,Hakan Ferhatosmanoglu
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 7 figures, published at VLDB’24

点击查看摘要

[AI-165] Fair Reinforcement Learning Algorithm for PV Active Control in LV Distribution Networks

链接: https://arxiv.org/abs/2409.09074
作者: Maurizio Vassallo,Amina Benzerga,Alireza Bahmanyar,Damien Ernst
关键词-EN:
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-166] Joint Model Assignment and Resource Allocation for Cost-Effective Mobile Generative Services

链接: https://arxiv.org/abs/2409.09072
作者: Shuangwei Gao,Peng Yang,Yuxin Kong,Feng Lyu,Ning Zhang
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-167] ELMS: Elasticized Large Language Models On Mobile Devices

链接: https://arxiv.org/abs/2409.09071
作者: Wangsong Yin,Rongjie Yi,Daliang Xu,Gang Huang,Mengwei Xu,Xuanzhe Liu
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: Technical Report

点击查看摘要

[AI-168] Temporal Many-valued Conditional Logics: a Preliminary Report

链接: https://arxiv.org/abs/2409.09069
作者: Mario Alviano,Laura Giordano,Daniele Theseider Dupré
关键词-EN:
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: 8 pages

点击查看摘要

[AI-169] TS-EoH: An Edge Server Task Scheduling Algorithm Based on Evolution of Heuristic

链接: https://arxiv.org/abs/2409.09063
作者: Wang Yatong,Pei Yuchen,Zhao Yuqi
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-170] Redefining Data-Centric Design: A New Approach with a Domain Model and Core Data Ontology for Computational Systems

链接: https://arxiv.org/abs/2409.09058
作者: William Johnson,James Davis,Tara Kelly
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-171] Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year Multi-Grade Analysis

链接: https://arxiv.org/abs/2409.09054
作者: Adrian Marius Dumitran,Adrian Catalin Badea,Stefan-Gabriel Muscalu
关键词-EN:
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: 7 pages, Inista 2024

点击查看摘要

[AI-172] AI Meets the Classroom: When Does ChatGPT Harm Learning?

链接: https://arxiv.org/abs/2409.09047
作者: Matthias Lehmann,Philipp B. Cornelius,Fabian J. Sting
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-173] HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications EMNLP2024

链接: https://arxiv.org/abs/2409.09046
作者: Rishi Kalra,Zekun Wu,Ayesha Gulley,Airlie Hilliard,Xin Guan,Adriano Koshiyama,Philip Treleaven
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review for the EMNLP 2024 Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual

点击查看摘要

[AI-174] United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament Elections

链接: https://arxiv.org/abs/2409.09045
作者: Leah von der Heyde,Anna-Carolina Haensch,Alexander Wenz
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
*备注:

点击查看摘要

[AI-175] ElasticAI: Creating and Deploying Energy-Efficient Deep Learning Accelerator for Pervasive Computing

链接: https://arxiv.org/abs/2409.09044
作者: Chao Qian,Tianheng Ling,Gregor Schiele
关键词-EN:
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The paper is accepted by 2023 IEEE International Conference on Pervasive Computing and Communications (Best Demo Award)

点击查看摘要

[AI-176] Semantic Communication for Cooperative Perception using HARQ

链接: https://arxiv.org/abs/2409.09042
作者: Yucheng Sheng,Le Liang,Hao Ye,Shi Jin,Geoffrey Ye Li
关键词-EN:
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-177] Acceptable Use Policies for Foundation Models

链接: https://arxiv.org/abs/2409.09041
作者: Kevin Klyman
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 10 pages, 2 figures, 2 tables

点击查看摘要

[AI-178] ChatSUMO: Large Language Model for Automating Traffic Scenario Generation in Simulation of Urban MObility

链接: https://arxiv.org/abs/2409.09040
作者: Shuyang Li,Talha Azfar,Ruimin Ke
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[AI-179] AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding

链接: https://arxiv.org/abs/2409.09039
作者: Zihan Huang,Tao Wu,Wang Lin,Shengyu Zhang,Jingyuan Chen,Fei Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-180] Evaluating Modern Approaches in 3D Scene Reconstruction: NeRF vs Gaussian-Based Methods

链接: https://arxiv.org/abs/2408.04268
作者: Yiming Zhou,Zixuan Zeng,Andi Chen,Xiaofan Zhou,Haowei Ni,Shiyao Zhang,Panfeng Li,Liangxi Liu,Mengyao Zheng,Xupeng Chen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

[AI-181] Regional Style and Color Transfer

链接: https://arxiv.org/abs/2404.13880
作者: Zhicheng Ding,Panfeng Li,Qikai Yang,Siyang Li,Qingtian Gong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Computer Vision, Image and Deep Learning

点击查看摘要

[AI-182] A Comparative Study on Enhancing Prediction in Social Network Advertisement through Data Augmentation

链接: https://arxiv.org/abs/2404.13812
作者: Qikai Yang,Panfeng Li,Xinhe Xu,Zhicheng Ding,Wenjing Zhou,Yi Nian
关键词-EN:
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by 2024 4th International Conference on Machine Learning and Intelligent Systems Engineering (MLISE)

点击查看摘要

[AI-183] Exploring Diverse Methods in Visual Question Answering

链接: https://arxiv.org/abs/2404.13565
作者: Panfeng Li,Qikai Yang,Xieming Geng,Wenjing Zhou,Zhicheng Ding,Yi Nian
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[AI-184] Confidence Trigger Detection: Accelerating Real-time Tracking-by-detection Systems

链接: https://arxiv.org/abs/1902.00615
作者: Zhicheng Ding,Zhixin Lai,Siyang Li,Panfeng Li,Qikai Yang,Edward Wong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[AI-185] Contextual Hourglass Network for Semantic Segmentation of High Resolution Aerial Imagery

链接: https://arxiv.org/abs/1810.12813
作者: Panfeng Li,Youzuo Lin,Emily Schultz-Fellenz
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[AI-186] An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems ICML2024

链接: https://arxiv.org/abs/2409.10515
作者: Hitesh Tulsiani,David M. Chan,Shalini Ghosh,Garima Lalwani,Prabhat Pandey,Ankish Bansal,Sri Garimella,Ariya Rastrow,Björn Hoffmeister
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
*备注: Presented at ICML 2024

点击查看摘要

[AI-187] Geometric Clustering for Hardware-Efficient Implementation of Chromatic Dispersion Compensation

链接: https://arxiv.org/abs/2409.10416
作者: Geraldo Gomes,Pedro Freire,Jaroslaw E. Prilepsky,Sergei K. Turitsyn
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-188] MOST: MR reconstruction Optimization for multiple downStream Tasks via continual learning

链接: https://arxiv.org/abs/2409.10394
作者: Hwihun Jeong,Se Young Chun,Jongho Lee
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-189] Neuromorphic Spintronics

链接: https://arxiv.org/abs/2409.10290
作者: Atreya Majumdar,Karin Everschor-Sitte
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Other Condensed Matter (cond-mat.other); Artificial Intelligence (cs.AI)
*备注: Neuromorphic Spintronics is a chapter of a book titled “Artificial Intelligence and Intelligent Matter”. This is not the final version of the chapter. For the final version, please go to the book published by Springer (the DOI and other details will be put here once the book has been published.)

点击查看摘要

[AI-190] FGR-Net: Interpretable fundus image gradeability classification based on deep reconstruction learning

链接: https://arxiv.org/abs/2409.10246
作者: Saif Khalid,Hatem A. Rashwan,Saddam Abdulwahab,Mohamed Abdel-Nasser,Facundo Manuel Quiroga,Domenec Puig
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-191] Anatomy of Machines for Markowitz: Decision-Focused Learning for Mean-Variance Portfolio Optimization

链接: https://arxiv.org/abs/2409.09684
作者: Junhyeong Lee,Inwoo Tae,Yongjae Lee
关键词-EN:
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI)
*备注: 7 pages, 3 figures, 3 tables

点击查看摘要

[AI-192] Reliable Multi-View Learning with Conformal Prediction for Aortic Stenosis Classification in Echocardiography MICCAI

链接: https://arxiv.org/abs/2409.09680
作者: Ang Nan Gu,Michael Tsang,Hooman Vaseli,Teresa Tsang,Purang Abolmaesumi
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer (2024) under the same title

点击查看摘要

[AI-193] Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection

链接: https://arxiv.org/abs/2409.09621
作者: Xuanru Zhou,Cheol Jun Cho,Ayati Sharma,Brittany Morin,David Baquirin,Jet Vonk,Zoe Ezzes,Zachary Miller,Boon Lead Tee,Maria Luisa Gorno Tempini,Jiachen Lian,Gopala Anumanchipalli
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: IEEE Spoken Language Technology Workshop 2024

点击查看摘要

[AI-194] From FDG to PSMA: A Hitchhiker's Guide to Multitracer Multicenter Lesion Segmentation in PET/CT Imaging

链接: https://arxiv.org/abs/2409.09478
作者: Maximilian Rokuss,Balint Kovacs,Yannick Kirchhoff,Shuhan Xiao,Constantin Ulrich,Klaus H. Maier-Hein,Fabian Isensee
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-195] Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation ICASSP2025

链接: https://arxiv.org/abs/2409.09381
作者: Chenxu Xiong,Ruibo Fu,Shuchen Shi,Zhengqi Wen,Jianhua Tao,Tao Wang,Chenxing Li,Chunyu Qiang,Yuankun Xie,Xin Qi,Guanjun Li,Zizheng Yang
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 5 pages, 2 figures, submitted to ICASSP 2025

点击查看摘要

[AI-196] Wave-U-Mamba: An End-To-End Framework For High-Quality And Efficient Speech Super Resolution

链接: https://arxiv.org/abs/2409.09337
作者: Yongjoon Lee,Chanwoo Kim
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

[AI-197] Phikon-v2, A large and public feature extractor for biomarker prediction

链接: https://arxiv.org/abs/2409.09173
作者: Alexandre Filiot,Paul Jacob,Alice Mac Kain,Charlie Saillard
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-198] Deep learning-based classification of breast cancer molecular subtypes from H&E whole-slide images

链接: https://arxiv.org/abs/2409.09053
作者: Masoud Tafavvoghi,Anders Sildnes,Mehrdad Rakaee,Nikita Shvetsov,Lars Ailo Bongo,Lill-Tove Rasmussen Busund,Kajsa Møllersen
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 5 figures (+4 supplementary figures), 4 tables

点击查看摘要

[AI-199] OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography

链接: https://arxiv.org/abs/2409.09052
作者: Youzhu Jin,Yichen Zhang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 1 figure

点击查看摘要

[AI-200] Harnessing Earnings Reports for Stock Predictions: A QLoRA-Enhanced LLM Approach

链接: https://arxiv.org/abs/2408.06634
作者: Haowei Ni,Shuchen Meng,Xupeng Chen,Ziqing Zhao,Andi Chen,Panfeng Li,Shiyao Zhang,Qifu Yin,Yuanqing Wang,Yuxi Chan
关键词-EN:
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

计算机视觉

[CV-0] Do Pre-trained Vision-Language Models Encode Object States?

链接: https://arxiv.org/abs/2409.10488
作者: Kaleb Newman,Shijie Wang,Yuan Zang,David Heffren,Chen Sun
关键词-EN: sliced apple, evolve over time, capture the temporal, temporal dynamics, encode object states
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g. a whole apple into a sliced apple). Our paper aims to investigate if VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects’ physical states. Through extensive experiments, we identify three areas for improvements for VLMs to better encode object states, namely the quality of object localization, the architecture to bind concepts to objects, and the objective to learn discriminative visual and language encoders on object states. Data and code are released.
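
A zero-shot probe of object states with an off-the-shelf contrastive VLM might look like the sketch below. The Hugging Face CLIP checkpoint, the example image path, and the prompt wording are illustrative stand-ins; the paper itself evaluates nine open-source VLMs on its ChangeIt-Frames dataset.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple.jpg")  # hypothetical image of a sliced apple
prompts = ["a photo of a whole apple", "a photo of a sliced apple"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image        # shape (1, num_prompts)
probs = logits.softmax(dim=-1).squeeze(0)

for p, prob in zip(prompts, probs.tolist()):
    print(f"{prob:.2f}  {p}")
```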

[CV-1] Exploring 3D Face Reconstruction and Fusion Methods for Face Verification: A Case-Study in Video Surveillance ECCV2024

链接: https://arxiv.org/abs/2409.10481
作者: Simone Maurizio La Cava,Sara Concas,Ruben Tolosana,Roberto Casula,Giulia Orrù,Martin Drahansky,Julian Fierrez,Gian Luca Marcialis
关键词-EN: specific assumptions tailored, distinct application scenarios, based on specific, tailored to distinct, specific assumptions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at T-CAP - Towards a Complete Analysis of People: Fine-grained Understanding for Real-World Applications, workshop in conjunction with the 18th European Conference on Computer Vision ECCV 2024

点击查看摘要

Abstract:3D face reconstruction (3DFR) algorithms are based on specific assumptions tailored to distinct application scenarios. These assumptions limit their use when acquisition conditions, such as the subject’s distance from the camera or the camera’s characteristics, are different than expected, as typically happens in video surveillance. Additionally, 3DFR algorithms follow various strategies to address the reconstruction of a 3D shape from 2D data, such as statistical model fitting, photometric stereo, or deep learning. In the present study, we explore the application of three 3DFR algorithms representative of the SOTA, employing each one as the template set generator for a face verification system. The scores provided by each system are combined by score-level fusion. We show that the complementarity induced by different 3DFR algorithms improves performance when tests are conducted at never-seen-before distances from the camera and camera characteristics (cross-distance and cross-camera settings), thus encouraging further investigations on multiple 3DFR-based approaches.
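
Score-level fusion of the per-3DFR-system verification scores can be as simple as normalising each system's comparison scores and averaging them. The min-max normalisation and sum (mean) rule below are one common choice, sketched under that assumption; the abstract does not state the exact fusion rule the authors use.

```python
import numpy as np

def minmax(scores):
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def fuse(score_lists):
    # score_lists: one array of comparison scores per 3DFR-based verification system
    return np.mean([minmax(s) for s in score_lists], axis=0)

sys_a = [0.62, 0.10, 0.80, 0.33]   # hypothetical scores from system A
sys_b = [0.55, 0.25, 0.90, 0.20]   # hypothetical scores from system B
print(fuse([sys_a, sys_b]))
```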

[CV-2] SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing

链接: https://arxiv.org/abs/2409.10476
作者: Qi Qian,Haiyang Xu,Ming Yan,Juhua Hu
关键词-EN: Diffusion models demonstrate, models demonstrate impressive, demonstrate impressive image, impressive image generation, DDIM inversion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models demonstrate impressive image generation performance with text guidance. Inspired by the learning process of diffusion, existing images can be edited according to text by DDIM inversion. However, the vanilla DDIM inversion is not optimized for classifier-free guidance and the accumulated error will result in the undesired performance. While many algorithms are developed to improve the framework of DDIM inversion for editing, in this work, we investigate the approximation error in DDIM inversion and propose to disentangle the guidance scale for the source and target branches to reduce the error while keeping the original framework. Moreover, a better guidance scale (i.e., 0.5) than default settings can be derived theoretically. Experiments on PIE-Bench show that our proposal can improve the performance of DDIM inversion dramatically without sacrificing efficiency.
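
The two ingredients the abstract refers to can be sketched as a deterministic DDIM inversion update plus a classifier-free-guidance combination whose scale may be set separately for the source (inversion) and target (editing) branches, with roughly 0.5 suggested for the source side. Tensor shapes and schedule handling below are simplified assumptions, not the authors' implementation.

```python
import torch

def cfg_eps(eps_uncond, eps_text, scale):
    # classifier-free guidance; the scale can differ between the source
    # (inversion) branch, e.g. ~0.5 per the abstract, and the target branch
    return eps_uncond + scale * (eps_text - eps_uncond)

def ddim_invert_step(x_t, eps, alpha_t, alpha_next):
    # deterministic DDIM update run image -> noise (alpha_next < alpha_t)
    x0_pred = (x_t - torch.sqrt(1 - alpha_t) * eps) / torch.sqrt(alpha_t)
    return torch.sqrt(alpha_next) * x0_pred + torch.sqrt(1 - alpha_next) * eps

# toy usage with dummy noise predictions
x = torch.randn(1, 4, 64, 64)
eps_u, eps_c = torch.randn_like(x), torch.randn_like(x)
x_next = ddim_invert_step(x, cfg_eps(eps_u, eps_c, scale=0.5),
                          alpha_t=torch.tensor(0.9), alpha_next=torch.tensor(0.85))
print(x_next.shape)
```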

[CV-3] MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion ECCV2024

链接: https://arxiv.org/abs/2409.10473
作者: Lehong Wu,Lilang Lin,Jiahang Zhang,Yiyang Ma,Jiaying Liu
关键词-EN: human action understanding, skeleton-based human action, Self-supervised learning, action understanding, learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning that suffers from false negative problems or are based on reconstruction that learns too many unessential low-level clues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task to model the general underlying data distributions. However, the representation learning capacity of generative models is under-explored, especially for the skeletons with spatial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to encoder inputs to introduce an information bottleneck and remove redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective involves the contrastive learning objective which aligns the masked and noisy views. Meanwhile, it also enforces the representation to complement for the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining the competence for generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing the fine-tuning performance in scenarios with scarce labeled data. Our project is available at this https URL.

[CV-4] Deep-Wide Learning Assistance for Insect Pest Classification

链接: https://arxiv.org/abs/2409.10445
作者: Toan Nguyen,Huy Nguyen,Huy Ung,Hieu Ung,Binh Nguyen
关键词-EN: Accurate insect pest, pest recognition plays, Accurate insect, role in agriculture, insect pest recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate insect pest recognition plays a critical role in agriculture. It is a challenging problem due to the intricate characteristics of insects. In this paper, we present DeWi, novel learning assistance for insect pest classification. With a one-stage and alternating training strategy, DeWi simultaneously improves several Convolutional Neural Networks in two perspectives: discrimination (by optimizing a triplet margin loss in a supervised training manner) and generalization (via data augmentation). From that, DeWi can learn discriminative and in-depth features of insect pests (deep) yet still generalize well to a large number of insect categories (wide). Experimental results show that DeWi achieves the highest performances on two insect pest classification benchmarks (76.44% accuracy on the IP102 dataset and 99.79% accuracy on the D0 dataset, respectively). In addition, extensive evaluations and ablation studies are conducted to thoroughly investigate our DeWi and demonstrate its superiority. Our source code is available at this https URL.
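
A rough sketch of the two training signals named in the abstract, a supervised triplet margin loss for discrimination plus data augmentation for generalization, combined in one step. The ResNet-50 backbone, the triplet sampling, the 102-class head (matching IP102's class count), and the merging of the paper's alternating schedule into a single joint loss are all simplifying assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

augment = T.Compose([T.RandomResizedCrop(224), T.RandomHorizontalFlip(),
                     T.ColorJitter(0.4, 0.4, 0.4)])

backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()                 # pooled features serve as embeddings
classifier = nn.Linear(2048, 102)           # e.g. 102 insect classes as in IP102

triplet = nn.TripletMarginLoss(margin=1.0)
ce = nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(backbone.parameters()) + list(classifier.parameters()), lr=0.01)

def train_step(anchor, positive, negative, labels):
    # anchor/positive share a class, negative comes from a different class;
    # augmentation supplies the generalization signal described in the abstract
    anchor = augment(anchor)
    za, zp, zn = backbone(anchor), backbone(positive), backbone(negative)
    loss = triplet(za, zp, zn) + ce(classifier(za), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```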

[CV-5] CtRNet-X: Camera-to-Robot Pose Estimation in Real-world Conditions Using a Single Camera

链接: https://arxiv.org/abs/2409.10441
作者: Jingpei Lu,Zekai Liang,Tristin Xie,Florian Ritcher,Shan Lin,Sainan Liu,Michael C. Yip
关键词-EN: pose estimation methods, markerless pose estimation, pose estimation, make it accurate, requires effort
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 5 figures, project website: this https URL

点击查看摘要

Abstract:Camera-to-robot calibration is crucial for vision-based robot control and requires effort to make it accurate. Recent advancements in markerless pose estimation methods have eliminated the need for time-consuming physical setups for camera-to-robot calibration. While the existing markerless pose estimation methods have demonstrated impressive accuracy without the need for cumbersome setups, they rely on the assumption that all the robot joints are visible within the camera’s field of view. However, in practice, robots usually move in and out of view, and some portion of the robot may stay out-of-frame during the whole manipulation task due to real-world constraints, leading to a lack of sufficient visual features and subsequent failure of these approaches. To address this challenge and enhance the applicability to vision-based robot control, we propose a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate our robustness and generalizability. As a result, this method is effective for robot pose estimation in a wider range of real-world manipulation scenarios.

[CV-6] Learning Semi-Supervised Medical Image Segmentation from Spatial Registration

链接: https://arxiv.org/abs/2409.10422
作者: Qianying Liu,Paul Henderson,Xiao Gu,Hang Dai,Fani Deligianni
关键词-EN: abundant unlabeled data, limited labeled data, shown promise, promise in training, training models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semi-supervised medical image segmentation has shown promise in training models with limited labeled data and abundant unlabeled data. However, state-of-the-art methods ignore a potentially valuable source of unsupervised semantic information – spatial registration transforms between image volumes. To address this, we propose CCT-R, a contrastive cross-teaching framework incorporating registration information. To leverage the semantic information available in registrations between volume pairs, CCT-R incorporates two proposed modules: Registration Supervision Loss (RSL) and Registration-Enhanced Positive Sampling (REPS). The RSL leverages segmentation knowledge derived from transforms between labeled and unlabeled volume pairs, providing an additional source of pseudo-labels. REPS enhances contrastive learning by identifying anatomically-corresponding positives across volumes using registration transforms. Experimental results on two challenging medical segmentation benchmarks demonstrate the effectiveness and superiority of CCT-R across various semi-supervised settings, with as few as one labeled case. Our code is available at this https URL.

[CV-7] Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation

链接: https://arxiv.org/abs/2409.10389
作者: Hanbo Bi,Yingchao Feng,Wenhui Diao,Peijin Wang,Yongqiang Mao,Kun Fu,Hongqi Wang,Xian Sun
关键词-EN: directly exploit pre-trained, exploit pre-trained encoders, Few-shot Segmentation, fine-tune the decoder, large models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) would directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called "Prompt and Transfer" (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder for focusing on the interested object (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) Cross-modal linguistic information is introduced to initialize prompts for each task. 2) Semantic Prompt Transfer (SPT) that precisely transfers the class-specific semantics within the images to prompts. 3) Part Mask Generator (PMG) that works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-arts on 11 benchmarks.

[CV-8] Mamba-ST: State Space Model for Efficient Style Transfer

链接: https://arxiv.org/abs/2409.10385
作者: Filippo Botti,Alex Ergasti,Leonardo Rossi,Tomaso Fontanini,Claudio Ferrari,Massimo Bertozzi,Andrea Prati
关键词-EN: style source, content image, image preserving, artistic representation, style transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The goal of style transfer is, given a content image and a style source, generating a new image preserving the content but with the artistic representation of the style source. Most of the state-of-the-art architectures use transformers or diffusion-based models to perform this task, despite the heavy computational burden that they require. In particular, transformers use self- and cross-attention layers which have large memory footprint, while diffusion models require high inference time. To overcome the above, this paper explores a novel design of Mamba, an emergent State-Space Model (SSM), called Mamba-ST, to perform style transfer. To do so, we adapt Mamba linear equation to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output, but drastically reducing memory usage and time complexity. We modified the Mamba’s inner equations so to accept inputs from, and combine, two separate data streams. To the best of our knowledge, this is the first attempt to adapt the equations of SSMs to a vision task like style transfer without requiring any other module like cross-attention or custom normalization layers. An extensive set of experiments demonstrates the superiority and efficiency of our method in performing style transfer compared to transformers and diffusion models. Results show improved quality in terms of both ArtFID and FID metrics. Code is available at this https URL.

[CV-9] Robust image representations with counterfactual contrastive learning

链接: https://arxiv.org/abs/2409.10365
作者: Mélanie Roschewitz,Fabio De Sousa Ribeiro,Tian Xia,Galvin Khara,Ben Glocker
关键词-EN: increase model generalisation, Contrastive, contrastive learning, counterfactual contrastive learning, counterfactual contrastive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code available at this https URL

点击查看摘要

Abstract:Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Our method, evaluated across five datasets encompassing both chest radiography and mammography data, for two established contrastive objectives (SimCLR and DINO-v2), outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance on both in-distribution and on external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning substantially improving subgroup performance across biological sex.
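
The core idea can be written as a standard InfoNCE objective in which the positive for each real image is its counterfactual (for example, a synthesis of the same scan as if acquired by a different scanner). The sketch below assumes the embeddings already exist; the counterfactual image generator itself is not shown.

```python
import torch
import torch.nn.functional as F

def counterfactual_info_nce(z_real, z_cf, temperature=0.1):
    """z_real: embeddings of real images, z_cf: embeddings of their
    counterfactual (e.g. scanner-swapped) versions, both of shape (N, D)."""
    z_real = F.normalize(z_real, dim=1)
    z_cf = F.normalize(z_cf, dim=1)
    logits = z_real @ z_cf.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(z_real.size(0), device=z_real.device)
    # the i-th counterfactual is the positive for the i-th real image
    return F.cross_entropy(logits, targets)

loss = counterfactual_info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```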

[CV-10] Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning

链接: https://arxiv.org/abs/2409.10362
作者: Amin Karimi Monsefi,Mengxi Zhou,Nastaran Karimi Monsefi,Ser-Nam Lim,Wei-Lun Chao,Rajiv Ramnath
关键词-EN: frequency-based Self-Supervised Learning, approach that significantly, frequency-based Self-Supervised, significantly enhances, enhances its efficacy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.
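
A rough sketch of adaptive frequency masking driven by the image's own spectrum rather than a pre-defined band; masking the strongest-magnitude frequencies is only an illustrative selection rule, not necessarily FOLK's exact criterion, and the two-branch knowledge-distillation part of the method is omitted.

```python
import torch

def adaptive_frequency_mask(img, mask_ratio=0.3):
    """img: (C, H, W) float tensor. Masks out the mask_ratio fraction of
    frequencies with the largest magnitude in this particular image."""
    freq = torch.fft.fft2(img)
    mag = freq.abs()
    k = int(mask_ratio * mag.numel())
    # per-image threshold below which frequencies are kept (hence adaptive)
    thresh = mag.flatten().kthvalue(mag.numel() - k).values
    keep = (mag <= thresh).float()
    return torch.fft.ifft2(freq * keep).real

masked = adaptive_frequency_mask(torch.rand(3, 224, 224))
print(masked.shape)
```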

[CV-11] 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?

链接: https://arxiv.org/abs/2409.10357
作者: Téo Guichoux,Laure Soulier,Nicolas Obin,Catherine Pelachaud
关键词-EN: Embodied Conversational Agents, fundamental for communication, synchronous co-speech gestures, Co-speech gestures, Conversational Agents
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. “In-the-wild” datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with speech. Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, it is important to note that the 3D poses estimated from the 2D extracted poses are, in essence, approximations of the ground-truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions - a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D. We perform an objective evaluation using widely used metrics in the gesture generation field as well as a user study to qualitatively evaluate the different approaches.

[CV-12] Taming Diffusion Models for Image Restoration: A Review

链接: https://arxiv.org/abs/2409.10353
作者: Ziwei Luo,Fredrik K. Gustafsson,Zheng Zhao,Jens Sjölund,Thomas B. Schön
关键词-EN: achieved remarkable progress, enhancing image quality, Diffusion models, generative modelling, human preferences
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Review paper; any comments and suggestions are most welcome!

点击查看摘要

Abstract:Diffusion models have achieved remarkable progress in generative modelling, particularly in enhancing image quality to conform to human preferences. Recently, these models have also been applied to low-level computer vision for photo-realistic image restoration (IR) in tasks such as image denoising, deblurring, dehazing, etc. In this review paper, we introduce key constructions in diffusion models and survey contemporary techniques that make use of diffusion models in solving general IR tasks. Furthermore, we point out the main challenges and limitations of existing diffusion-based IR frameworks and provide potential directions for future work.

[CV-13] Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation

链接: https://arxiv.org/abs/2409.10350
作者: Yifan Xu,Ziming Luo,Qianwei Wang,Vineet Kamat,Carol Menassa
关键词-EN: scene graph generation, open-vocabulary scene graph, posed RGB-D images, algorithms highly rely, RGB-D images
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:Current open-vocabulary scene graph generation algorithms highly rely on both 3D scene point cloud data and posed RGB-D images and thus have limited applications in scenarios where RGB-D images or camera poses are not readily available. To solve this problem, we propose Point2Graph, a novel end-to-end point cloud-based 3D open-vocabulary scene graph generation framework in which the requirement of posed RGB-D image series is eliminated. This hierarchical framework contains room and object detection/segmentation and open-vocabulary classification. For the room layer, we leverage the advantage of merging the geometry-based border detection algorithm with the learning-based region detection to segment rooms and create a “Snap-Lookup” framework for open-vocabulary room classification. In addition, we create an end-to-end pipeline for the object layer to detect and classify 3D objects based solely on 3D point cloud data. Our evaluation results show that our framework can outperform the current state-of-the-art (SOTA) open-vocabulary object and room segmentation and classification algorithm on widely used real-scene datasets.

[CV-14] Phys3DGS: Physically-based 3D Gaussian Splatting for Inverse Rendering

链接: https://arxiv.org/abs/2409.10335
作者: Euntae Choi,Sungjoo Yoo
关键词-EN: rendering, Gaussian splatting, inverse rendering, based inverse rendering, deferred rendering
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:We propose two novel ideas (adoption of deferred rendering and mesh-based representation) to improve the quality of 3D Gaussian splatting (3DGS) based inverse rendering. We first report a problem incurred by hidden Gaussians, where Gaussians beneath the surface adversely affect the pixel color in the volume rendering adopted by the existing methods. In order to resolve the problem, we propose applying deferred rendering and report new problems incurred in a naive application of deferred rendering to the existing 3DGS-based inverse rendering. In an effort to improve the quality of 3DGS-based inverse rendering under deferred rendering, we propose a novel two-step training approach which (1) exploits mesh extraction and utilizes a hybrid mesh-3DGS representation and (2) applies novel regularization methods to better exploit the mesh. Our experiments show that, under relighting, the proposed method offers significantly better rendering quality than the existing 3DGS-based inverse rendering methods. Compared with the SOTA voxel grid-based inverse rendering method, it gives better rendering quality while offering real-time rendering.

[CV-15] DRIVE: Dependable Robust Interpretable Visionary Ensemble Framework in Autonomous Driving

链接: https://arxiv.org/abs/2409.10330
作者: Songning Lai,Tianlang Xue,Hongru Xiao,Lijie Hu,Jiemin Wu,Ninghui Feng,Runwei Guan,Haicheng Liao,Zhenning Li,Yutao Yue
关键词-EN: map sensory inputs, sensory inputs directly, autonomous driving, autonomous driving models, Interpretable Visionary Ensemble
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in autonomous driving have seen a paradigm shift towards end-to-end learning paradigms, which map sensory inputs directly to driving actions, thereby enhancing the robustness and adaptability of autonomous vehicles. However, these models often sacrifice interpretability, posing significant challenges to trust, safety, and regulatory compliance. To address these issues, we introduce DRIVE – Dependable Robust Interpretable Visionary Ensemble Framework in Autonomous Driving, a comprehensive framework designed to improve the dependability and stability of explanations in end-to-end unsupervised autonomous driving models. Our work specifically targets the inherent instability problems observed in the Driving through the Concept Gridlock (DCG) model, which undermine the trustworthiness of its explanations and decision-making processes. We define four key attributes of DRIVE: consistent interpretability, stable interpretability, consistent output, and stable output. These attributes collectively ensure that explanations remain reliable and robust across different scenarios and perturbations. Through extensive empirical evaluations, we demonstrate the effectiveness of our framework in enhancing the stability and dependability of explanations, thereby addressing the limitations of current models. Our contributions include an in-depth analysis of the dependability issues within the DCG model, a rigorous definition of DRIVE with its fundamental properties, a framework to implement DRIVE, and novel metrics for evaluating the dependability of concept-based explainable autonomous driving models. These advancements lay the groundwork for the development of more reliable and trusted autonomous driving systems, paving the way for their broader acceptance and deployment in real-world applications.

[CV-16] InfoDisent: Explainability of Image Classification Models by Information Disentanglement

链接: https://arxiv.org/abs/2409.10329
作者: Łukasz Struski,Jacek Tabor
关键词-EN: critical area, area of research, methods, post-hoc methods, decisions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding the decisions made by image classification networks is a critical area of research in deep learning. This task is traditionally divided into two distinct approaches: post-hoc methods and intrinsic methods. Post-hoc methods, such as GradCam, aim to interpret the decisions of pre-trained models by identifying regions of the image where the network focuses its attention. However, these methods provide only a high-level overview, making it difficult to fully understand the network’s decision-making process. Conversely, intrinsic methods, like prototypical parts models, offer a more detailed understanding of network predictions but are constrained by specific architectures, training methods, and datasets. In this paper, we introduce InfoDisent, a hybrid model that combines the advantages of both approaches. By utilizing an information bottleneck, InfoDisent disentangles the information in the final layer of a pre-trained deep network, enabling the breakdown of classification decisions into basic, understandable atomic components. Unlike standard prototypical parts approaches, InfoDisent can interpret the decisions of pre-trained classification networks and be used for making classification decisions, similar to intrinsic models. We validate the effectiveness of InfoDisent on benchmark datasets such as ImageNet, CUB-200-2011, Stanford Cars, and Stanford Dogs for both convolutional and transformer backbones.

[CV-17] Fuse4Seg: Image-Level Fusion Based Multi-Modality Medical Image Segmentation

链接: https://arxiv.org/abs/2409.10328
作者: Yuchen Guo,Weifeng Su
关键词-EN: holds significant potential, integrating diverse imaging, existing methods predominantly, methods predominantly rely, segmentation holds significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although multi-modality medical image segmentation holds significant potential for enhancing the diagnosis and understanding of complex diseases by integrating diverse imaging modalities, existing methods predominantly rely on feature-level fusion strategies. We argue the current feature-level fusion strategy is prone to semantic inconsistencies and misalignments across various imaging modalities because it merges features at intermediate layers in a neural network without evaluative control. To mitigate this, we introduce a novel image-level fusion based multi-modality medical image segmentation method, Fuse4Seg, which is a bi-level learning framework designed to model the intertwined dependencies between medical image segmentation and medical image fusion. The image-level fusion process is seamlessly employed to guide and enhance the segmentation results through a layered optimization approach. Besides, the knowledge gained from the segmentation module can effectively enhance the fusion module. This ensures that the resultant fused image is a coherent representation that accurately amalgamates information from all modalities. Moreover, we construct a BraTS-Fuse benchmark based on BraTS dataset, which includes 2040 paired original images, multi-modal fusion images, and ground truth. This benchmark not only serves image-level medical segmentation but is also the largest dataset for medical image fusion to date. Extensive experiments on several public datasets and our benchmark demonstrate the superiority of our approach over prior state-of-the-art (SOTA) methodologies.

[CV-18] Baking Relightable NeRF for Real-time Direct/Indirect Illumination Rendering

链接: https://arxiv.org/abs/2409.10327
作者: Euntae Choi,Vincent Carpentier,Seunghun Shin,Sungjoo Yoo
关键词-EN: immersive photo-realistic experience, training time, photo-realistic experience, feature for immersive, immersive photo-realistic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Relighting, which synthesizes a novel view under a given lighting condition (unseen in training time), is a must-have feature for an immersive photo-realistic experience. However, real-time relighting is challenging due to the high computation cost of the rendering equation, which requires shape and material decomposition and a visibility test to model shadows. Additionally, for indirect illumination, additional computation of the rendering equation on each secondary surface point (where reflection occurs) is required, making real-time relighting even more challenging. We propose a novel method that executes a CNN renderer to compute primary surface points and rendering parameters, required for direct illumination. We also present a lightweight hash grid-based renderer, for indirect illumination, which is recursively executed to perform the secondary ray tracing process. Both renderers are trained in a distillation from a pre-trained teacher model and provide real-time physically-based rendering under unseen lighting condition at a negligible loss of rendering quality.

[CV-19] On Synthetic Texture Datasets: Challenges Creation and Curation

链接: https://arxiv.org/abs/2409.10297
作者: Blaine Hoak,Patrick McDaniel
关键词-EN: machine learning models, machine learning, texture, ongoing investigation, texture images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The influence of textures on machine learning models has been an ongoing investigation, specifically in texture bias/learning, interpretability, and robustness. However, due to the lack of large and diverse texture data available, the findings in these works have been limited, as more comprehensive evaluations have not been feasible. Image generative models are able to provide data creation at scale, but utilizing these models for texture synthesis has been unexplored and poses additional challenges both in creating accurate texture images and validating those images. In this work, we introduce an extensible methodology and corresponding new dataset for generating high-quality, diverse texture images capable of supporting a broad set of texture-based tasks. Our pipeline consists of: (1) developing prompts from a range of descriptors to serve as input to text-to-image models, (2) adopting and adapting Stable Diffusion pipelines to generate and filter the corresponding images, and (3) further filtering down to the highest quality images. Through this, we create the Prompted Textures Dataset (PTD), a dataset of 362,880 texture images that span 56 textures. During the process of generating images, we find that NSFW safety filters in image generation pipelines are highly sensitive to texture (and flag up to 60% of our texture images), uncovering a potential bias in these models and presenting unique challenges when working with texture data. Through both standard metrics and a human evaluation, we find that our dataset is high quality and diverse.
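
Steps (1)-(2) of the pipeline could be sketched with the diffusers library as below; the checkpoint, prompt template, and descriptor list are illustrative assumptions, and the NSFW/quality filtering stages of the actual pipeline are not shown.

```python
import torch
from diffusers import StableDiffusionPipeline

# load a text-to-image pipeline (checkpoint choice is illustrative)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

descriptors = ["woven wicker", "cracked dry mud", "brushed aluminum"]
for d in descriptors:
    prompt = f"a close-up photograph of a {d} texture, uniform, highly detailed"
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"texture_{d.replace(' ', '_')}.png")
```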

[CV-20] Anatomical Positional Embeddings

链接: https://arxiv.org/abs/2409.10291
作者: Mikhail Goncharov,Valentin Samokhin,Eugenia Soboleva,Roman Sokolov,Boris Shirokikh,Mikhail Belyaev,Anvar Kurmukov,Ivan Oseledets
关键词-EN: self-supervised model producing, individual medical image, anatomical positional embeddings, positional embeddings, medical image voxels
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a self-supervised model producing 3D anatomical positional embeddings (APE) of individual medical image voxels. APE encodes voxels’ anatomical closeness, i.e., voxels of the same organ or nearby organs always have closer positional embeddings than the voxels of more distant body parts. In contrast to the existing models of anatomical positional embeddings, our method is able to efficiently produce a map of voxel-wise embeddings for a whole volumetric input image, which makes it an optimal choice for different downstream applications. We train our APE model on 8400 publicly available CT images of abdomen and chest regions. We demonstrate its superior performance compared with the existing models on anatomical landmark retrieval and weakly-supervised few-shot localization of 13 abdominal organs. As a practical application, we show how to cheaply train APE to crop raw CT images to different anatomical regions of interest with 0.99 recall, while reducing the image volume by 10-100 times. The code and the pre-trained APE model are available at this https URL .

[CV-21] Enhancing Image Classification in Small and Unbalanced Datasets through Synthetic Data Augmentation

链接: https://arxiv.org/abs/2409.10286
作者: Neil De La Fuente,Mireia Majó,Irina Luzko,Henry Córdova,Gloria Fernández-Esparrach,Jorge Bernal
关键词-EN: Accurate and robust, present high imbalance, medical image classification, robust medical image, class-specific Variational Autoencoders
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and robust medical image classification is a challenging task, especially in application domains where available annotated datasets are small and present high imbalance between target classes. Considering that data acquisition is not always feasible, especially for underrepresented classes, our approach introduces a novel synthetic augmentation strategy using class-specific Variational Autoencoders (VAEs) and latent space interpolation to improve discrimination capabilities. By generating realistic, varied synthetic data that fills feature space gaps, we address issues of data scarcity and class imbalance. The method presented in this paper relies on the interpolation of latent representations within each class, thus enriching the training set and improving the model’s generalizability and diagnostic accuracy. The proposed strategy was tested in a small dataset of 321 images created to train and validate an automatic method for assessing the quality of cleanliness of esophagogastroduodenoscopy images. By combining real and synthetic data, an increase of over 18% in the accuracy of the most challenging underrepresented class was observed. The proposed strategy not only benefited the underrepresented class but also led to a general improvement in other metrics, including a 6% increase in global accuracy and precision.
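
The core augmentation step, interpolating between latent codes of two real samples from the same class and decoding the blend, can be sketched as follows. The encoder/decoder interfaces and the interpolation scheme are assumptions; the abstract does not specify the exact VAE architecture.

```python
# Sketch of class-specific latent interpolation for synthetic augmentation
# (illustrative, not the authors' implementation). `vae` is assumed to expose
# encode() returning (mu, logvar) and decode() mapping latents back to images.
import torch

def synthesize_for_class(vae, class_images, n_synthetic=100, device="cuda"):
    vae.eval().to(device)
    with torch.no_grad():
        # Encode all real images of one (underrepresented) class to latent means.
        mu, _logvar = vae.encode(class_images.to(device))
        synthetic = []
        for _ in range(n_synthetic):
            # Pick two real latents and blend them with a random coefficient.
            i, j = torch.randint(0, mu.size(0), (2,))
            alpha = torch.rand(1, device=device)
            z = alpha * mu[i] + (1 - alpha) * mu[j]
            synthetic.append(vae.decode(z.unsqueeze(0)))
        # Synthetic images to be mixed with the real training set.
        return torch.cat(synthetic, dim=0)
```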

[CV-22] Performance of Human Annotators in Object Detection and Segmentation of Remotely Sensed Data

链接: https://arxiv.org/abs/2409.10272
作者: Roni Blushtein-Livnon,Tal Svoray,Michael Dorman
关键词-EN: laboratory experiment designed, introduces a laboratory, designed to assess, assess the influence, human annotators
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 10 figures, 2 tables

点击查看摘要

Abstract:This study introduces a laboratory experiment designed to assess the influence of annotation strategies, levels of imbalanced data, and prior experience, on the performance of human annotators. The experiment focuses on labeling aerial imagery, using ArcGIS Pro tools, to detect and segment small-scale photovoltaic solar panels, selected as a case study for rectangular objects. The experiment is conducted using images with a pixel size of 0.15 m, involving both expert and non-expert participants, across different setup strategies and target-background ratio datasets. Our findings indicate that human annotators generally perform more effectively in object detection than in segmentation tasks. A marked tendency to commit more Type II errors (False Negatives, i.e., undetected objects) than Type I errors (False Positives, i.e. falsely detecting objects that do not exist) was observed across all experimental setups and conditions, suggesting a consistent bias in detection and segmentation processes. Performance was better in tasks with higher target-background ratios (i.e., more objects per unit area). Prior experience did not significantly impact performance and may, in some cases, even lead to overestimation in segmentation. These results provide evidence that human annotators are relatively cautious and tend to identify objects only when they are confident about them, prioritizing underestimation over overestimation. Annotators’ performance is also influenced by object scarcity, showing a decline in areas with extremely imbalanced datasets and a low ratio of target-to-background. These findings may enhance annotation strategies for remote sensing research while efficient human annotators are crucial in an era characterized by growing demands for high-quality training data to improve segmentation and detection models.

[CV-23] BAFNet: Bilateral Attention Fusion Network for Lightweight Semantic Segmentation of Urban Remote Sensing Images

链接: https://arxiv.org/abs/2409.10269
作者: Wentao Wang,Xili Wang
关键词-EN: Large-scale semantic segmentation, limited sample sizes, Large-scale semantic, achieve high performance, achieve high
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale semantic segmentation networks often achieve high performance, while their application can be challenging when faced with limited sample sizes and computational resources. In scenarios with restricted network size and computational complexity, models encounter significant challenges in capturing long-range dependencies and recovering detailed information in images. We propose a lightweight bilateral semantic segmentation network called bilateral attention fusion network (BAFNet) to efficiently segment high-resolution urban remote sensing images. The model consists of two paths, namely dependency path and remote-local path. The dependency path utilizes large kernel attention to acquire long-range dependencies in the image. Besides, multi-scale local attention and efficient remote attention are designed to construct remote-local path. Finally, a feature aggregation module is designed to effectively utilize the different features of the two paths. Our proposed method was tested on public high-resolution urban remote sensing datasets Vaihingen and Potsdam, with mIoU reaching 83.20% and 86.53%, respectively. As a lightweight semantic segmentation model, BAFNet not only outperforms advanced lightweight models in accuracy but also demonstrates comparable performance to non-lightweight state-of-the-art methods on two datasets, despite a tenfold variance in floating-point operations and a fifteenfold difference in network parameters.

[CV-24] Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

链接: https://arxiv.org/abs/2409.10262
作者: Minghan Chen,Guikun Chen,Wenguan Wang,Yi Yang
关键词-EN: scene graph generation, DETR introduces, Relation Assignment, Hybrid Relation Assignment, relation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:DETR introduces a simplified one-stage framework for scene graph generation (SGG). However, DETR-based SGG models face two challenges: i) Sparse supervision, as each image typically contains fewer than 10 relation annotations, while the models employ over 100 relation queries. This sparsity arises because each ground truth relation is assigned to only one single query during training. ii) False negative samples, since one ground truth relation may have multiple queries with similar matching scores. These suboptimally matched queries are simply treated as negative samples, causing the loss of valuable supervisory signals. As a response, we devise Hydra-SGG, a one-stage SGG method that adopts a new Hybrid Relation Assignment. This assignment combines a One-to-One Relation Assignment with a newly introduced IoU-based One-to-Many Relation Assignment. Specifically, each ground truth is assigned to multiple relation queries with high IoU subject-object boxes. This Hybrid Relation Assignment increases the number of positive training samples, alleviating sparse supervision. Moreover, we, for the first time, empirically show that self-attention over relation queries helps reduce duplicated relation predictions. We, therefore, propose Hydra Branch, a parameter-sharing auxiliary decoder without a self-attention layer. This design promotes One-to-Many Relation Assignment by enabling different queries to predict the same relation. Hydra-SGG achieves state-of-the-art performance with 10.6 mR@20 and 16.0 mR@50 on VG150, while only requiring 12 training epochs. It also sets a new state-of-the-art on Open Images V6 and GQA.
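
The IoU-based One-to-Many Relation Assignment can be illustrated with a small box-IoU sketch: a ground-truth relation is matched to every query whose subject and object boxes both exceed an IoU threshold. The threshold value and tensor layout below are assumptions, not the paper's exact settings.

```python
# Sketch of IoU-based one-to-many relation assignment (illustrative, not the
# authors' code). Boxes are (x1, y1, x2, y2); each query and each ground truth
# carries a subject box and an object box.
import torch
from torchvision.ops import box_iou

def one_to_many_assign(q_subj, q_obj, gt_subj, gt_obj, iou_thr=0.7):
    """Return a (num_gt, num_queries) boolean assignment matrix."""
    iou_s = box_iou(gt_subj, q_subj)   # (num_gt, num_queries)
    iou_o = box_iou(gt_obj, q_obj)
    # A query is a positive for a ground-truth relation only if both its
    # subject and object boxes overlap that relation's boxes sufficiently,
    # giving each ground truth multiple positive queries.
    return (iou_s >= iou_thr) & (iou_o >= iou_thr)
```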

[CV-25] SOLVR: Submap Oriented LiDAR-Visual Re-Localisation ICRA2025

链接: https://arxiv.org/abs/2409.10247
作者: Joshua Knights,Sebastián Barbas Laina,Peyman Moghadam,Stefan Leutenegger
关键词-EN: paper proposes SOLVR, based LiDAR-Visual re-localisation, learning based LiDAR-Visual, performs place recognition, sensor modalities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICRA2025

点击查看摘要

Abstract:This paper proposes SOLVR, a unified pipeline for learning based LiDAR-Visual re-localisation which performs place recognition and 6-DoF registration across sensor modalities. We propose a strategy to align the input sensor modalities by leveraging stereo image streams to produce metric depth predictions with pose information, followed by fusing multiple scene views from a local window using a probabilistic occupancy framework to expand the limited field-of-view of the camera. Additionally, SOLVR adopts a flexible definition of what constitutes positive examples for different training losses, allowing us to simultaneously optimise place recognition and registration performance. Furthermore, we replace RANSAC with a registration function that weights a simple least-squares fitting with the estimated inlier likelihood of sparse keypoint correspondences, improving performance in scenarios with a low inlier ratio between the query and retrieved place. Our experiments on the KITTI and KITTI360 datasets show that SOLVR achieves state-of-the-art performance for LiDAR-Visual place recognition and registration, particularly improving registration accuracy over larger distances between the query and retrieved place.
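
Replacing RANSAC with an inlier-likelihood-weighted least-squares fit can be sketched as a weighted Kabsch/Procrustes alignment over sparse keypoint correspondences. This is a generic formulation under assumed inputs, not SOLVR's exact estimator.

```python
# Weighted rigid registration of corresponding 3D keypoints (weighted Kabsch).
# The weights would come from the estimated inlier likelihood of each correspondence.
import numpy as np

def weighted_rigid_fit(src, dst, w):
    """src, dst: (N, 3) corresponding points; w: (N,) non-negative weights.
    Returns rotation R (3x3) and translation t (3,) with dst ≈ R @ src + t."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance between the centered point sets.
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```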

[CV-26] Robust Birds Eye View Segmentation by Adapting DINOv2 ECCV2024

链接: https://arxiv.org/abs/2409.10228
作者: Merve Rabia Barın,Görkay Aydemir,Fatma Güney
关键词-EN: Bird Eye View, Extracting a Bird, Eye View, Bird Eye, multiple camera images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 - 2nd Workshop on Vision-Centric Autonomous Driving (VCAD)

点击查看摘要

Abstract:Extracting a Bird’s Eye View (BEV) representation from multiple camera images offers a cost-effective, scalable alternative to LIDAR-based solutions in autonomous driving. However, the performance of the existing BEV methods drops significantly under various corruptions such as brightness and weather changes or camera failures. To improve the robustness of BEV perception, we propose to adapt a large vision foundational model, DINOv2, to BEV estimation using Low Rank Adaptation (LoRA). Our approach builds on the strong representation space of DINOv2 by adapting it to the BEV task in a state-of-the-art framework, SimpleBEV. Our experiments show increased robustness of BEV perception under various corruptions, with increasing gains from scaling up the model and the input resolution. We also showcase the effectiveness of the adapted representations in terms of fewer learnable parameters and faster convergence during training.
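
Low Rank Adaptation (LoRA) keeps the pretrained weight frozen and learns only a low-rank additive update. A minimal sketch of such a layer is shown below; the rank, scaling, and the way it would be wired into DINOv2/SimpleBEV are assumptions for illustration.

```python
# Minimal LoRA-wrapped linear layer (illustrative; not the authors' code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained projection
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen projection plus the trainable low-rank update B @ A.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```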

[CV-27] Neuromorphic Facial Analysis with Cross-Modal Supervision ECCV2024

链接: https://arxiv.org/abs/2409.10213
作者: Federico Becattini,Luca Cultrera,Lorenzo Berlincioni,Claudio Ferrari,Andrea Leonardo,Alberto Del Bimbo
关键词-EN: Traditional approaches, analyzing RGB frames, approaches for analyzing, capable of providing, providing a fine-grained
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication at the ECCV 2024 workshop on Neuromorphic Vision: Advantages and Applications of Event Cameras (NEVI)

点击查看摘要

Abstract:Traditional approaches for analyzing RGB frames are capable of providing a fine-grained understanding of a face from different angles by inferring emotions, poses, shapes, landmarks. However, when it comes to subtle movements standard RGB cameras might fall behind due to their latency, making it hard to detect micro-movements that carry highly informative cues to infer the true emotions of a subject. To address this issue, the usage of event cameras to analyze faces is gaining increasing interest. Nonetheless, all the expertise matured for RGB processing is not directly transferrable to neuromorphic data due to a strong domain shift and intrinsic differences in how data is represented. The lack of labeled data can be considered one of the main causes of this gap, yet gathering data is harder in the event domain since it cannot be crawled from the web and labeling frames should take into account event aggregation rates and the fact that static parts might not be visible in certain frames. In this paper, we first present FACEMORPHIC, a multimodal temporally synchronized face dataset comprising both RGB videos and event streams. The data is labeled at a video level with facial Action Units and also contains streams collected with a variety of applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space.

[CV-28] Garment Attribute Manipulation with Multi-level Attention ECCV2024

链接: https://arxiv.org/abs/2409.10206
作者: Vittorio Casula,Lorenzo Berlincioni,Luca Cultrera,Federico Becattini,Chiara Pero,Carmen Bisogni,Marco Bertini,Alberto Del Bimbo
关键词-EN: rapidly evolving field, online fashion shopping, interactive image retrieval, image retrieval systems, rapidly evolving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication at the ECCV 2024 workshop FashionAI

点击查看摘要

Abstract:In the rapidly evolving field of online fashion shopping, the need for more personalized and interactive image retrieval systems has become paramount. Existing methods often struggle with precisely manipulating specific garment attributes without inadvertently affecting others. To address this challenge, we propose GAMMA (Garment Attribute Manipulation with Multi-level Attention), a novel framework that integrates attribute-disentangled representations with a multi-stage attention-based architecture. GAMMA enables targeted manipulation of fashion image attributes, allowing users to refine their searches with high accuracy. By leveraging a dual-encoder Transformer and memory block, our model achieves state-of-the-art performance on popular datasets like Shopping100k and DeepFashion.

[CV-29] SteeredMarigold: Steering Diffusion Towards Depth Completion of Largely Incomplete Depth Maps

链接: https://arxiv.org/abs/2409.10202
作者: Jakub Gregorek,Lazaros Nalpantidis
关键词-EN: RGB-D sensors deployed, valid depth measurements, areas missing valid, missing valid depth, depth maps captured
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Even though the depth maps captured by RGB-D sensors deployed in real environments are often characterized by large areas missing valid depth measurements, the vast majority of depth completion methods still assume depth values covering all areas of the scene. To address this limitation, we introduce SteeredMarigold, a training-free, zero-shot depth completion method capable of producing metric dense depth, even for largely incomplete depth maps. SteeredMarigold achieves this by using the available sparse depth points as conditions to steer a denoising diffusion probabilistic model. Our method outperforms relevant top-performing methods on the NYUv2 dataset, in tests where no depth was provided for a large area, achieving state-of-the-art performance and exhibiting remarkable robustness against depth map incompleteness. Our code will be publicly available.

[CV-30] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

链接: https://arxiv.org/abs/2409.10197
作者: Weihao Ye,Qiong Wu,Wenhao Lin,Yiyi Zhou
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, exhibits obvious redundancy, progress in Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Recent Multimodal Large Language Models (MLLMs) often use a large number of image tokens to compensate for their visual shortcomings, which not only exhibits obvious redundancy but also greatly exacerbates the already high computation. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens still remains a challenge. In this paper, we propose a novel and training-free approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune considers token pruning as a statistical problem of MLLM and its objective is to find out an optimal pruning scheme that can minimize the divergence of the attention distributions before and after pruning. In practice, FitPrune can be quickly accomplished based on the attention statistics from a small batch of inference data, avoiding the expensive trials of MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that our FitPrune can not only reduce the computational complexity to a large extent, while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at this https URL.
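
A simplified version of attention-statistics-based visual token pruning, keeping only the visual tokens that receive the most attention up to a budget, is sketched below. FitPrune's actual objective (minimizing the divergence of attention distributions before and after pruning) is more involved; this only illustrates the basic mechanism under assumed tensor shapes.

```python
# Illustrative visual-token pruning by attention received (not FitPrune itself).
import torch

def prune_visual_tokens(visual_tokens, attn, keep_ratio=0.45):
    """visual_tokens: (B, N, D); attn: (B, heads, Q, N) attention onto the
    visual tokens, e.g. gathered from a small calibration batch."""
    # Average attention mass each visual token receives across heads and queries.
    score = attn.mean(dim=(1, 2))                              # (B, N)
    k = max(1, int(keep_ratio * visual_tokens.size(1)))
    idx = score.topk(k, dim=1).indices.sort(dim=1).values      # keep original order
    batch = torch.arange(visual_tokens.size(0)).unsqueeze(1)
    return visual_tokens[batch, idx]                           # (B, k, D)
```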

[CV-31] NEUSIS: A Compositional Neuro-Symbolic Framework for Autonomous Perception Reasoning and Planning in Complex UAV Search Missions

链接: https://arxiv.org/abs/2409.10196
作者: Zhixi Cai,Cristian Rojas Cardenas,Kevin Leo,Chenyuan Zhang,Kal Backman,Hanbing Li,Boying Li,Mahsa Ghorbanali,Stavya Datta,Lizhen Qu,Julian Gutierrez Santiago,Alexey Ignatiev,Yuan-Fang Li,Mor Vered,Peter J Stuckey,Maria Garcia de la Banda,Hamid Rezatofighi
关键词-EN: locate specific Entities, Entities of Interest, specific Entities, time limit, descriptions in large
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of autonomous UAV search missions, where a UAV must locate specific Entities of Interest (EOIs) within a time limit, based on brief descriptions in large, hazard-prone environments with keep-out zones. The UAV must perceive, reason, and make decisions with limited and uncertain information. We propose NEUSIS, a compositional neuro-symbolic system designed for interpretable UAV search and navigation in realistic scenarios. NEUSIS integrates neuro-symbolic visual perception, reasoning, and grounding (GRiD) to process raw sensory inputs, maintains a probabilistic world model for environment representation, and uses a hierarchical planning component (SNaC) for efficient path planning. Experimental results from simulated urban search missions using AirSim and Unreal Engine show that NEUSIS outperforms a state-of-the-art (SOTA) vision-language model and a SOTA search planning model in success rate, search efficiency, and 3D localization. These results demonstrate the effectiveness of our compositional neuro-symbolic approach in handling complex, real-world scenarios, making it a promising solution for autonomous UAV systems in search missions.

[CV-32] RealDiff: Real-world 3D Shape Completion using Self-Supervised Diffusion Models

链接: https://arxiv.org/abs/2409.10180
作者: Başak Melis Öcal,Maxim Tatarchenko,Sezer Karaoglu,Theo Gevers
关键词-EN: Point cloud completion, cloud completion aims, Point cloud, recover the complete, aims to recover
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Point cloud completion aims to recover the complete 3D shape of an object from partial observations. While approaches relying on synthetic shape priors achieved promising results in this domain, their applicability and generalizability to real-world data are still limited. To tackle this problem, we propose a self-supervised framework, namely RealDiff, that formulates point cloud completion as a conditional generation problem directly on real-world measurements. To better deal with noisy observations without resorting to training on synthetic data, we leverage additional geometric cues. Specifically, RealDiff simulates a diffusion process at the missing object parts while conditioning the generation on the partial input to address the multimodal nature of the task. We further regularize the training by matching object silhouettes and depth maps, predicted by our method, with the externally estimated ones. Experimental results show that our method consistently outperforms state-of-the-art methods in real-world point cloud completion.

[CV-33] ExelMap: Explainable Element-based HD-Map Change Detection and Update

链接: https://arxiv.org/abs/2409.10178
作者: Lena Wild,Ludvig Ericson,Rafael Valencia,Patric Jensfelt
关键词-EN: map change detection, Acquisition and maintenance, change detection, map change, map
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:Acquisition and maintenance are central problems in deploying high-definition (HD) maps for autonomous driving, with two lines of research prevalent in current literature: Online HD map generation and HD map change detection. However, the generated map’s quality is currently insufficient for safe deployment, and many change detection approaches fail to precisely localize and extract the changed map elements, hence lacking explainability and hindering a potential fleet-based cooperative HD map update. In this paper, we propose the novel task of explainable element-based HD map change detection and update. In extending recent approaches that use online mapping techniques informed with an outdated map prior for HD map updating, we present ExelMap, an explainable element-based map updating strategy that specifically identifies changed map elements. In this context, we discuss how currently used metrics fail to capture change detection performance, while allowing for unfair comparison between prior-less and prior-informed map generation methods. Finally, we present an experimental study on real-world changes related to pedestrian crossings of the Argoverse 2 Map Change Dataset. To the best of our knowledge, this is the first comprehensive problem investigation of real-world end-to-end element-based HD map change detection and update, and ExelMap the first proposed solution.

[CV-34] VideoRun2D: Cost-Effective Markerless Motion Capture for Sprint Biomechanics ICPR

链接: https://arxiv.org/abs/2409.10175
作者: Gonzalo Garrido-Lopez,Luis F. Gomez,Julian Fierrez,Aythami Morales,Ruben Tolosana,Javier Rueda,Enrique Navarro
关键词-EN: determinant ability, team sports, Sprinting, human biomechanics, Pattern Recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint of the paper presented to the Workshop on IAPR International Conference on Pattern Recognition (ICPR) 2024

点击查看摘要

Abstract:Sprinting is a determinant ability, especially in team sports. The kinematics of the sprint have been studied in the past using different methods specially developed considering human biomechanics and, among those methods, markerless systems stand out as very cost-effective. On the other hand, we have now multiple general methods for pixel and body tracking based on recent machine learning breakthroughs with excellent performance in body tracking, but these excellent trackers do not generally consider realistic human biomechanics. This investigation first adapts two of these general trackers (MoveNet and CoTracker) for realistic biomechanical analysis and then evaluates them in comparison to manual tracking (with key points manually marked using the software Kinovea). Our best resulting markerless body tracker particularly adapted for sprint biomechanics is termed VideoRun2D. The experimental development and assessment of VideoRun2D is reported on forty sprints recorded with a video camera from 5 different subjects, focusing our analysis on 3 key angles in sprint biomechanics: inclination of the trunk, flexion-extension of the hip and the knee. The CoTracker method showed huge differences compared to the manual labeling approach. However, the angle curves were correctly estimated by the MoveNet method, finding errors between 3.2° and 5.5°. In conclusion, our proposed VideoRun2D based on MoveNet core seems to be a helpful tool for evaluating sprint kinematics in some scenarios. On the other hand, the observed precision of this first version of VideoRun2D as a markerless sprint analysis system may not be yet enough for highly demanding applications. Future research lines towards that purpose are also discussed at the end: better tracking post-processing and user- and time-dependent adaptation.
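
The three angles reported (trunk inclination, hip and knee flexion-extension) ultimately reduce to angles between keypoint-defined segments. A generic sketch of computing the angle at a joint from three tracked 2D keypoints is shown below; the coordinates in the example are hypothetical.

```python
# Angle (in degrees) at keypoint b formed by segments b->a and b->c,
# e.g. hip-knee-ankle for knee flexion. Illustrative geometry only.
import numpy as np

def joint_angle(a, b, c):
    a, b, c = map(np.asarray, (a, b, c))
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

# Example with hypothetical tracked pixel coordinates for one frame:
print(joint_angle((410, 300), (420, 380), (400, 460)))   # knee angle estimate
```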

[CV-35] SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting

链接: https://arxiv.org/abs/2409.10161
作者: Mohammad Nomaan Qureshi,Sparsh Garg,Francisco Yandun,David Held,George Kantor,Abhishesh Silwal
关键词-EN: significant domain shift, RGB images, relying on RGB, manipulation policies relying, remains a critical
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sim2Real transfer, particularly for manipulation policies relying on RGB images, remains a critical challenge in robotics due to the significant domain shift between synthetic and real-world visual data. In this paper, we propose SplatSim, a novel framework that leverages Gaussian Splatting as the primary rendering primitive to reduce the Sim2Real gap for RGB-based manipulation policies. By replacing traditional mesh representations with Gaussian Splats in simulators, SplatSim produces highly photorealistic synthetic data while maintaining the scalability and cost-efficiency of simulation. We demonstrate the effectiveness of our framework by training manipulation policies within SplatSimand deploying them in the real world in a zero-shot manner, achieving an average success rate of 86.25%, compared to 97.5% for policies trained on real-world data.

[CV-36] Contrastive Learning for Character Detection in Ancient Greek Papyri

链接: https://arxiv.org/abs/2409.10156
作者: Vedasri Nakka,Andreas Fischer,Rolf Ingold,Lars Vogtlin
关键词-EN: ICDAR dataset, Greek letter recognition, dataset, SimCLR, Greek letter
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This thesis investigates the effectiveness of SimCLR, a contrastive learning technique, in Greek letter recognition, focusing on the impact of various augmentation techniques. We pretrain the SimCLR backbone using the Alpub dataset (pretraining dataset) and fine-tune it on a smaller ICDAR dataset (finetuning dataset) to compare SimCLR’s performance against traditional baseline models, which use cross-entropy and triplet loss functions. Additionally, we explore the role of different data augmentation strategies, essential for the SimCLR training process. Methodologically, we examine three primary approaches: (1) a baseline model using cross-entropy loss, (2) a triplet embedding model with a classification layer, and (3) a SimCLR pretrained model with a classification layer. Initially, we train the baseline, triplet, and SimCLR models using 93 augmentations on ResNet-18 and ResNet-50 networks with the ICDAR dataset. From these, the top four augmentations are selected using a statistical t-test. Pretraining of SimCLR is conducted on the Alpub dataset, followed by fine-tuning on the ICDAR dataset. The triplet loss model undergoes a similar process, being pretrained on the top four augmentations before fine-tuning on ICDAR. Our experiments show that SimCLR does not outperform the baselines in letter recognition tasks. The baseline model with cross-entropy loss demonstrates better performance than both SimCLR and the triplet loss model. This study provides a detailed evaluation of contrastive learning for letter recognition, highlighting SimCLR’s limitations while emphasizing the strengths of traditional supervised learning models in this task. We believe SimCLR’s cropping strategies may cause a semantic shift in the input image, reducing training effectiveness despite the large pretraining dataset. Our code is available at this https URL.
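
For readers unfamiliar with SimCLR, the contrastive objective being compared against the cross-entropy and triplet baselines is the NT-Xent loss over two augmented views of each image. A compact sketch follows; the temperature and batch handling are generic, not the thesis' exact setup.

```python
# NT-Xent (normalized temperature-scaled cross-entropy) loss sketch for SimCLR.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (B, D) projections of two augmented views of the same B images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D)
    sim = z @ z.T / temperature                           # pairwise cosine similarities
    n = z.size(0)
    sim.fill_diagonal_(float("-inf"))                     # exclude self-pairs
    # The positive for sample i is its other augmented view (offset by B).
    targets = torch.arange(n, device=z.device).roll(n // 2)
    return F.cross_entropy(sim, targets)
```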

[CV-37] AutoPET Challenge III: Testing the Robustness of Generalized Dice Focal Loss trained 3D Residual UNet for FDG and PSMA Lesion Segmentation from Whole-Body PET/CT Images

链接: https://arxiv.org/abs/2409.10151
作者: Shadab Ahamed
关键词-EN: Automated segmentation, quantitative image analysis, crucial first step, step in quantitative, Generalized Dice Focal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注: 11 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Automated segmentation of cancerous lesions in PET/CT scans is a crucial first step in quantitative image analysis. However, training deep learning models for segmentation with high accuracy is particularly challenging due to the variations in lesion size, shape, and radiotracer uptake. These lesions can appear in different parts of the body, often near healthy organs that also exhibit considerable uptake, making the task even more complex. As a result, creating an effective segmentation model for routine PET/CT image analysis is challenging. In this study, we utilized a 3D Residual UNet model and employed the Generalized Dice Focal Loss function to train the model on the AutoPET Challenge 2024 dataset. We conducted a 5-fold cross-validation and used an average ensembling technique using the models from the five folds. In the preliminary test phase for Task-1, the average ensemble achieved a mean Dice Similarity Coefficient (DSC) of 0.6687, mean false negative volume (FNV) of 10.9522 ml and mean false positive volume (FPV) 2.9684 ml. More details about the algorithm can be found on our GitHub repository: this https URL. The training code has been shared via the repository: this https URL.
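
As context for the loss used above, a hand-written sketch of a Dice-plus-focal combination on voxel-wise predictions is shown below for a binary lesion mask. This is an illustrative simplification; the actual experiments use the Generalized Dice Focal Loss as packaged in common medical-imaging toolkits.

```python
# Illustrative binary Dice + focal loss (not the exact AutoPET training code).
import torch
import torch.nn.functional as F

def dice_focal_loss(logits, target, gamma=2.0, eps=1e-6, focal_weight=1.0):
    """logits, target: (B, 1, D, H, W); target is a 0/1 float lesion mask."""
    prob = torch.sigmoid(logits)
    dims = (1, 2, 3, 4)
    inter = (prob * target).sum(dims)
    dice = 1 - (2 * inter + eps) / (prob.sum(dims) + target.sum(dims) + eps)

    # Focal term down-weights easy voxels via (1 - p_t)^gamma.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    focal = ((1 - p_t) ** gamma * bce).mean(dims)

    return (dice + focal_weight * focal).mean()
```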

[CV-38] P2U-SLAM: A Monocular Wide-FoV SLAM System Based on Point Uncertainty and Pose Uncertainty

链接: https://arxiv.org/abs/2409.10143
作者: Yufan Zhang,Kailun Yang,Ze Wang,Kaiwei Wang
关键词-EN: visual Simultaneous Localization, Simultaneous Localization, utilizes pose uncertainty, historical map points, visual Simultaneous
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: The source code will be made publicly available at this https URL

点击查看摘要

Abstract:This paper presents P2U-SLAM, a visual Simultaneous Localization And Mapping (SLAM) system with a wide Field of View (FoV) camera, which utilizes pose uncertainty and point uncertainty. While the wide FoV enables considerable repetitive observations of historical map points for matching cross-view features, the data properties of the historical map points and the poses of historical keyframes have changed during the optimization process. The neglect of data property changes triggers the absence of a partial information matrix in optimization and leads to the risk of long-term positioning performance degradation. The purpose of our research is to reduce the risk of the wide field of view visual input to the SLAM system. Based on the conditional probability model, this work reveals the definite impact of the above data properties changes on the optimization process, concretizes it as point uncertainty and pose uncertainty, and gives a specific mathematical form. P2U-SLAM respectively embeds point uncertainty and pose uncertainty into the tracking module and local mapping, and updates these uncertainties after each optimization operation including local mapping, map merging, and loop closing. We present an exhaustive evaluation in 27 sequences from two popular public datasets with wide-FoV visual input. P2U-SLAM shows excellent performance compared with other state-of-the-art methods. The source code will be made publicly available at this https URL.

[CV-39] PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion

链接: https://arxiv.org/abs/2409.10141
作者: Peng Li,Wangguandong Zheng,Yuan Liu,Tao Yu,Yangguang Li,Xingqun Qi,Mengfei Li,Xiaowei Chi,Siyu Xia,Wei Xue,Wenhan Luo,Qifeng Liu,Yike Guo
关键词-EN: tremendous progress, modeling is essential, Detailed and photorealistic, monocular RGB image, RGB image remains
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on CAPE and THuman2.1 datasets demonstrate PSHuman’s superiority in geometry details, texture fidelity, and generalization capability.

[CV-40] A Comparative Study of Open Source Computer Vision Models for Application on Small Data: The Case of CFRP Tape Laying

链接: https://arxiv.org/abs/2409.10104
作者: Thomas Fraunholz,Dennis Rall,Tim Köhler,Alfons Schuster,Monika Mayer,Lars Larsen
关键词-EN: Artificial Intelligence, automating existing processes, increasing role, materials and techniques, realm of industrial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the realm of industrial manufacturing, Artificial Intelligence (AI) is playing an increasing role, from automating existing processes to aiding in the development of new materials and techniques. However, a significant challenge arises in smaller, experimental processes characterized by limited training data availability, questioning the possibility to train AI models in such small data contexts. In this work, we explore the potential of Transfer Learning to address this challenge, specifically investigating the minimum amount of data required to develop a functional AI model. For this purpose, we consider the use case of quality control of Carbon Fiber Reinforced Polymer (CFRP) tape laying in aerospace manufacturing using optical sensors. We investigate the behavior of different open-source computer vision models with a continuous reduction of the training data. Our results show that the amount of data required to successfully train an AI model can be drastically reduced, and the use of smaller models does not necessarily lead to a loss of performance.

[CV-41] Adaptive Segmentation-Based Initialization for Steered Mixture of Experts Image Regression

链接: https://arxiv.org/abs/2409.10101
作者: Yi-Hsin Li,Sebastian Knorr,Mårten Sjöström,Thomas Sikora
关键词-EN: Gaussian Splatting, provide excellent efficiency, denoising and super-resolution, light-field compression, initialization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Kernel image regression methods have shown to provide excellent efficiency in many image processing task, such as image and light-field compression, Gaussian Splatting, denoising and super-resolution. The estimation of parameters for these methods frequently employ gradient descent iterative optimization, which poses significant computational burden for many applications. In this paper, we introduce a novel adaptive segmentation-based initialization method targeted for optimizing Steered-Mixture-of Experts (SMoE) gating networks and Radial-Basis-Function (RBF) networks with steering kernels. The novel initialization method allocates kernels into pre-calculated image segments. The optimal number of kernels, kernel positions, and steering parameters are derived per segment in an iterative optimization and kernel sparsification procedure. The kernel information from “local” segments is then transferred into a “global” initialization, ready for use in iterative optimization of SMoE, RBF, and related kernel image regression methods. Results show that drastic objective and subjective quality improvements are achievable compared to widely used regular grid initialization, “state-of-the-art” K-Means initialization and previously introduced segmentation-based initialization methods, while also drastically improving the sparsity of the regression models. For same quality, the novel initialization results in models with around 50% reduction of kernels. In addition, a significant reduction of convergence time is achieved, with overall run-time savings of up to 50%. The segmentation-based initialization strategy itself admits heavy parallel computation; in theory, it may be divided into as many tasks as there are segments in the images. By accessing only four parallel GPUs, run-time savings of already 50% for initialization are achievable.

[CV-42] Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

链接: https://arxiv.org/abs/2409.10095
作者: Huy-Dung Nguyen,Anass Bairouk,Mirjana Maras,Wei Xiao,Tsun-Hsuan Wang,Patrick Chareyre,Ramin Hasani,Marc Blanchon,Daniela Rus
关键词-EN: transform road safety, minimizing human error, reducing congestion, transform road, road safety
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autonomous driving holds great potential to transform road safety and traffic efficiency by minimizing human error and reducing congestion. A key challenge in realizing this potential is the accurate estimation of steering angles, which is essential for effective vehicle navigation and control. Recent breakthroughs in deep learning have made it possible to estimate steering angles directly from raw camera inputs. However, the limited available navigation data can hinder optimal feature learning, impacting the system’s performance in complex driving scenarios. In this paper, we propose a shared encoder trained on multiple computer vision tasks critical for urban navigation, such as depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By incorporating diverse visual information used by humans during navigation, this unified encoder might enhance steering angle estimation. To achieve effective multi-task learning within a single encoder, we introduce a multi-scale feature network for pose estimation to improve depth learning. Additionally, we employ knowledge distillation from a multi-backbone model pretrained on these navigation tasks to stabilize training and boost performance. Our findings demonstrate that a shared backbone trained on diverse visual tasks is capable of providing overall perception capabilities. While our performance in steering angle estimation is comparable to existing methods, the integration of human-like perception through multi-task learning holds significant potential for advancing autonomous driving systems. More details and the pretrained model are available at this https URL.

[CV-43] DDoS: Diffusion Distribution Similarity for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.10094
作者: Kun Fang,Qinghua Tao,Zuopeng Yang,Xiaolin Huang,Jie Yang
关键词-EN: distribution disparities, distribution, perceptual metrics, OoD, training distribution
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-Distribution (OoD) detection determines whether the given samples are from the training distribution of the classifier-under-protection, i.e., the In-Distribution (InD), or from a different OoD. Latest researches introduce diffusion models pre-trained on InD data to advocate OoD detection by transferring an OoD image into a generated one that is close to InD, so that one could capture the distribution disparities between original and generated images to detect OoD data. Existing diffusion-based detectors adopt perceptual metrics on the two images to measure such disparities, but ignore a fundamental fact: Perceptual metrics are devised essentially for human-perceived similarities of low-level image patterns, e.g., textures and colors, and are not advisable in evaluating distribution disparities, since images with different low-level patterns could possibly come from the same distribution. To address this issue, we formulate a diffusion-based detection framework that considers the distribution similarity between a tested image and its generated counterpart via a novel proper similarity metric in the informative feature space and probability space learned by the classifier-under-protection. An anomaly-removal strategy is further presented to enlarge such distribution disparities by removing abnormal OoD information in the feature space to facilitate the detection. Extensive empirical results unveil the insufficiency of perceptual metrics and the effectiveness of our distribution similarity framework with new state-of-the-art detection performance.
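
One concrete way to compare an image with its diffusion-generated counterpart in the probability space of the protected classifier, rather than with a perceptual metric, is a divergence between the two predicted class distributions, as sketched below. The choice of KL divergence and the scoring convention are assumptions for illustration and are not necessarily the exact metric proposed in the paper.

```python
# Illustrative distribution-space OoD score: KL divergence between the classifier's
# predictions on the original image and on its diffusion-generated counterpart.
import torch
import torch.nn.functional as F

@torch.no_grad()
def distribution_disparity(classifier, x_orig, x_gen):
    """Returns one score per sample; higher = larger disparity = more likely OoD."""
    p = F.log_softmax(classifier(x_orig), dim=1)   # log-probs on the original image
    q = F.log_softmax(classifier(x_gen), dim=1)    # log-probs on the generated image
    # Per-sample KL(p || q) over the class dimension.
    return F.kl_div(q, p, log_target=True, reduction="none").sum(dim=1)
```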

[CV-44] MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior

链接: https://arxiv.org/abs/2409.10090
作者: Weijing Tao,Xiaofeng Yang,Miaomiao Cui,Guosheng Lin
关键词-EN: work presents MotionCom, dynamically coherent results, Large Vision Language, diffusion based image, enabling automatic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This work presents MotionCom, a training-free motion-aware diffusion based image composition, enabling automatic and seamless integration of target objects into new scenes with dynamically coherent results without finetuning or optimization. Traditional approaches in this area suffer from two significant limitations: they require manual planning for object placement and often generate static compositions lacking motion realism. MotionCom addresses these issues by utilizing a Large Vision Language Model (LVLM) for intelligent planning, and a Video Diffusion prior for motion-infused image synthesis, streamlining the composition process. Our multi-modal Chain-of-Thought (CoT) prompting with LVLM automates the strategic placement planning of foreground objects, considering their potential motion and interaction within the scenes. Complementing this, we propose a novel method MotionPaint to distill motion-aware information from pretrained video diffusion models in the generation phase, ensuring that these objects are not only seamlessly integrated but also endowed with realistic motion. Extensive quantitative and qualitative results highlight MotionCom’s superiority, showcasing its efficiency in streamlining the planning process and its capability to produce compositions that authentically depict motion and interaction.

[CV-45] DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion

链接: https://arxiv.org/abs/2409.10080
作者: Yuchen Guo,Ruoxiang Xu,Rongcheng Li,Zhenghao Wu,Weifeng Su
关键词-EN: integrate complementary data, complementary data information, Multi-modality image fusion, Multi-modality image, image fusion aims
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-modality image fusion aims to integrate complementary data information from different imaging modalities into a single image. Existing methods often generate either blurry fused images that lose fine-grained semantic information or unnatural fused images that appear perceptually cropped from the inputs. In this work, we propose a novel two-phase discriminative autoencoder framework, termed DAE-Fuse, that generates sharp and natural fused images. In the adversarial feature extraction phase, we introduce two discriminative blocks into the encoder-decoder architecture, providing an additional adversarial loss to better guide feature extraction by reconstructing the source images. While the two discriminative blocks are adapted in the attention-guided cross-modality fusion phase to distinguish the structural differences between the fused output and the source inputs, injecting more naturalness into the results. Extensive experiments on public infrared-visible, medical image fusion, and downstream object detection datasets demonstrate our method’s superiority and generalizability in both quantitative and qualitative evaluations.

[CV-46] owards Physically-Realizable Adversarial Attacks in Embodied Vision Navigation ICRA

链接: https://arxiv.org/abs/2409.10071
作者: Meng Chen,Jiawei Tu,Chao Qi,Yonghao Dang,Feng Zhou,Wei Wei,Jianqin Yin
关键词-EN: deep neural networks, safety-critical environments raises, environments raises concerns, embodied navigation agents, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 8 pages, 6 figures, submitted to the 2025 IEEE International Conference on Robotics Automation (ICRA)

点击查看摘要

Abstract:The deployment of embodied navigation agents in safety-critical environments raises concerns about their vulnerability to adversarial attacks on deep neural networks. However, current attack methods often lack practicality due to challenges in transitioning from the digital to the physical world, while existing physical attacks for object detection fail to achieve both multi-view effectiveness and naturalness. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches with learnable textures and opacity to objects. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which uses feedback from the navigation model to optimize the patch’s texture. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, where opacity is refined after texture optimization. Experimental results show our adversarial patches reduce navigation success rates by about 40%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: [this https URL].
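
The attack's core parameterization, a patch with a learnable texture and learnable per-pixel opacity that is alpha-composited onto the object region, can be sketched as follows. The patch size, placement, and optimization schedule are assumptions; the full method additionally uses object-aware multi-view sampling and a two-stage opacity refinement.

```python
# Sketch of a learnable-texture, learnable-opacity adversarial patch (illustrative).
import torch

class AdversarialPatch(torch.nn.Module):
    def __init__(self, h=64, w=64):
        super().__init__()
        self.texture = torch.nn.Parameter(torch.rand(3, h, w))         # RGB texture
        self.opacity = torch.nn.Parameter(torch.full((1, h, w), 0.5))  # per-pixel alpha

    def apply(self, image, top, left):
        """Alpha-composite the patch onto `image` (3, H, W) at (top, left)."""
        tex = self.texture.clamp(0, 1)
        alpha = self.opacity.clamp(0, 1)
        h, w = tex.shape[1:]
        out = image.clone()
        region = out[:, top:top + h, left:left + w]
        out[:, top:top + h, left:left + w] = alpha * tex + (1 - alpha) * region
        return out

# Training idea: maximize the navigation model's loss w.r.t. the texture first,
# then refine the opacity, backpropagating through `apply`.
```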

[CV-47] GlobalMapNet: An Online Framework for Vectorized Global HD Map Construction

链接: https://arxiv.org/abs/2409.10063
作者: Anqi Shi,Yuze Cai,Xiangyu Chen,Jian Pu,Zeyu Fu,Hong Lu
关键词-EN: autonomous driving systems, driving systems, essential for autonomous, autonomous driving, map
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:High-definition (HD) maps are essential for autonomous driving systems. Traditionally, an expensive and labor-intensive pipeline is implemented to construct HD maps, which is limited in scalability. In recent years, crowdsourcing and online mapping have emerged as two alternative methods, but they have limitations respectively. In this paper, we provide a novel methodology, namely global map construction, to perform direct generation of vectorized global maps, combining the benefits of crowdsourcing and online mapping. We introduce GlobalMapNet, the first online framework for vectorized global HD map construction, which updates and utilizes a global map on the ego vehicle. To generate the global map from scratch, we propose GlobalMapBuilder to match and merge local maps continuously. We design a new algorithm, Map NMS, to remove duplicate map elements and produce a clean map. We also propose GlobalMapFusion to aggregate historical map information, improving consistency of prediction. We examine GlobalMapNet on two widely recognized datasets, Argoverse2 and nuScenes, showing that our framework is capable of generating globally consistent results.

[CV-48] DENSER: 3D Gaussians Splatting for Scene Reconstruction of Dynamic Urban Environments

链接: https://arxiv.org/abs/2409.10041
作者: Mahmud A. Mohamad,Gamal Elghazaly,Arthur Hubert,Raphael Frank
关键词-EN: dynamic urban environments, paper presents DENSER, Gaussian splatting, dynamic objects, effective approach leveraging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents DENSER, an efficient and effective approach leveraging 3D Gaussian splatting (3DGS) for the reconstruction of dynamic urban environments. While several methods for photorealistic scene representations, both implicitly using neural radiance fields (NeRF) and explicitly using 3DGS, have shown promising results in scene reconstruction of relatively complex dynamic scenes, modeling the dynamic appearance of foreground objects tends to be challenging, limiting the applicability of these methods to capture subtleties and details of the scenes, especially far dynamic objects. To this end, we propose DENSER, a framework that significantly enhances the representation of dynamic objects and accurately models the appearance of dynamic objects in the driving scene. Instead of directly using Spherical Harmonics (SH) to model the appearance of dynamic objects, we introduce and integrate a new method aiming at dynamically estimating SH bases using wavelets, resulting in better representation of dynamic objects appearance in both space and time. Besides object appearance, DENSER enhances object shape representation through densification of its point cloud across multiple scene frames, resulting in faster convergence of model training. Extensive evaluations on the KITTI dataset show that the proposed approach significantly outperforms state-of-the-art methods by a wide margin. Source codes and models will be uploaded to this repository: this https URL

[CV-49] AttnMod: Attention-Based New Art Styles

链接: https://arxiv.org/abs/2409.10028
作者: Shih-Chieh Su
关键词-EN: Imagine a human, hoping to create, create a painting, human artist, generated photo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Imagine a human artist looking at the generated photo of a diffusion model, and hoping to create a painting out of it. There could be some feature of the object in the photo that the artist wants to emphasize, some color to disperse, some silhouette to twist, or some part of the scene to be materialized. These intentions can be viewed as the modification of the cross attention from the text prompt onto the UNet during the denoising diffusion. This work presents AttnMod, which modifies attention to create new, unpromptable art styles out of existing diffusion models. The style-creating behavior is studied across different setups.
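
Modifying the cross-attention from specific prompt tokens onto the UNet can be illustrated by reweighting the attention placed on selected token indices during the attention computation. The sketch below is a generic attention reweighting with an assumed scaling factor and token selection; AttnMod's actual modification schedule differs.

```python
# Illustrative cross-attention reweighting: amplify or suppress the contribution
# of chosen prompt tokens after softmax (not AttnMod's actual implementation).
import torch
import torch.nn.functional as F

def modified_cross_attention(q, k, v, token_idx, scale_factor=1.5):
    """q: (B, Nq, D) image queries; k, v: (B, Nt, D) text keys/values.
    token_idx: indices of prompt tokens whose influence should be rescaled."""
    d = q.size(-1)
    attn = F.softmax(q @ k.transpose(1, 2) / d**0.5, dim=-1)   # (B, Nq, Nt)
    scale = torch.ones_like(attn)
    scale[:, :, token_idx] = scale_factor
    attn = attn * scale
    attn = attn / attn.sum(dim=-1, keepdim=True)               # renormalize rows
    return attn @ v
```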

[CV-50] LithoHoD: A Litho Simulator-Powered Framework for IC Layout Hotspot Detection

链接: https://arxiv.org/abs/2409.10021
作者: Hao-Chiang Shao,Guan-Yu Chen,Yu-Hsien Lin,Chia-Wen Lin,Shao-Yun Fang,Pin-Yian Tsai,Yan-Hsiu Liu
关键词-EN: VLSI fabrication technology, increased layout density, advances in VLSI, VLSI fabrication, hotspot detection techniques
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages to appear in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

点击查看摘要

Abstract:Recent advances in VLSI fabrication technology have led to die shrinkage and increased layout density, creating an urgent demand for advanced hotspot detection techniques. However, by taking an object detection network as the backbone, recent learning-based hotspot detectors learn to recognize only the problematic layout patterns in the training data. This fact makes these hotspot detectors difficult to generalize to real-world scenarios. We propose a novel lithography simulator-powered hotspot detection framework to overcome this difficulty. Our framework integrates a lithography simulator with an object detection backbone, merging the extracted latent features from both the simulator and the object detector via well-designed cross-attention blocks. Consequently, the proposed framework can be used to detect potential hotspot regions based on i) the variation of possible circuit shape deformation estimated by the lithography simulator, and ii) the problematic layout patterns already known. To this end, we utilize RetinaNet with a feature pyramid network as the object detection backbone and leverage LithoNet as the lithography simulator. Extensive experiments demonstrate that our proposed simulator-guided hotspot detection framework outperforms previous state-of-the-art methods on real-world data.

[CV-51] 2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction ECCV2024

链接: https://arxiv.org/abs/2409.09969
作者: Atsuya Nakata,Takao Yamanaka
关键词-EN: Social Networking Services, Social Networking, Networking Services, including virtual reality, omni-directional image synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: ECCV2024 this https URL

点击查看摘要

Abstract:Omni-directional images have been increasingly used in various applications, including virtual reality and SNS (Social Networking Services). However, their availability is comparatively limited in contrast to normal field of view (NFoV) images, since specialized cameras are required to take omni-directional images. Consequently, several methods have been proposed based on generative adversarial networks (GAN) to synthesize omni-directional images, but these approaches have shown difficulties in training of the models, due to instability and/or significant time consumption in the training. To address these problems, this paper proposes a novel omni-directional image synthesis method, 2S-ODIS (Two-Stage Omni-Directional Image Synthesis), which generates high-quality omni-directional images while drastically reducing the training time. This was realized by utilizing the VQGAN (Vector Quantized GAN) model pre-trained on a large-scale NFoV image database such as ImageNet without fine-tuning. Since this pre-trained model does not represent distortions of omni-directional images in the equi-rectangular projection (ERP), it cannot be applied directly to the omni-directional image synthesis in ERP. Therefore, a two-stage structure was adopted to first create a global coarse image in ERP and then refine the image by integrating multiple local NFoV images in the higher resolution to compensate for the distortions in ERP, both of which are based on the pre-trained VQGAN model. As a result, the proposed method, 2S-ODIS, reduced the training time from 14 days in OmniDreamer to four days while achieving higher image quality.

[CV-52] Artificial Intelligence-Based Opportunistic Coronary Calcium Screening in the Veterans Affairs National Healthcare System

链接: https://arxiv.org/abs/2409.09968
作者: Raffi Hagopian,Timothy Strebel,Simon Bernatz,Gregory A Myers,Erik Offerman,Eric Zuniga,Cy Y Kim,Angie T Ng,James A Iwaz,Sunny P Singh,Evan P Carey,Michael J Kim,R Spencer Schaefer,Jeannie Yu,Amilcare Gentili,Hugo JWL Aerts
关键词-EN: Coronary artery calcium, Coronary artery, CAC, artery calcium, cardiovascular events
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Coronary artery calcium (CAC) is highly predictive of cardiovascular events. While millions of chest CT scans are performed annually in the United States, CAC is not routinely quantified from scans done for non-cardiac purposes. A deep learning algorithm was developed using 446 expert segmentations to automatically quantify CAC on non-contrast, non-gated CT scans (AI-CAC). Our study differs from prior works as we leverage imaging data across the Veterans Affairs national healthcare system, from 98 medical centers, capturing extensive heterogeneity in imaging protocols, scanners, and patients. AI-CAC performance on non-gated scans was compared against clinical standard ECG-gated CAC scoring. Non-gated AI-CAC differentiated zero vs. non-zero and less than 100 vs. 100 or greater Agatston scores with accuracies of 89.4% (F1 0.93) and 87.3% (F1 0.89), respectively, in 795 patients with paired gated scans within a year of a non-gated CT scan. Non-gated AI-CAC was predictive of 10-year all-cause mortality (CAC 0 vs. >400 group: 25.4% vs. 60.2%, Cox HR 3.49, p < 0.005), and composite first-time stroke, MI, or death (CAC 0 vs. >400 group: 33.5% vs. 63.8%, Cox HR 3.00, p < 0.005). In a screening dataset of 8,052 patients with low-dose lung cancer-screening CTs (LDCT), 3,091/8,052 (38.4%) individuals had AI-CAC >400. Four cardiologists qualitatively reviewed LDCT images from a random sample of 400 AI-CAC patients and verified that 527/531 (99.2%) would benefit from lipid-lowering therapy. To the best of our knowledge, this is the first non-gated CT CAC algorithm developed across a national healthcare system, on multiple imaging protocols, without filtering intra-cardiac hardware, and compared against a strong gated CT reference. We report superior performance relative to previous CAC algorithms evaluated against paired gated scans that included patients with intra-cardiac hardware.
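
For context, the clinical Agatston score that AI-CAC reproduces from non-gated scans is an area-times-density-weight sum over calcified lesions above 130 HU. A simplified per-slice sketch is given below; it ignores the standard minimum-lesion-area rule and gated 3 mm slice conventions, and is not the AI-CAC model itself.

```python
# Simplified single-slice Agatston scoring (illustrative background only).
import numpy as np
from scipy import ndimage

def agatston_slice(hu, pixel_area_mm2):
    """hu: 2D array of Hounsfield units for one axial slice within the coronaries."""
    mask = hu >= 130                          # calcification threshold
    labels, n = ndimage.label(mask)           # connected calcified lesions
    score = 0.0
    for i in range(1, n + 1):
        lesion = labels == i
        area = lesion.sum() * pixel_area_mm2
        peak = hu[lesion].max()
        # Density weight from the lesion's peak attenuation.
        weight = 1 if peak < 200 else 2 if peak < 300 else 3 if peak < 400 else 4
        score += area * weight
    return score   # the total Agatston score sums this over all slices
```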

[CV-53] Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection

链接: https://arxiv.org/abs/2409.09953
作者: Xiang Fang,Arvind Easwaran,Blaise Genest
关键词-EN: producing unreliable predictions, reject test samples, prevent models trained, OOD action detection, semantic shifts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MIPR 2024

点击查看摘要

Abstract:Out-of-distribution (OOD) detection targets to detect and reject test samples with semantic shifts, to prevent models trained on in-distribution (ID) dataset from producing unreliable predictions. Existing works only extract the appearance features on image datasets, and cannot handle dynamic multimedia scenarios with much motion information. Therefore, we target a more realistic and challenging OOD detection task: OOD action detection (ODAD). Given an untrimmed video, ODAD first classifies the ID actions and recognizes the OOD actions, and then localizes ID and OOD actions. To this end, in this paper, we propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN), which explores both appearance features and motion contexts to reason spatial-temporal inter-object interaction for ODAD.Firstly, we design separate appearance and motion branches to extract corresponding appearance-oriented and motion-aspect object representations. In each branch, we construct a spatial-temporal graph to reason appearance-guided and motion-driven inter-object interaction. Then, we design an appearance-motion attention module to fuse the appearance and motion features for final action detection. Experimental results on two challenging datasets show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.

[CV-54] Towards Real-Time Generation of Delay-Compensated Video Feeds for Outdoor Mobile Robot Teleoperation

链接: https://arxiv.org/abs/2409.09921
作者: Neeloy Chakraborty,Yixiao Fang,Andre Schreiber,Tianchen Ji,Zhe Huang,Aganze Mihigo,Cassidy Wall,Abdulrahman Almana,Katherine Driggs-Campbell
关键词-EN: agricultural robots remotely, control agricultural robots, important technology, technology to enable, control agricultural
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Teleoperation is an important technology to enable supervisors to control agricultural robots remotely. However, environmental factors in dense crop rows and limitations in network infrastructure hinder the reliability of data streamed to teleoperators. These issues result in delayed and variable frame rate video feeds that often deviate significantly from the robot’s actual viewpoint. We propose a modular learning-based vision pipeline to generate delay-compensated images in real-time for supervisors. Our extensive offline evaluations demonstrate that our method generates more accurate images compared to state-of-the-art approaches in our setting. Additionally, we are one of the few works to evaluate a delay-compensation method in outdoor field environments with complex terrain on data from a real robot in real-time. Additional videos are provided at this https URL.

[CV-55] Forearm Ultrasound based Gesture Recognition on Edge

链接: https://arxiv.org/abs/2409.09915
作者: Keshav Bimbraw,Haichong K. Zhang,Bashima Islam
关键词-EN: demonstrated significant potential, Ultrasound imaging, hand gesture classification, accurate hand gesture, demonstrated significant
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Please contact the authors for code and any additional questions pertaining to the project. You can reach Keshav Bimbraw at bimbrawkeshav at gmail dot com

点击查看摘要

Abstract:Ultrasound imaging of the forearm has demonstrated significant potential for accurate hand gesture classification. Despite this progress, there has been limited focus on developing a stand-alone end- to-end gesture recognition system which makes it mobile, real-time and more user friendly. To bridge this gap, this paper explores the deployment of deep neural networks for forearm ultrasound-based hand gesture recognition on edge devices. Utilizing quantization techniques, we achieve substantial reductions in model size while maintaining high accuracy and low latency. Our best model, with Float16 quantization, achieves a test accuracy of 92% and an inference time of 0.31 seconds on a Raspberry Pi. These results demonstrate the feasibility of efficient, real-time gesture recognition on resource-limited edge devices, paving the way for wearable ultrasound-based systems.
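
The abstract reports Float16 quantization for edge deployment. Below is a minimal sketch using TensorFlow Lite's float16 post-training quantization; the placeholder CNN and file names are assumptions, since the authors' actual architecture and export pipeline are not specified here.

```python
import tensorflow as tf

# Placeholder CNN standing in for the forearm-ultrasound gesture classifier
# (the real architecture is not given in the abstract).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 gesture classes
])

# Post-training float16 quantization, roughly halving the model size
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open("gesture_fp16.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting `.tflite` file can then be run with the TFLite interpreter on a device such as a Raspberry Pi.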

[CV-56] Rapid Adaptation of Earth Observation Foundation Models for Segmentation

链接: https://arxiv.org/abs/2409.09907
作者: Karthick Panner Selvam,Raul Ramos-Pollan,Freddie Kalaitzis
关键词-EN: fine-tuning Earth Observation, Earth Observation, fine-tuning Earth, study investigates, investigates the efficacy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages 2 figures

点击查看摘要

Abstract:This study investigates the efficacy of Low-Rank Adaptation (LoRA) in fine-tuning Earth Observation (EO) foundation models for flood segmentation. We hypothesize that LoRA, a parameter-efficient technique, can significantly accelerate the adaptation of large-scale EO models to this critical task while maintaining high performance. We apply LoRA to fine-tune a state-of-the-art EO foundation model pre-trained on diverse satellite imagery, using a curated dataset of flood events. Our results demonstrate that LoRA-based fine-tuning (r-256) improves F1 score by 6.66 points and IoU by 0.11 compared to a frozen encoder baseline, while significantly reducing computational costs. Notably, LoRA outperforms full fine-tuning, which proves computationally infeasible on our hardware. We further assess generalization through out-of-distribution (OOD) testing on a geographically distinct flood event, where LoRA configurations show improved OOD performance over the baseline. This work contributes to research on efficient adaptation of foundation models for specialized EO tasks, with implications for rapid response systems in disaster management. Our findings demonstrate LoRA’s potential for enabling faster deployment of accurate flood segmentation models in resource-constrained, time-critical scenarios.
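
For readers unfamiliar with LoRA, here is a minimal PyTorch sketch of a low-rank adapter wrapping a frozen linear layer; r=256 matches the r-256 configuration quoted above, while the module names, scaling convention, and the 768-dim example layer are generic assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (generic LoRA)."""
    def __init__(self, base: nn.Linear, r: int = 256, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage: wrap a projection inside a (hypothetical) EO foundation model encoder
proj = nn.Linear(768, 768)
adapted = LoRALinear(proj, r=256)
out = adapted(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
```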

[CV-57] Enhancing Visual Inertial SLAM with Magnetic Measurements

链接: https://arxiv.org/abs/2409.09904
作者: Bharat Joshi,Ioannis Rekleitis
关键词-EN: introducing tightly-coupled fusion, visual inertial odometry, introducing tightly-coupled, tightly-coupled fusion, relative magnetometer orientation
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents an extension to visual inertial odometry (VIO) by introducing tightly-coupled fusion of magnetometer measurements. A sliding window of keyframes is optimized by minimizing re-projection errors, relative inertial errors, and relative magnetometer orientation errors. The results of IMU orientation propagation are used to efficiently transform magnetometer measurements between frames producing relative orientation constraints between consecutive frames. The soft and hard iron effects are calibrated using an ellipsoid fitting algorithm. The introduction of magnetometer data results in significant reductions in the orientation error and also in recovery of the true yaw orientation with respect to the magnetic north. The proposed framework operates in all environments with slow-varying magnetic fields, mainly outdoors and underwater. We have focused our work on the underwater domain, especially in underwater caves, as the narrow passage and turbulent flow make it difficult to perform loop closures and reset the localization drift. The underwater caves present challenges to VIO due to the absence of ambient light and the confined nature of the environment, while also being a crucial source of fresh water and providing valuable historical records. Experimental results from underwater caves demonstrate the improvements in accuracy and robustness introduced by the proposed VIO extension.

[CV-58] GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion

链接: https://arxiv.org/abs/2409.09896
作者: Vitor Guizilini,Pavel Tokmakov,Achal Dave,Rares Ambrus
关键词-EN: computer vision, long-standing problem, problem in computer, Abstract, single image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:3D reconstruction from a single image is a long-standing problem in computer vision. Learning-based methods address its inherent scale ambiguity by leveraging increasingly large labeled and unlabeled datasets, to produce geometric priors capable of generating accurate predictions across domains. As a result, state of the art approaches show impressive performance in zero-shot relative and metric depth estimation. Recently, diffusion models have exhibited remarkable scalability and generalizable properties in their learned representations. However, because these models repurpose tools originally designed for image generation, they can only operate on dense ground-truth, which is not available for most depth labels, especially in real-world settings. In this paper we present GRIN, an efficient diffusion model designed to ingest sparse unstructured training data. We use image features with 3D geometric positional encodings to condition the diffusion process both globally and locally, generating depth predictions at a pixel-level. With comprehensive experiments across eight indoor and outdoor datasets, we show that GRIN establishes a new state of the art in zero-shot metric monocular depth estimation even when trained from scratch.

[CV-59] Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation

链接: https://arxiv.org/abs/2409.09893
作者: Qilong Zhangli,Di Liu,Abhishek Aich,Dimitris Metaxas,Samuel Schulter
关键词-EN: Leveraging multiple training, Leveraging multiple, image segmentation models, scale up image, models is beneficial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well-defined ground truth with non-overlapping mask layouts and mutually exclusive semantics. However, merging them for multi-dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class “person” in one dataset and class “face” in another will require multilabel handling for certain pixels. Existing methods struggle with this setting, particularly when evaluated on label spaces mixed from the individual training sets. To overcome these issues, we introduce a simple yet effective multi-dataset training approach by integrating language-based embeddings of class names and label space-specific query embeddings. Our method maintains high performance regardless of the underlying inconsistencies between training datasets. Notably, on four benchmark datasets with label space inconsistencies during inference, we outperform previous methods by 1.6% mIoU for semantic segmentation, 9.1% PQ for panoptic segmentation, 12.1% AP for instance segmentation, and 3.0% in the newly proposed PIQ metric.

[CV-60] REG: Refined Generalized Focal Loss for Road Asset Detection on Thai Highways Using Vision-Based Detection and Segmentation Models

链接: https://arxiv.org/abs/2409.09877
作者: Teerapong Panboonyuen
关键词-EN: Refined Generalized Focal, Generalized Focal Loss, advanced Refined Generalized, Refined Generalized, Generalized Focal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:This paper introduces a novel framework for detecting and segmenting critical road assets on Thai highways using an advanced Refined Generalized Focal Loss (REG) formulation. Integrated into state-of-the-art vision-based detection and segmentation models, the proposed method effectively addresses class imbalance and the challenges of localizing small, underrepresented road elements, including pavilions, pedestrian bridges, information signs, single-arm poles, bus stops, warning signs, and concrete guardrails. To improve both detection and segmentation accuracy, a multi-task learning strategy is adopted, optimizing REG across multiple tasks. REG is further enhanced by incorporating a spatial-contextual adjustment term, which accounts for the spatial distribution of road assets, and a probabilistic refinement that captures prediction uncertainty in complex environments, such as varying lighting conditions and cluttered backgrounds. Our rigorous mathematical formulation demonstrates that REG minimizes localization and classification errors by applying adaptive weighting to hard-to-detect instances while down-weighting easier examples. Experimental results show a substantial performance improvement, achieving a mAP50 of 80.34 and an F1-score of 77.87, significantly outperforming conventional methods. This research underscores the capability of advanced loss function refinements to enhance the robustness and accuracy of road asset detection and segmentation, thereby contributing to improved road safety and infrastructure management. For an in-depth discussion of the mathematical background and related methods, please refer to previous work available at this https URL.
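
The exact REG formulation is not reproduced in the abstract. As a hedged illustration of the "up-weight hard examples, down-weight easy ones" idea it builds on, here is the classic focal loss of Lin et al. in PyTorch; REG's spatial-contextual adjustment and probabilistic refinement terms are not included.

```python
import torch
import torch.nn.functional as F

def focal_style_loss(logits, targets, gamma: float = 2.0):
    """Illustrative focal loss: down-weights well-classified (easy) examples.

    Shown only to illustrate the adaptive-weighting idea behind REG;
    this is not the paper's REG formulation.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample CE
    p_t = torch.exp(-ce)                                     # prob. of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 7)           # e.g. 7 road-asset classes
targets = torch.randint(0, 7, (8,))
print(focal_style_loss(logits, targets))
```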

[CV-61] Towards Kinetic Manipulation of the Latent Space

链接: https://arxiv.org/abs/2409.09867
作者: Diego Porres
关键词-EN: Graphical User Interfaces, valleys and mountains, Convolutional Neural Networks, generative models, models are rich
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The latent spaces of many generative models are rich in unexplored valleys and mountains. The majority of tools used for exploring them are so far limited to Graphical User Interfaces (GUIs). While specialized hardware can be used for this task, we show that a simple feature extraction of pre-trained Convolutional Neural Networks (CNNs) from a live RGB camera feed does a very good job at manipulating the latent space with simple changes in the scene, with vast room for improvement. We name this new paradigm Visual-reactive Interpolation, and the full code can be found at this https URL.

[CV-62] Revisiting Physical-World Adversarial Attack on Traffic Sign Recognition: A Commercial Systems Perspective NDSS2025

链接: https://arxiv.org/abs/2409.09860
作者: Ningfei Wang,Shaoyuan Xie,Takami Sato,Yunpeng Luo,Kaidi Xu,Qi Alfred Chen
关键词-EN: Traffic Sign Recognition, correct driving automation, Sign Recognition, commercial TSR systems, Traffic Sign
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NDSS 2025

点击查看摘要

Abstract:Traffic Sign Recognition (TSR) is crucial for safe and correct driving automation. Recent works revealed a general vulnerability of TSR models to physical-world adversarial attacks, which can be low-cost, highly deployable, and capable of causing severe attack effects such as hiding a critical traffic sign or spoofing a fake one. However, so far existing works generally only considered evaluating the attack effects on academic TSR models, leaving the impacts of such attacks on real-world commercial TSR systems largely unclear. In this paper, we conduct the first large-scale measurement of physical-world adversarial attacks against commercial TSR systems. Our testing results reveal that it is possible for existing attack works from academia to have highly reliable (100%) attack success against certain commercial TSR system functionality, but such attack capabilities are not generalizable, leading to much lower-than-expected attack success rates overall. We find that one potential major factor is a spatial memorization design that commonly exists in today’s commercial TSR systems. We design new attack success metrics that can mathematically model the impacts of such design on the TSR system-level attack success, and use them to revisit existing attacks. Through these efforts, we uncover 7 novel observations, some of which directly challenge the observations or claims in prior works due to the introduction of the new metrics.

[CV-63] Tracking Virtual Meetings in the Wild: Re-identification in Multi-Participant Virtual Meetings ECCV2024

链接: https://arxiv.org/abs/2409.09841
作者: Oriel Perl,Ido Leshem,Uria Franko,Yuval Goldman
关键词-EN: widely adopted virtual, recent years, workplaces and educational, educational institutes, institutes have widely
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024 workshop

点击查看摘要

Abstract:In recent years, workplaces and educational institutes have widely adopted virtual meeting platforms. This has led to a growing interest in analyzing and extracting insights from these meetings, which requires effective detection and tracking of unique individuals. In practice, there is no standardization in video meeting recording layouts, nor in how they are captured across the different platforms and services. This, in turn, creates a challenge in acquiring this data stream and analyzing it in a uniform fashion. Our approach provides a solution to the most general form of video recording, usually consisting of a grid of participants from a single video source with no metadata on participant locations, while imposing the fewest constraints and assumptions as to how the data was acquired. Conventional approaches often use YOLO models coupled with tracking algorithms, assuming linear motion trajectories akin to those observed in CCTV footage. However, such assumptions fall short in virtual meetings, where a participant's video feed window can abruptly change location across the grid. In an organic video meeting setting, participants frequently join and leave, leading to sudden, non-linear movements on the video grid. This disrupts optical flow-based tracking methods that depend on linear motion. Consequently, standard object detection and tracking methods might mistakenly assign multiple participants to the same tracker. In this paper, we introduce a novel approach to track and re-identify participants in remote video meetings, by utilizing the spatio-temporal priors arising from the data in our domain. This, in turn, increases tracking capabilities compared to the use of general object tracking. Our approach reduces the error rate by 95% on average compared to YOLO-based tracking methods as a baseline.

[CV-64] Template-based Multi-Domain Face Recognition ALT

链接: https://arxiv.org/abs/2409.09832
作者: Anirudh Nanduri,Rama Chellappa
关键词-EN: challenging non-visible domains, deep neural networks, comparatively still lacking, remarkable performance, deep neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IJCB 2024 - Special Session on Recognition at Long Range and from High Altitude

点击查看摘要

Abstract:Despite the remarkable performance of deep neural networks for face detection and recognition tasks in the visible spectrum, their performance on more challenging non-visible domains is comparatively still lacking. While significant research has been done in the fields of domain adaptation and domain generalization, in this paper we tackle scenarios in which these methods have limited applicability owing to the lack of training data from target domains. We focus on the problem of single-source (visible) and multi-target (SWIR, long-range/remote, surveillance, and body-worn) face recognition task. We show through experiments that a good template generation algorithm becomes crucial as the complexity of the target domain increases. In this context, we introduce a template generation algorithm called Norm Pooling (and a variant known as Sparse Pooling) and show that it outperforms average pooling across different domains and networks, on the IARPA JANUS Benchmark Multi-domain Face (IJB-MDF) dataset.
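
The abstract names Norm Pooling but does not give its formula. Below is a hedged sketch of one plausible reading, weighting per-image embeddings by their feature norms before aggregation, contrasted with the plain average-pooling baseline; the softmax weighting is an assumption, not the paper's definition.

```python
import torch

def average_pooling(embeddings: torch.Tensor) -> torch.Tensor:
    """Baseline template: plain mean over the per-image embeddings."""
    return embeddings.mean(dim=0)

def norm_pooling(embeddings: torch.Tensor) -> torch.Tensor:
    """One plausible reading of Norm Pooling (assumption, not the paper's formula):
    weight each embedding by a softmax over its L2 norm, so confident
    (high-norm) features dominate the identity template."""
    norms = embeddings.norm(dim=1)
    weights = torch.softmax(norms, dim=0)
    return (weights.unsqueeze(1) * embeddings).sum(dim=0)

template_imgs = torch.randn(12, 512)   # 12 face embeddings of one identity
print(average_pooling(template_imgs).shape, norm_pooling(template_imgs).shape)
```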

[CV-65] NARF24: Estimating Articulated Object Structure for Implicit Rendering ICRA

链接: https://arxiv.org/abs/2409.09829
作者: Stanley Lewis,Tom Gao,Odest Chadwicke Jenkins
关键词-EN: problem for robots, Neural Radiance Field, pose a difficult, difficult problem, common Neural Radiance
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: extended abstract as submitted to ICRA@40 anniversary conference

点击查看摘要

Abstract:Articulated objects and their representations pose a difficult problem for robots. These objects require not only representations of geometry and texture, but also of the various connections and joint parameters that make up each articulation. We propose a method that learns a common Neural Radiance Field (NeRF) representation across a small number of collected scenes. This representation is combined with a parts-based image segmentation to produce an implicit space part localization, from which the connectivity and joint parameters of the articulated object can be estimated, thus enabling configuration-conditioned rendering.

[CV-66] Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion ECCV2024

链接: https://arxiv.org/abs/2409.09808
作者: Hui Shen,Zhongwei Wan,Xin Wang,Mi Zhang
关键词-EN: Transformer architecture, introduces Fast Mamba, Vision Mamba, based on Transformer, Vim models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Camera ready version of ECCV 2024 The Fourth Workshop on Computational Aspects of Deep Learning

点击查看摘要

Abstract:Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to methods based on Transformer architecture. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique to enhance the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers based on a suite of cross-layer strategies, instead of simply applying token fusion uniformly across all layers as existing works propose. We evaluate the performance of Famba-V on CIFAR-100. Our results show that Famba-V is able to enhance the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. These results altogether demonstrate Famba-V as a promising efficiency enhancement technique for Vim models.
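
To make the token-fusion idea concrete, here is a greedy sketch that merges the most similar neighbouring tokens of one layer by cosine similarity; Famba-V's actual pairing scheme and cross-layer strategies are more involved and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def fuse_similar_neighbours(x: torch.Tensor, num_fuse: int) -> torch.Tensor:
    """Greedy token-fusion sketch: average the most similar neighbouring token
    pairs and drop the duplicates (simplified illustration only)."""
    x = x.clone()                                     # (num_tokens, dim)
    sims = F.cosine_similarity(x[:-1], x[1:], dim=1)  # similarity of neighbours
    keep = torch.ones(x.size(0), dtype=torch.bool)
    merged = set()
    for i in sims.topk(num_fuse).indices.tolist():
        if i in merged or (i + 1) in merged:
            continue                                  # skip overlapping pairs
        x[i] = 0.5 * (x[i] + x[i + 1])                # fuse the pair into one token
        keep[i + 1] = False
        merged.update((i, i + 1))
    return x[keep]

tokens = torch.randn(197, 192)   # e.g. one Vim layer's token sequence
print(fuse_similar_neighbours(tokens, num_fuse=16).shape)
```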

[CV-67] Abnormal Event Detection In Videos Using Deep Embedding

链接: https://arxiv.org/abs/2409.09804
作者: Darshan Venkatrayappa
关键词-EN: Abnormal event detection, anomaly detection, Abnormal event, video anomaly detection, anomaly detection requires
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Abnormal event detection or anomaly detection in surveillance videos is currently a challenge because of the diversity of possible events. Due to the lack of anomalous events at training time, anomaly detection requires the design of learning methods without supervision. In this work we propose an unsupervised approach for video anomaly detection with the aim to jointly optimize the objectives of the deep neural network and the anomaly detection task using a hybrid architecture. Initially, a convolutional autoencoder is pre-trained in an unsupervised manner with a fusion of depth, motion and appearance features. In the second step, we utilize the encoder part of the pre-trained autoencoder and extract the embeddings of the fused input. We then jointly train / fine-tune the encoder to map the embeddings to a hypercenter. Thus, embeddings of normal data fall near the hypercenter, whereas embeddings of anomalous data fall far away from the hypercenter.
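
The hypercenter objective described above follows the Deep SVDD idea: pull embeddings of normal data toward a fixed center so that anomalous embeddings end up far away. Below is a minimal PyTorch sketch of that objective; the center initialization and class names are assumptions, not the paper's exact training recipe.

```python
import torch
import torch.nn as nn

class HypercenterLoss(nn.Module):
    """Deep SVDD-style objective: mean squared distance of embeddings to a
    fixed center (sketch of the idea described above)."""
    def __init__(self, center: torch.Tensor):
        super().__init__()
        self.register_buffer("center", center)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return ((z - self.center) ** 2).sum(dim=1).mean()

# Usage: the center is typically the mean embedding of normal training clips
embeddings = torch.randn(32, 128)              # encoder outputs for normal clips
center = embeddings.mean(dim=0).detach()
criterion = HypercenterLoss(center)
print(criterion(embeddings))

# At test time, the distance to the center serves as the anomaly score
score = ((embeddings - center) ** 2).sum(dim=1)
```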

[CV-68] Multiple Rotation Averaging with Constrained Reweighting Deep Matrix Factorization

链接: https://arxiv.org/abs/2409.09790
作者: Shiqi Li,Jihua Zhu,Yifan Xie,Naiwen Hu,Mingchen Zhu,Zhongyu Li,Di Wang
关键词-EN: Multiple rotation averaging, rotation averaging plays, rotation averaging, robotics domains, plays a crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Multiple rotation averaging plays a crucial role in computer vision and robotics domains. The conventional optimization-based methods optimize a nonlinear cost function based on certain noise assumptions, while most previous learning-based methods require ground truth labels in the supervised training process. Recognizing the handcrafted noise assumption may not be reasonable in all real-world scenarios, this paper proposes an effective rotation averaging method for mining data patterns in a learning manner while avoiding the requirement of labels. Specifically, we apply deep matrix factorization to directly solve the multiple rotation averaging problem in unconstrained linear space. For deep matrix factorization, we design a neural network model, which is explicitly low-rank and symmetric to better suit the background of multiple rotation averaging. Meanwhile, we utilize a spanning tree-based edge filtering to suppress the influence of rotation outliers. What’s more, we also adopt a reweighting scheme and dynamic depth selection strategy to further improve the robustness. Our method synthesizes the merit of both optimization-based and learning-based methods. Experimental results on various datasets validate the effectiveness of our proposed method.

[CV-69] Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models

链接: https://arxiv.org/abs/2409.09788
作者: Yuan-Hong Liao,Rafid Mahmood,Sanja Fidler,David Acuna
关键词-EN: demonstrating vision-language models’, recent advances demonstrating, advances demonstrating vision-language, describe complex relationships, distances remains underexplored
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: 20 pages, 13 figures

点击查看摘要

Abstract:Despite recent advances demonstrating vision-language models’ (VLMs) abilities to describe complex relationships in images using natural language, their capability to quantitatively reason about object sizes and distances remains underexplored. In this work, we introduce a manually annotated benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning and systematically investigate the performance of state-of-the-art VLMs on this task. Our analysis reveals that reasoning about distances between objects is particularly challenging for SoTA VLMs; however, some VLMs significantly outperform others, with an over 40-point gap between the two best performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object emerges naturally in the response. Inspired by this observation, we develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues. By instructing VLMs to use reference objects in their reasoning paths via SpatialPrompt, Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4V improve their success rates by over 40, 20, and 30 points, respectively. We emphasize that these significant improvements are obtained without needing more data, model architectural modifications, or fine-tuning.
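
The abstract describes SpatialPrompt only at a high level. The sketch below illustrates the kind of reference-object instruction it adds before the question; the exact wording is an assumption, not the prompt used in the paper.

```python
def spatial_prompt(question: str) -> str:
    """Wrap a quantitative spatial question with a reference-object instruction.

    The instruction text below is illustrative only; the actual SpatialPrompt
    wording is not reproduced in the abstract.
    """
    instruction = (
        "Use a reference object of known size in the image (for example a "
        "door, a car, or a person) as a visual measuring stick. First state "
        "the reference object and its typical size, then reason step by step "
        "to the requested distance or size."
    )
    return f"{instruction}\n\nQuestion: {question}"

print(spatial_prompt("How far is the bicycle from the bench, in meters?"))
```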

[CV-70] Enhancing Lesion Segmentation in PET/CT Imaging with Deep Learning and Advanced Data Preprocessing Techniques

链接: https://arxiv.org/abs/2409.09784
作者: Jiayi Liu,Qiaoyi Xue,Youdan Feng,Tianming Xu,Kaixin Shen,Chuyun Shen,Yuhang Shi
关键词-EN: escalating global cancer, global cancer burden, cancer burden underscores, precise diagnostic tools, tools in oncology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The escalating global cancer burden underscores the critical need for precise diagnostic tools in oncology. This research employs deep learning to enhance lesion segmentation in PET/CT imaging, utilizing a dataset of 900 whole-body FDG-PET/CT and 600 PSMA-PET/CT studies from the AutoPET challenge III. Our methodical approach includes robust preprocessing and data augmentation techniques to ensure model robustness and generalizability. We investigate the influence of non-zero normalization and modifications to the data augmentation pipeline, such as the introduction of RandGaussianSharpen and adjustments to the Gamma transform parameter. This study aims to contribute to the standardization of preprocessing and augmentation strategies in PET/CT imaging, potentially improving the diagnostic accuracy and the personalized management of cancer patients. Our code will be open-sourced and available at this https URL.
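
The random Gaussian sharpening and gamma-based contrast adjustments mentioned above are available in MONAI as dictionary transforms. Below is an illustrative pipeline sketch; the probabilities, parameter ranges, and key names are placeholders, not the study's actual configuration.

```python
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd,
    NormalizeIntensityd, RandGaussianSharpend, RandAdjustContrastd,
)

# Illustrative augmentation pipeline (placeholder parameters and keys).
train_transforms = Compose([
    LoadImaged(keys=["pet", "ct", "label"]),
    EnsureChannelFirstd(keys=["pet", "ct", "label"]),
    # "non-zero normalization": statistics computed over non-zero voxels only
    NormalizeIntensityd(keys=["pet", "ct"], nonzero=True),
    # random sharpening, as mentioned in the abstract
    RandGaussianSharpend(keys=["pet", "ct"], prob=0.2),
    # gamma-based contrast jitter (the "Gamma transform parameter")
    RandAdjustContrastd(keys=["pet", "ct"], prob=0.2, gamma=(0.7, 1.5)),
])

# train_transforms({"pet": ..., "ct": ..., "label": ...}) is then applied
# to each sample dictionary during training.
```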

[CV-71] Underwater Image Enhancement via Dehazing and Color Restoration

链接: https://arxiv.org/abs/2409.09779
作者: Chengqin Wu,Shuai Yu,Qingson Hu,Jingxiang Xu,Lijun Zhang
关键词-EN: marine engineering projects, marine resource extraction, marine engineering, marine resource, Color Restoration Block
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:With the rapid development of marine engineering projects such as marine resource extraction and oceanic surveys, underwater visual imaging and analysis has become a critical technology. Unfortunately, due to the inevitable non-linear attenuation of light in underwater environments, underwater images and videos often suffer from low contrast, blurriness, and color degradation, which significantly complicate the subsequent research. Existing underwater image enhancement methods often treat the haze and color cast as a unified degradation process and disregard their independence and interdependence, which limits the performance improvement. Here, we propose a Vision Transformer (ViT)-based network (referred to as WaterFormer) to improve the underwater image quality. WaterFormer contains three major components: a dehazing block (DehazeFormer Block) to capture the self-correlated haze features and extract deep-level features, a Color Restoration Block (CRB) to capture self-correlated color cast features, and a Channel Fusion Block (CFB) to capture fusion features within the network. To ensure authenticity, a soft reconstruction layer based on the underwater imaging physics model is included. To improve the quality of the enhanced images, we introduce the Chromatic Consistency Loss and Sobel Color Loss to train the network. Comprehensive experimental results demonstrate that WaterFormer outperforms other state-of-the-art methods in enhancing underwater images.

[CV-72] DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Autonomous Driving

链接: https://arxiv.org/abs/2409.09777
作者: Haisheng Su,Wei Wu,Junchi Yan
关键词-EN: unifying modular designs, autonomous driving methods, driving methods resort, methods resort, resort to unifying
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Current end-to-end autonomous driving methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized in a planning-oriented spirit with a fully differentiable framework, existing end-to-end driving systems without ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, owing to the rasterized scene representation learning and redundant information transmission. In this paper, we revisit the human driving behavior and propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner. The sparse perception module performs detection, tracking and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. Besides, both position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thus facilitating the training stability and convergence of the whole framework. Extensive experiments conducted on the nuScenes dataset demonstrate the superior planning performance and great efficiency of DiFSD, which significantly reduces the average L2 error by 66% and the collision rate by 77% compared to UniAD, while achieving 8.2× faster running efficiency.

[CV-73] Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through f-divergence Minimization

链接: https://arxiv.org/abs/2409.09774
作者: Haoyuan Sun,Bo Xia,Yongzhe Chang,Xueqian Wang
关键词-EN: Direct Preference Optimization, Direct Preference, Preference Optimization, generated considerable interest, large language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 32 pages

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has recently expanded its successful application from aligning large language models (LLMs) to aligning text-to-image models with human preferences, which has generated considerable interest within the community. However, we have observed that these approaches rely solely on minimizing the reverse Kullback-Leibler divergence during the alignment process between the fine-tuned model and the reference model, neglecting the incorporation of other divergence constraints. In this study, we focus on extending reverse Kullback-Leibler divergence in the alignment paradigm of text-to-image models to f-divergence, which aims to garner better alignment performance as well as good generation diversity. We provide the generalized formula of the alignment paradigm under the f-divergence condition and thoroughly analyze the impact of different divergence constraints on the alignment process from the perspective of gradient fields. We conduct comprehensive evaluation on image-text alignment performance, human value alignment performance and generation diversity performance under different divergence constraints, and the results indicate that alignment based on Jensen-Shannon divergence achieves the best trade-off among them. The choice of divergence employed for aligning text-to-image models significantly impacts the trade-off between alignment performance (especially human value alignment) and generation diversity, which highlights the necessity of selecting an appropriate divergence for practical applications.
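
For reference, the f-divergence family the abstract refers to is defined as below; these are standard definitions (not taken from the paper), with forward KL, reverse KL, and Jensen-Shannon recovered by particular convex generators f.

```latex
% Standard f-divergence definition:
\[
  D_f(P \,\|\, Q) \;=\; \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) \mathrm{d}x,
  \qquad f \text{ convex},\; f(1) = 0.
\]
% Special cases recovered by particular generators f:
%   f(u) = u \log u                                           -> forward KL,  D_{KL}(P \| Q)
%   f(u) = -\log u                                            -> reverse KL,  D_{KL}(Q \| P)
%   f(u) = \tfrac{u}{2}\log\tfrac{2u}{u+1} + \tfrac12\log\tfrac{2}{u+1}
%                                                             -> Jensen-Shannon divergence
```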

[CV-74] Automated Lesion Segmentation in Whole-Body PET/CT in a multitracer setting

链接: https://arxiv.org/abs/2409.09766
作者: Qiaoyi Xue,Youdan Feng,Jiayi Liu,Tianming Xu,Kaixin Shen,Chuyun Shen,Yuhang Shi
关键词-EN: FDG and PSMA, PSMA PET, FDG, PSMA, PSMA images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study explores a workflow for automated segmentation of lesions in FDG and PSMA PET/CT images. Due to the substantial differences in image characteristics between FDG and PSMA, specialized preprocessing steps are required. Utilizing YOLOv8 for data classification, the FDG and PSMA images are preprocessed separately before feeding them into the segmentation models, aiming to improve lesion segmentation accuracy. The study focuses on evaluating the performance of automated segmentation workflow for multitracer PET images. The findings are expected to provide critical insights for enhancing diagnostic workflows and patient-specific treatment plans. Our code will be open-sourced and available at this https URL.

[CV-75] MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation ECCV2024

链接: https://arxiv.org/abs/2409.09756
作者: Shuzhao Xie,Weixiang Zhang,Chen Tang,Yunpeng Bai,Rongwei Lu,Shijia Ge,Zhi Wang
关键词-EN: Splatting demonstrates excellent, Gaussian Splatting demonstrates, Gaussian Splatting, view synthesis, Splatting demonstrates
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 8 figures, ECCV 2024

点击查看摘要

Abstract:3D Gaussian Splatting demonstrates excellent quality and speed in novel view synthesis. Nevertheless, the huge file size of the 3D Gaussians presents challenges for transmission and storage. Current works design compact models to replace the substantial volume and attributes of 3D Gaussians, along with intensive training to distill information. These endeavors demand considerable training time, presenting formidable hurdles for practical deployment. To this end, we propose MesonGS, a codec for post-training compression of 3D Gaussians. Initially, we introduce a measurement criterion that considers both view-dependent and view-independent factors to assess the impact of each Gaussian point on the rendering output, enabling the removal of insignificant points. Subsequently, we decrease the entropy of attributes through two transformations that complement subsequent entropy coding techniques to enhance the file compression rate. More specifically, we first replace rotation quaternions with Euler angles; then, we apply region adaptive hierarchical transform to key attributes to reduce entropy. Lastly, we adopt finer-grained quantization to avoid excessive information loss. Moreover, a well-crafted finetune scheme is devised to restore quality. Extensive experiments demonstrate that MesonGS significantly reduces the size of 3D Gaussians while preserving competitive quality.

[CV-76] Towards Single-Lens Controllable Depth-of-Field Imaging via All-in-Focus Aberration Correction and Monocular Depth Estimation

链接: https://arxiv.org/abs/2409.09754
作者: Xiaolong Qian,Qi Jiang,Yao Gao,Shaohua Gao,Zhonghua Yi,Lei Sun,Kai Wei,Haifeng Li,Kailun Yang,Kaiwei Wang,Jian Bai
关键词-EN: controllable DoF imaging, amazing visual effects, single-lens controllable DoF, Minimalist Optical Systems, controllable DoF
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Optics (physics.optics)
*备注: The source code and the established dataset will be publicly available at this https URL

点击查看摘要

Abstract:Controllable Depth-of-Field (DoF) imaging commonly produces amazing visual effects based on heavy and expensive high-end lenses. However, confronted with the increasing demand for mobile scenarios, it is desirable to achieve a lightweight solution with Minimalist Optical Systems (MOS). This work centers around two major limitations of MOS, i.e., the severe optical aberrations and uncontrollable DoF, for achieving single-lens controllable DoF imaging via computational methods. A Depth-aware Controllable DoF Imaging (DCDI) framework is proposed equipped with All-in-Focus (AiF) aberration correction and monocular depth estimation, where the recovered image and corresponding depth map are utilized to produce imaging results under diverse DoFs of any high-end lens via patch-wise convolution. To address the depth-varying optical degradation, we introduce a Depth-aware Degradation-adaptive Training (DA2T) scheme. At the dataset level, a Depth-aware Aberration MOS (DAMOS) dataset is established based on the simulation of Point Spread Functions (PSFs) under different object distances. Additionally, we design two plug-and-play depth-aware mechanisms to embed depth information into the aberration image recovery for better tackling depth-aware degradation. Furthermore, we propose a storage-efficient Omni-Lens-Field model to represent the 4D PSF library of various lenses. With the predicted depth map, recovered image, and depth-aware PSF map inferred by Omni-Lens-Field, single-lens controllable DoF imaging is achieved. Comprehensive experimental results demonstrate that the proposed framework enhances the recovery performance, and attains impressive single-lens controllable DoF imaging results, providing a seminal baseline for this field. The source code and the established dataset will be publicly available at this https URL.

[CV-77] DARDA: Domain-Aware Real-Time Dynamic Neural Network Adaptation

链接: https://arxiv.org/abs/2409.09753
作者: Shahriar Rifat,Jonathan Ashdown,Francesco Restuccia
关键词-EN: Deep Neural Networks, Test Time Adaptation, Test Time, Neural Networks, Deep Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Test Time Adaptation (TTA) has emerged as a practical solution to mitigate the performance degradation of Deep Neural Networks (DNNs) in the presence of corruption/ noise affecting inputs. Existing approaches in TTA continuously adapt the DNN, leading to excessive resource consumption and performance degradation due to accumulation of error stemming from lack of supervision. In this work, we propose Domain-Aware Real-Time Dynamic Adaptation (DARDA) to address such issues. Our key approach is to proactively learn latent representations of some corruption types, each one associated with a sub-network state tailored to correctly classify inputs affected by that corruption. After deployment, DARDA adapts the DNN to previously unseen corruptions in an unsupervised fashion by (i) estimating the latent representation of the ongoing corruption; (ii) selecting the sub-network whose associated corruption is the closest in the latent space to the ongoing corruption; and (iii) adapting DNN state, so that its representation matches the ongoing corruption. This way, DARDA is more resource efficient and can swiftly adapt to new distributions caused by different corruptions without requiring a large variety of input data. Through experiments with two popular mobile edge devices - Raspberry Pi and NVIDIA Jetson Nano - we show that DARDA reduces energy consumption and average cache memory footprint respectively by 1.74x and 2.64x with respect to the state of the art, while increasing the performance by 10.4%, 5.7% and 4.4% on CIFAR-10, CIFAR-100 and TinyImagenet.
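
The corruption-matching step (ii) above can be sketched as a nearest-signature lookup: embed the incoming batch, compare it to the stored corruption signatures, and pick the matching sub-network state. The names and the cosine-similarity choice below are illustrative, not DARDA's actual implementation.

```python
import torch
import torch.nn.functional as F
from typing import Dict

def select_subnetwork(batch_embedding: torch.Tensor,
                      corruption_signatures: Dict[str, torch.Tensor]) -> str:
    """Pick the pre-learned corruption whose latent signature is closest
    (by cosine similarity) to the current input distribution.
    Illustrative sketch of step (ii) in the abstract."""
    best_name, best_sim = None, -1.0
    for name, sig in corruption_signatures.items():
        sim = F.cosine_similarity(batch_embedding, sig, dim=0).item()
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

signatures = {c: torch.randn(64) for c in ["gaussian_noise", "fog", "motion_blur"]}
current = torch.randn(64)   # latent representation of the incoming (corrupted) batch
print(select_subnetwork(current, signatures))
```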

[CV-78] Explore the Hallucination on Low-level Perception for MLLMs

链接: https://arxiv.org/abs/2409.09748
作者: Yinan Sun,Zicheng Zhang,Haoning Wu,Xiaohong Liu,Weisi Lin,Guangtao Zhai,Xiongkuo Min
关键词-EN: Multi-modality Large Language, Large Language Models, Multi-modality Large, Large Language, development of Multi-modality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid development of Multi-modality Large Language Models (MLLMs) has significantly influenced various aspects of industry and daily life, showcasing impressive capabilities in visual perception and understanding. However, these models also exhibit hallucinations, which limit their reliability as AI systems, especially in tasks involving low-level visual perception and understanding. We believe that hallucinations stem from a lack of explicit self-awareness in these models, which directly impacts their overall performance. In this paper, we aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. To this end, we present QL-Bench, a benchmark setting to simulate human responses to low-level vision, investigating self-awareness in low-level visual perception through visual question answering related to low-level attributes such as clarity and lighting. Specifically, we construct the LLSAVisionQA dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features. Through the evaluation of 15 MLLMs, we demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped. Notably, for the same model, simpler questions are often answered more accurately than complex ones. However, self-awareness appears to improve when addressing more challenging questions. We hope that our benchmark will motivate further research, particularly focused on enhancing the self-awareness of MLLMs in tasks involving low-level visual perception and understanding.

[CV-79] VGG-Tex: A Vivid Geometry-Guided Facial Texture Estimation Model for High Fidelity Monocular 3D Face Reconstruction

链接: https://arxiv.org/abs/2409.09740
作者: Haoyu Wu,Ziqiao Peng,Xukun Zhou,Yunfei Cheng,Jun He,Hongyan Liu,Zhaoxin Fan
关键词-EN: augmented reality, face reconstruction, High Fidelity Monocular, images has promoted, promoted the development
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D face reconstruction from monocular images has promoted the development of various applications such as augmented reality. Though existing methods have made remarkable progress, most of them emphasize geometric reconstruction, while overlooking the importance of texture prediction. To address this issue, we propose VGG-Tex, a novel Vivid Geometry-Guided Facial Texture Estimation model designed for High Fidelity Monocular 3D Face Reconstruction. The core of this approach is leveraging 3D parametric priors to enhance the outcomes of 2D UV texture estimation. Specifically, VGG-Tex includes a Facial Attributes Encoding Module, a Geometry-Guided Texture Generator, and a Visibility-Enhanced Texture Completion Module. These components are responsible for extracting parametric priors, generating initial textures, and refining texture details, respectively. Based on the geometry-texture complementarity principle, VGG-Tex also introduces a Texture-guided Geometry Refinement Module to further balance the overall fidelity of the reconstructed 3D faces, along with corresponding losses. Comprehensive experiments demonstrate that our method significantly improves texture reconstruction performance compared to existing state-of-the-art methods.

[CV-80] Precise Pick-and-Place using Score-Based Diffusion Networks

链接: https://arxiv.org/abs/2409.09725
作者: Shih-Wei Guo,Tsu-Ching Hsiao,Yu-Lun Liu,Chun-Yi Lee
关键词-EN: operations within robotic, pose diffusion method, diffusion method, robotic manipulation tasks, accurate perception
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 7 figures. Project webpage: this https URL

点击查看摘要

Abstract:In this paper, we propose a novel coarse-to-fine continuous pose diffusion method to enhance the precision of pick-and-place operations within robotic manipulation tasks. Leveraging the capabilities of diffusion networks, we facilitate the accurate perception of object poses. This accurate perception enhances both pick-and-place success rates and overall manipulation precision. Our methodology utilizes a top-down RGB image projected from an RGB-D camera and adopts a coarse-to-fine architecture. This architecture enables efficient learning of coarse and fine models. A distinguishing feature of our approach is its focus on continuous pose estimation, which enables more precise object manipulation, particularly concerning rotational angles. In addition, we employ pose and color augmentation techniques to enable effective training with limited data. Through extensive experiments in simulated and real-world scenarios, as well as an ablation study, we comprehensively evaluate our proposed methodology. Taken together, the findings validate its effectiveness in achieving high-precision pick-and-place tasks.

[CV-81] MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection

链接: https://arxiv.org/abs/2409.09724
作者: Yaning Zhang,Tianyi Wang,Zitong Yu,Zan Gao,Linlin Shen,Shengyong Chen
关键词-EN: raised significant concerns, photo-realistic face generation, face forgery detection, society and academia, highlighting the urgent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored, which limits the generalization capability of the model. In addition, most FFD methods tend to identify facial images generated by GAN, but struggle to detect unseen diffusion-synthesized ones. To address the limitations, we aim to leverage the cutting-edge foundation model, contrastive language-image pre-training (CLIP), to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities via language-guided face forgery representation learning, to facilitate the advancement of DFFD. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build an innovative plug-and-play sample pair attention (SPA) method to emphasize relevant negative pairs and suppress irrelevant ones, allowing cross-modality sample pairs to conduct more flexible alignment. Extensive experiments and visualizations show that our model outperforms the state of the arts on different settings like cross-generator, cross-forgery, and cross-dataset evaluations.

[CV-82] Finetuning CLIP to Reason about Pairwise Differences

链接: https://arxiv.org/abs/2409.09721
作者: Dylan Sam,Devin Willmott,Joao D. Semedo,J. Zico Kolter
关键词-EN: embedding space, Vision-language models, resulting embedding space, CLIP, embedding
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of their purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy analogies in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that the differences in image embedding space correspond to text descriptions of the image differences, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zeroshot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.
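
A sketch of the core training signal described above: align differences of image embeddings with embeddings of the text describing that difference, via a generic InfoNCE-style contrastive loss over the batch. The normalization, temperature, and symmetric cross-entropy are standard assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def pairwise_difference_loss(img_a, img_b, diff_text, temperature: float = 0.07):
    """Contrastive sketch: the normalized difference between two image
    embeddings should match the embedding of the text describing their
    difference (generic InfoNCE, not the paper's exact loss)."""
    diff = F.normalize(img_b - img_a, dim=-1)    # (B, D) image-pair differences
    text = F.normalize(diff_text, dim=-1)        # (B, D) difference captions
    logits = diff @ text.t() / temperature       # (B, B) similarity matrix
    labels = torch.arange(diff.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

B, D = 16, 512
loss = pairwise_difference_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss)
```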

[CV-83] Disentangling Visual Priors: Unsupervised Learning of Scene Interpretations with Compositional Autoencoder

链接: https://arxiv.org/abs/2409.09716
作者: Krzysztof Krawiec,Antoni Nowinowski
关键词-EN: Contemporary deep learning, fundamental visual concepts, handling fundamental visual, deep learning architectures, learning architectures lack
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Contemporary deep learning architectures lack principled means for capturing and handling fundamental visual concepts, like objects, shapes, geometric transforms, and other higher-level structures. We propose a neurosymbolic architecture that uses a domain-specific language to capture selected priors of image formation, including object shape, appearance, categorization, and geometric transforms. We express template programs in that language and learn their parameterization with features extracted from the scene by a convolutional neural network. When executed, the parameterized program produces geometric primitives which are rendered and assessed for correspondence with the scene content and trained via auto-association with gradient. We confront our approach with a baseline method on a synthetic benchmark and demonstrate its capacity to disentangle selected aspects of the image formation process, learn from small data, perform correct inference in the presence of noise, and generalize out of sample.

[CV-84] Pre-Training for 3D Hand Pose Estimation with Contrastive Learning on Large-Scale Hand Images in the Wild ECCV24

链接: https://arxiv.org/abs/2409.09714
作者: Nie Lin,Takehiko Ohkawa,Mingfang Zhang,Yifei Huang,Ryosuke Furuta,Yoichi Sato
关键词-EN: hand pose estimators, learning framework based, dubbed HandCLR, hand images, hand images tailored
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: HANDS@ECCV24 (Extended Abstracts)

点击查看摘要

Abstract:We present a contrastive learning framework based on in-the-wild hand images tailored for pre-training 3D hand pose estimators, dubbed HandCLR. Pre-training on large-scale images achieves promising results in various tasks, but prior 3D hand pose pre-training methods have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our method with contrastive learning. Specifically, we collected over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of similar hand poses originating from different samples. We propose a novel contrastive learning method that embeds similar hand pairs closer in the latent space. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method in various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands.

[CV-85] ELSA: Exploiting Layer-wise N:M Sparsity for Vision Transformer Acceleration

链接: https://arxiv.org/abs/2409.09708
作者: Ning-Chi Huang,Chi-Chih Chang,Wei-Cheng Lin,Endri Taka,Diana Marculescu,Kai-Chiang Wu
关键词-EN: deep neural networks, sparse matrix multiplication, emerging model compression, neural networks, matrix multiplication
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:N:M sparsity is an emerging model compression method supported by more and more accelerators to speed up sparse matrix multiplication in deep neural networks. Most existing N:M sparsity methods compress neural networks with a uniform setting for all layers in a network or heuristically determine the layer-wise configuration by considering the number of parameters in each layer. However, very few methods have been designed for obtaining a layer-wise customized N:M sparse configuration for vision transformers (ViTs), which usually consist of transformer blocks involving the same number of parameters. In this work, to address the challenge of selecting suitable sparse configuration for ViTs on N:M sparsity-supporting accelerators, we propose ELSA, Exploiting Layer-wise N:M Sparsity for ViTs. Considering not only all N:M sparsity levels supported by a given accelerator but also the expected throughput improvement, our methodology can reap the benefits of accelerators supporting mixed sparsity by trading off negligible accuracy loss with both memory usage and inference time reduction for ViT models. For instance, our approach achieves a noteworthy 2.9× reduction in FLOPs for both Swin-B and DeiT-B with only a marginal degradation of accuracy on ImageNet. Our code will be released upon paper acceptance.
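
N:M sparsity keeps at most N nonzero weights in every group of M consecutive weights (e.g. 2:4). Below is a one-shot magnitude-pruning sketch that applies a 2:4 mask to a weight matrix; ELSA's layer-wise configuration search and accelerator-aware throughput modeling are not reproduced here.

```python
import torch

def nm_sparsify(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights along the input dimension (e.g. 2:4 sparsity). Simple magnitude
    pruning sketch; not ELSA's configuration-search procedure."""
    out_f, in_f = weight.shape
    assert in_f % m == 0, "input dim must be divisible by the group size m"
    groups = weight.reshape(out_f, in_f // m, m)
    # indices of the (m - n) smallest-magnitude weights in each group
    _, drop = groups.abs().topk(m - n, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)
    return (groups * mask).reshape(out_f, in_f)

w = torch.randn(8, 16)
sw = nm_sparsify(w, n=2, m=4)
print((sw != 0).float().mean())  # ~0.5 for 2:4 sparsity
```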

[CV-86] Synergistic Spotting and Recognition of Micro-Expression via Temporal State Transition

链接: https://arxiv.org/abs/2409.09707
作者: Bochao Zou,Zizheng Guo,Wenfeng Qin,Xin Li,Kangsheng Wang,Huimin Ma
关键词-EN: conveying subtle cues, substantial real-world applications, involuntary facial movements, consciously controlled, conveying subtle
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Micro-expressions are involuntary facial movements that cannot be consciously controlled, conveying subtle cues with substantial real-world applications. The analysis of micro-expressions generally involves two main tasks: spotting micro-expression intervals in long videos and recognizing the emotions associated with these intervals. Previous deep learning methods have primarily relied on classification networks utilizing sliding windows. However, fixed window sizes and window-level hard classification introduce numerous constraints. Additionally, these methods have not fully exploited the potential of complementary pathways for spotting and recognition. In this paper, we present a novel temporal state transition architecture grounded in the state space model, which replaces conventional window-level classification with video-level regression. Furthermore, by leveraging the inherent connections between spotting and recognition tasks, we propose a synergistic strategy that enhances overall analysis performance. Extensive experiments demonstrate that our method achieves state-of-the-art performance. The codes and pre-trained models are available at this https URL.

[CV-87] E-Commerce Inpainting with Mask Guidance in Controlnet for Reducing Overcompletion

链接: https://arxiv.org/abs/2409.09681
作者: Guandong Li
关键词-EN: E-commerce image generation, E-commerce image, e-commerce field, E-commerce, product
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:E-commerce image generation has always been one of the core demands in the e-commerce field. The goal is to restore the missing background that matches the main product given. In the post-AIGC era, diffusion models are primarily used to generate product images, achieving impressive results. This paper systematically analyzes and addresses a core pain point in diffusion model generation: overcompletion, which refers to the difficulty in maintaining product features. We propose two solutions: 1. Using an instance mask fine-tuned inpainting model to mitigate this phenomenon; 2. Adopting a train-free mask guidance approach, which incorporates refined product masks as constraints when combining ControlNet and UNet to generate the main product, thereby avoiding overcompletion of the product. Our method has achieved promising results in practical applications and we hope it can serve as an inspiring technical report in this field.

[CV-88] A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities

链接: https://arxiv.org/abs/2409.09678
作者: Jungpil Shin,Najmul Hassan,Abu Saleh Musa Miah,Satoshi Nishimura
关键词-EN: Human Activity Recognition, understand human behaviour, attracting significant attention, computer vision due, understand human
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) systems aim to understand human behaviour and assign a label to each action, attracting significant attention in computer vision due to their wide range of applications. HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals. Each modality provides unique and complementary information suited to different application scenarios. Consequently, numerous studies have investigated diverse approaches for HAR using these modalities. This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2024, focusing on machine learning (ML) and deep learning (DL) approaches categorized by input data modalities. We review both single-modality and multi-modality techniques, highlighting fusion-based and co-learning frameworks. Additionally, we cover advancements in hand-crafted action features, methods for recognizing human-object interactions, and activity detection. Our survey includes a detailed dataset description for each modality and a summary of the latest HAR systems, offering comparative results on benchmark datasets. Finally, we provide insightful observations and propose effective future research directions in HAR.

[CV-89] SITSMamba for Crop Classification based on Satellite Image Time Series

链接: https://arxiv.org/abs/2409.09673
作者: Xiaolei Qin,Xin Su,Liangpei Zhang
关键词-EN: Satellite image time, image time series, time series, Time Series Mamba, SITS
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Satellite image time series (SITS) data provides continuous observations over time, allowing for the tracking of vegetation changes and growth patterns throughout the seasons and years. Numerous deep learning (DL) approaches using SITS for crop classification have emerged recently, with the latest approaches adopting Transformer for SITS classification. However, the quadratic complexity of self-attention in Transformer poses challenges for classifying long time series. While the cutting-edge Mamba architecture has demonstrated strength in various domains, including remote sensing image interpretation, its capacity to learn temporal representations in SITS data remains unexplored. Moreover, the existing SITS classification methods often depend solely on crop labels as supervision signals, which fails to fully exploit the temporal information. In this paper, we propose a Satellite Image Time Series Mamba (SITSMamba) method for crop classification based on remote sensing time series data. The proposed SITSMamba contains a spatial encoder based on Convolutional Neural Networks (CNN) and a Mamba-based temporal encoder. To exploit richer temporal information from SITS, we design two decoder branches used for different tasks. The first branch is a crop Classification Branch (CBranch), which includes a ConvBlock to decode the feature to a crop map. The second branch is a SITS Reconstruction Branch (RBranch) that uses a Linear layer to transform the encoded feature to predict the original input values. Furthermore, we design a Positional Weight (PW) applied to the RBranch to help the model learn rich latent knowledge from SITS. We also design two weighting factors to control the balance of the two branches during training. The code of SITSMamba is available at: this https URL.

[CV-90] Unsupervised Hyperspectral and Multispectral Image Blind Fusion Based on Deep Tucker Decomposition Network with Spatial-Spectral Manifold Learning

链接: https://arxiv.org/abs/2409.09670
作者: He Wang,Yang Xu,Zebin Wu,Zhihui Wei
关键词-EN: generate high spectral, resolution hyperspectral images, low-resolution hyperspectral images, high-resolution multispectral images, image fusion aims
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted by TNNLS 2024

点击查看摘要

Abstract:Hyperspectral and multispectral image fusion aims to generate high spectral and spatial resolution hyperspectral images (HR-HSI) by fusing high-resolution multispectral images (HR-MSI) and low-resolution hyperspectral images (LR-HSI). However, existing fusion methods encounter challenges such as unknown degradation parameters and incomplete exploitation of the correlation between high-dimensional structures and deep image features. To overcome these issues, in this article, an unsupervised blind fusion method for hyperspectral and multispectral images based on Tucker decomposition and spatial spectral manifold learning (DTDNML) is proposed. We design a novel deep Tucker decomposition network that maps LR-HSI and HR-MSI into a consistent feature space, achieving reconstruction through decoders with shared parameters. To better exploit and fuse spatial-spectral features in the data, we design a core tensor fusion network that incorporates a spatial spectral attention mechanism for aligning and fusing features at different scales. Furthermore, to enhance the capacity in capturing global information, a Laplacian-based spatial-spectral manifold constraint is introduced in the shared decoders. Sufficient experiments have validated that this method enhances the accuracy and efficiency of hyperspectral and multispectral fusion on different remote sensing datasets. The source code is available at this https URL.

[CV-91] EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

链接: https://arxiv.org/abs/2409.09668
作者: Yupeng Chen,Penglin Chen,Xiaoyu Zhang,Yixian Huang,Qian Xie
关键词-EN: advanced AI-generated content, significantly advanced AI-generated, AI-generated content, video editing, significantly advanced
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rapid development of diffusion models has significantly advanced AI-generated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models’ performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models’ effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model’s strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.

[CV-92] SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks

链接: https://arxiv.org/abs/2409.09649
作者: Meng Lou,Yunxiang Fu,Yizhou Yu
关键词-EN: dynamic state space, capturing long-range dependencies, Mamba has shown, NLP tasks, state space models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code will be publicly available at: this https URL

点击查看摘要

Abstract:Due to the capability of dynamic state space models (SSMs) in capturing long-range dependencies with near-linear computational complexity, Mamba has shown notable performance in NLP tasks. This has inspired the rapid development of Mamba-based vision models, resulting in promising results in visual recognition tasks. However, such models are not capable of distilling features across layers through feature aggregation, interaction, and selection. Moreover, existing cross-layer feature aggregation methods designed for CNNs or ViTs are not practical in Mamba-based models due to high computational costs. Therefore, this paper aims to introduce an efficient cross-layer feature aggregation mechanism for Mamba-based vision backbone networks. Inspired by the Retinal Ganglion Cells (RGCs) in the human visual system, we propose a new sparse cross-layer connection mechanism termed SparX to effectively improve cross-layer feature interaction and reuse. Specifically, we build two different types of network layers: ganglion layers and normal layers. The former has higher connectivity and complexity, enabling multi-layer feature aggregation and interaction in an input-dependent manner. In contrast, the latter has lower connectivity and complexity. By interleaving these two types of layers, we design a new vision backbone network with sparsely cross-connected layers, achieving an excellent trade-off among model size, computational cost, memory cost, and accuracy in comparison to its counterparts. For instance, with fewer parameters, SparX-Mamba-T improves the top-1 accuracy of VMamba-T from 82.5% to 83.5%, while SparX-Swin-T achieves a 1.3% increase in top-1 accuracy compared to Swin-T. Extensive experimental results demonstrate that our new connection mechanism possesses both superior performance and generalization capabilities on various vision tasks.

[CV-93] A Novel Framework For Text Detection From Natural Scene Images With Complex Background

链接: https://arxiv.org/abs/2409.09635
作者: Basavaraj Kaladagi,Jagadeesh Pujari
关键词-EN: Recognizing texts, hard problem, varied and complicated, Wavelet Transforms, Recognizing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recognizing text from camera images is a known hard problem because of the difficulty of detecting text against varied and complicated backgrounds. In this paper we propose a novel and efficient method to detect text regions in images with complex backgrounds using Wavelet Transforms. The framework applies a Wavelet Transform to the original image in its grayscale form, followed by sub-band filtering. A region clustering technique is then applied using the centroids of the regions, and a bounding box is fitted to each region, thus identifying the text regions. This method is more sophisticated and efficient than previous methods, as it is not tied to a particular font size and is therefore more general. The sample set used for experimental purposes consists of 50 images with varying backgrounds. Images with edge prominence are considered. Furthermore, our method can be easily customized for applications with different scopes.
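
The first stages of the described pipeline (grayscale image, 2D wavelet transform, sub-band filtering) map naturally onto PyWavelets. The sketch below is only a rough approximation under assumed choices (Haar wavelet, a simple global threshold); the region clustering and bounding-box fitting steps from the paper are omitted.

```python
import numpy as np
import pywt

def highfreq_text_energy(gray: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Single-level 2D DWT; combine the detail sub-bands (LH, HL, HH), where
    text strokes typically respond strongly, into one candidate-text mask."""
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(np.float32), wavelet)
    energy = np.sqrt(cH ** 2 + cV ** 2 + cD ** 2)
    # simple global threshold (assumed); candidate pixels would then be
    # clustered and fitted with bounding boxes as described in the paper
    return energy > energy.mean() + 2 * energy.std()

mask = highfreq_text_energy(np.random.rand(256, 256) * 255)
```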

[CV-94] Can Large Language Models Grasp Event Signals? Exploring Pure Zero-Shot Event-based Recognition

链接: https://arxiv.org/abs/2409.09628
作者: Zongyou Yu,Qiang Qu,Xiaoming Chen,Chen Wang
关键词-EN: Recent advancements, demonstrated promising results, demonstrated promising, event-based, Recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in event-based zero-shot object recognition have demonstrated promising results. However, these methods heavily depend on extensive training and are inherently constrained by the characteristics of CLIP. To the best of our knowledge, this research is the first study to explore the understanding capabilities of large language models (LLMs) for event-based visual content. We demonstrate that LLMs can achieve event-based object recognition without additional training or fine-tuning in conjunction with CLIP, effectively enabling pure zero-shot event-based recognition. Particularly, we evaluate the ability of GPT-4o / 4turbo and two other open-source LLMs to directly recognize event-based visual content. Extensive experiments are conducted across three benchmark datasets, systematically assessing the recognition accuracy of these models. The results show that LLMs, especially when enhanced with well-designed prompts, significantly improve event-based zero-shot recognition performance. Notably, GPT-4o outperforms the compared models and exceeds the recognition accuracy of state-of-the-art event-based zero-shot methods on N-ImageNet by five orders of magnitude. The implementation of this paper is available at this https URL.

[CV-95] Enhancing Weakly-Supervised Object Detection on Static Images through (Hallucinated) Motion

链接: https://arxiv.org/abs/2409.09616
作者: Cagri Gungor,Adriana Kovashka
关键词-EN: images remains unexplored, weakly-supervised object detection, static images remains, remains unexplored, garnered attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While motion has garnered attention in various tasks, its potential as a modality for weakly-supervised object detection (WSOD) in static images remains unexplored. Our study introduces an approach to enhance WSOD methods by integrating motion information. This method involves leveraging hallucinated motion from static images to improve WSOD on image datasets, utilizing a Siamese network for enhanced representation learning with motion, addressing camera motion through motion normalization, and selectively training images based on object motion. Experimental validation on the COCO and YouTube-BB datasets demonstrates improvements over a state-of-the-art method.

[CV-96] Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition

链接: https://arxiv.org/abs/2409.09611
作者: Cagri Gungor,Adriana Kovashka
关键词-EN: First-person activity recognition, rapidly growing due, First-person activity, background scenes, rapidly growing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.

[CV-97] TextureDiffusion: Target Prompt Disentangled Editing for Various Texture Transfer

链接: https://arxiv.org/abs/2409.09610
作者: Zihan Su,Junhao Zhuang,Chun Yuan
关键词-EN: achieved significant success, text-guided image editing, significant success, achieved significant, input image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, text-guided image editing has achieved significant success. However, existing methods can only apply simple textures like wood or gold when changing the texture of an object. Complex textures such as cloud or fire pose a challenge. This limitation stems from that the target prompt needs to contain both the input image content and texture, restricting the texture representation. In this paper, we propose TextureDiffusion, a tuning-free image editing method applied to various texture transfer. Initially, the target prompt is directly set to “texture”, making the texture disentangled from the input image content to enhance texture representation. Subsequently, query features in self-attention and features in residual blocks are utilized to preserve the structure of the input image. Finally, to maintain the background, we introduce an edit localization technique which blends the self-attention results and the intermediate latents. Comprehensive experiments demonstrate that TextureDiffusion can harmoniously transfer various textures with excellent structure and background preservation.

[CV-98] DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion ECCV2024

链接: https://arxiv.org/abs/2409.09605
作者: Liao Shen,Tianqi Liu,Huiqiang Sun,Xinyi Ye,Baopu Li,Jianming Zhang,Zhiguo Cao
关键词-EN: generating intermediate images, large motion, study the problem, problem of generating, maintaining semantic consistency
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:We study the problem of generating intermediate images from image pairs with large motion while maintaining semantic consistency. Due to the large motion, the intermediate semantic information may be absent in input images. Existing methods are either limited to small motion or focus on topologically similar objects, leading to artifacts and inconsistency in the interpolation results. To overcome this challenge, we delve into pre-trained image diffusion models for their capabilities in semantic cognition and representations, ensuring consistent expression of the absent intermediate semantic representations with the input. To this end, we propose DreamMover, a novel image interpolation framework with three main components: 1) A natural flow estimator based on the diffusion model that can implicitly reason about the semantic correspondence between two images. 2) To avoid the loss of detailed information during fusion, our key insight is to fuse information in two parts, high-level space and low-level space. 3) To enhance the consistency between the generated images and input, we propose the self-attention concatenation and replacement approach. Lastly, we present a challenging benchmark dataset InterpBench to evaluate the semantic consistency of generated results. Extensive experiments demonstrate the effectiveness of our method. Our project is available at this https URL.

[CV-99] One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

链接: https://arxiv.org/abs/2409.09593
作者: Dongqi Fan,Tao Chen,Mingjie Wang,Rui Ma,Qiang Tang,Zili Yi,Qian Wang,Liang Chang
关键词-EN: Current Pose-Guided Person, Person Image Synthesis, labeled triplet data, Pose-Guided Person Image, Current Pose-Guided
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current Pose-Guided Person Image Synthesis (PGPIS) methods depend heavily on large amounts of labeled triplet data to train the generator in a supervised manner. However, they often falter when applied to in-the-wild samples, primarily due to the distribution gap between the training datasets and real-world test samples. While some researchers aim to enhance model generalizability through sophisticated training procedures, advanced architectures, or by creating more diverse datasets, we adopt the test-time fine-tuning paradigm to customize a pre-trained Text2Image (T2I) model. However, naively applying test-time tuning results in inconsistencies in facial identities and appearance attributes. To address this, we introduce a Visual Consistency Module (VCM), which enhances appearance consistency by combining the face, text, and image embedding. Our approach, named OnePoseTrans, requires only a single source image to generate high-quality pose transfer results, offering greater stability than state-of-the-art data-driven methods. For each test case, OnePoseTrans customizes a model in around 48 seconds with an NVIDIA V100 GPU.

[CV-100] GLCONet: Learning Multi-source Perception Representation for Camouflaged Object Detection

链接: https://arxiv.org/abs/2409.09588
作者: Yanguang Sun,Hanyu Xuan,Jian Yang,Lei Luo
关键词-EN: camouflaged object detection, powerful tool, tool for handling, Collaborative Optimization Network, Recently
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at TNNLS 2024

点击查看摘要

Abstract:Recently, biological perception has been a powerful tool for handling the camouflaged object detection (COD) task. However, most existing methods are heavily dependent on the local spatial information of diverse scales from convolutional operations to optimize initial features. A commonly neglected point in these methods is the long-range dependencies between feature pixels from different scale spaces that can help the model build a global structure of the object, inducing a more precise image representation. In this paper, we propose a novel Global-Local Collaborative Optimization Network, called GLCONet. Technically, we first design a collaborative optimization strategy from the perspective of multi-source perception to simultaneously model the local details and global long-range relationships, which can provide features with abundant discriminative information to boost the accuracy in detecting camouflaged objects. Furthermore, we introduce an adjacent reverse decoder that contains cross-layer aggregation and reverse optimization to integrate complementary information from different levels for generating high-quality representations. Extensive experiments demonstrate that the proposed GLCONet method with different backbones can effectively activate potentially significant pixels in an image, outperforming twenty state-of-the-art methods on three public COD datasets. The source code is available at: this https URL.

[CV-101] NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training

链接: https://arxiv.org/abs/2409.09582
作者: Yiyi Tao,Zhuoyue Wang,Hang Zhang,Lun Wang
关键词-EN: Vision Language Models, success of Vision, Vision Language, scale web-crawled datasets, tasks heavily relies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The success of Vision Language Models (VLMs) on various vision-language tasks heavily relies on pre-training with large scale web-crawled datasets. However, the noisy and incomplete nature of web data makes dataset scale crucial for performance, rendering end-to-end training increasingly prohibitive. In this paper, we propose NEVLP, a noise-robust framework for efficient vision-language pre-training that requires less pre-training data. Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer and introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning to mitigate the impact of noise. In noise-adaptive learning, we estimate the noise probability of each image-text pair based on the transformer’s memorization effect and employ noise-adaptive regularization on image-text contrastive learning to condition cross-modal alignment. In concept-enhanced learning, we enrich incomplete text by incorporating visual concepts (objects in the image) to provide prior information about existing objects for image-text matching and image-grounded text generation, thereby mitigating text incompletion. Our framework effectively utilizes noisy web data and achieves state-of-the-art performance with less pre-training data across a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering.
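
One simple reading of noise-adaptive regularization is to down-weight each image-text pair's contrastive term by its estimated noise probability. The sketch below implements such a weighted InfoNCE loss; how the noise probabilities are estimated (via the memorization effect) is not reproduced here, and the linear weighting is an assumption rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def noise_weighted_itc(img_emb, txt_emb, noise_prob, temperature=0.07):
    """Image-text contrastive loss where pairs believed to be noisy contribute less.
    img_emb, txt_emb: (B, D) embeddings; noise_prob: (B,) in [0, 1], estimated elsewhere."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    per_pair = F.cross_entropy(logits, labels, reduction="none") \
             + F.cross_entropy(logits.t(), labels, reduction="none")
    weights = 1.0 - noise_prob                       # clean pairs get full weight
    return (weights * per_pair).sum() / weights.sum().clamp_min(1e-6)
```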

[CV-102] Bias Begets Bias: The Impact of Biased Embeddings on Diffusion Models

链接: https://arxiv.org/abs/2409.09569
作者: Sahil Kuchlous,Marvin Li,Jeffrey G. Wang
关键词-EN: increased scrutiny, growing adoption, diffusion models, models, diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:With the growing adoption of Text-to-Image (TTI) systems, the social biases of these models have come under increased scrutiny. Herein we conduct a systematic investigation of one such source of bias for diffusion models: embedding spaces. First, because traditional classifier-based fairness definitions require true labels not present in generative modeling, we propose statistical group fairness criteria based on a model’s internal representation of the world. Using these definitions, we demonstrate theoretically and empirically that an unbiased text embedding space for input prompts is a necessary condition for representationally balanced diffusion models, meaning the distribution of generated images satisfy diversity requirements with respect to protected attributes. Next, we investigate the impact of biased embeddings on evaluating the alignment between generated images and prompts, a process which is commonly used to assess diffusion models. We find that biased multimodal embeddings like CLIP can result in lower alignment scores for representationally balanced TTI models, thus rewarding unfair behavior. Finally, we develop a theoretical framework through which biases in alignment evaluation can be studied and propose bias mitigation methods. By specifically adapting the perspective of embedding spaces, we establish new fairness conditions for diffusion model development and evaluation.

[CV-103] Learning Transferable Features for Implicit Neural Representations

链接: https://arxiv.org/abs/2409.09566
作者: Kushal Vyas,Ahmed Imtiaz Humayun,Aniket Dashpute,Richard G. Baraniuk,Ashok Veeraraghavan,Guha Balakrishnan
关键词-EN: Implicit neural representations, Implicit neural, variety of applications, learned neural features, demonstrated success
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Implicit neural representations (INRs) have demonstrated success in a variety of applications, including inverse problems and neural rendering. An INR is typically trained to capture one signal of interest, resulting in learned neural features that are highly attuned to that signal. Although such features are often assumed to be less generalizable, we explore their transferability for fitting similar signals. We introduce a new INR training framework, STRAINER, that learns transferable features for fitting INRs to new signals from a given distribution, faster and with better reconstruction quality. Owing to the sequential layer-wise affine operations in an INR, we propose to learn transferable representations by sharing initial encoder layers across multiple INRs with independent decoder layers. At test time, the learned encoder representations are transferred as initialization for an otherwise randomly initialized INR. We find STRAINER to yield extremely powerful initialization for fitting images from the same domain and allow for approximately +10 dB gain in signal quality early on compared to an untrained INR itself. STRAINER also provides a simple way to encode data-driven priors in INRs. We evaluate STRAINER on multiple in-domain and out-of-domain signal fitting tasks and inverse problems and further provide detailed analysis and discussion on the transferability of STRAINER’s features. Our demo can be accessed at this https URL.
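
The sharing scheme, initial encoder layers shared across several INRs while each keeps its own decoder head, can be sketched in a few lines of PyTorch. Layer widths, the activation choice, and the number of signals below are illustrative assumptions, not STRAINER's exact configuration.

```python
import torch
import torch.nn as nn

class SharedEncoderINR(nn.Module):
    """Several coordinate-based INRs that share initial encoder layers but keep
    independent decoder heads, so the encoder learns transferable features."""
    def __init__(self, num_signals=4, hidden=256, depth=3):
        super().__init__()
        layers = [nn.Linear(2, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.encoder = nn.Sequential(*layers)                    # shared across signals
        self.decoders = nn.ModuleList(
            [nn.Linear(hidden, 3) for _ in range(num_signals)]   # one RGB head per signal
        )

    def forward(self, coords, signal_idx):
        return self.decoders[signal_idx](self.encoder(coords))

model = SharedEncoderINR()
coords = torch.rand(1024, 2)          # (x, y) pixel coordinates in [0, 1]
rgb = model(coords, signal_idx=0)     # fitting one image; encoder gradients are shared
```

At test time, the trained `encoder` weights would serve as the initialization for a fresh INR on a new signal, which is the transfer step the abstract describes.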

[CV-104] TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

链接: https://arxiv.org/abs/2409.09564
作者: Dawei Yan,Pengcheng Li,Yang Li,Hao Chen,Qingguo Chen,Weihua Luo,Wei Dong,Qingsen Yan,Haokui Zhang,Chunhua Shen
关键词-EN: achieved promising results, vision encoder, success of vision-language, increasing number, number of researchers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in generating better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our proposed method can bring more benefits to the baseline (LLaVA-1.5) compared with other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.

[CV-105] Evaluating authenticity and quality of image captions via sentiment and semantic analyses

链接: https://arxiv.org/abs/2409.09560
作者: Aleksei Krotov,Alison Tebo,Dylan K. Picart,Aaron Dean Algave
关键词-EN: natural language processing, relies heavily, growth of deep, heavily on huge, huge amounts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growth of deep learning (DL) relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision. Specifically, in image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions. Additionally, learning may be affected by the variety and diversity of the provided captions. While labelling large datasets has largely relied on crowd-sourcing or data-worker pools, evaluating the quality of such training data is crucial. This study proposes an evaluation method focused on sentiment and semantic richness. That method was applied to the COCO-MS dataset, comprising approximately 150K images with segmented objects and corresponding crowd-sourced captions. We employed pre-trained models (Twitter-RoBERTa-base and BERT-base) to extract sentiment scores and variability of semantic embeddings from captions. The relation of the sentiment score and semantic variability with object categories was examined using multiple linear regression. Results indicate that while most captions were neutral, about 6% of the captions exhibited strong sentiment influenced by specific object categories. Semantic variability of within-image captions remained low and uncorrelated with object categories. Model-generated captions showed less than 1.5% of strong sentiment which was not influenced by object categories and did not correlate with the sentiment of the respective human-generated captions. This research demonstrates an approach to assess the quality of crowd- or worker-sourced captions informed by image content.

[CV-106] Enhancing Printed Circuit Board Defect Detection through Ensemble Learning

链接: https://arxiv.org/abs/2409.09555
作者: Ka Nam Canaan Law,Mingshuo Yu,Lianglei Zhang,Yiyi Zhang,Peng Xu,Jerry Gao,Jun Liu
关键词-EN: printed circuit boards, electronic device technology, advancing electronic device, circuit boards, device technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The quality control of printed circuit boards (PCBs) is paramount in advancing electronic device technology. While numerous machine learning methodologies have been utilized to augment defect detection efficiency and accuracy, previous studies have predominantly focused on optimizing individual models for specific defect types, often overlooking the potential synergies between different approaches. This paper introduces a comprehensive inspection framework leveraging an ensemble learning strategy to address this gap. Initially, we utilize four distinct PCB defect detection models utilizing state-of-the-art methods: EfficientDet, MobileNet SSDv2, Faster RCNN, and YOLOv5. Each method is capable of identifying PCB defects independently. Subsequently, we integrate these models into an ensemble learning framework to enhance detection performance. A comparative analysis reveals that our ensemble learning framework significantly outperforms individual methods, achieving a 95% accuracy in detecting diverse PCB defects. These findings underscore the efficacy of our proposed ensemble learning framework in enhancing PCB quality control processes.
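
A minimal way to ensemble several independently trained detectors is to pool their predictions and run class-wise NMS. The sketch below assumes each model returns boxes, scores, and labels in a common format; it is a generic fusion baseline, not necessarily the combination rule used in the paper.

```python
import torch
from torchvision.ops import nms

def ensemble_detections(per_model_outputs, iou_thr=0.5):
    """per_model_outputs: list of dicts with 'boxes' (N,4), 'scores' (N,), 'labels' (N,).
    Pools all predictions from every detector, then applies class-wise NMS to merge duplicates."""
    boxes = torch.cat([o["boxes"] for o in per_model_outputs])
    scores = torch.cat([o["scores"] for o in per_model_outputs])
    labels = torch.cat([o["labels"] for o in per_model_outputs])
    keep_all = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        keep = nms(boxes[idx], scores[idx], iou_thr)   # suppress overlapping duplicates per class
        keep_all.append(idx[keep])
    keep_all = torch.cat(keep_all)
    return boxes[keep_all], scores[keep_all], labels[keep_all]
```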

[CV-107] An Augmentation-based Model Re-adaptation Framework for Robust Image Segmentation ECCV

链接: https://arxiv.org/abs/2409.09530
作者: Zheming Zuo,Joseph Smith,Jonathan Stonehouse,Boguslaw Obara
关键词-EN: computer vision, Image segmentation, crucial task, task in computer, applications in industry
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in the European Conference on Computer Vision (ECCV) 2024 workshop

点击查看摘要

Abstract:Image segmentation is a crucial task in computer vision, with wide-ranging applications in industry. The Segment Anything Model (SAM) has recently attracted intensive attention; however, its application in industrial inspection, particularly for segmenting commercial anti-counterfeit codes, remains challenging. Unlike open-source datasets, industrial settings often face issues such as small sample sizes and complex textures. Additionally, computational cost is a key concern due to the varying number of trainable parameters. To address these challenges, we propose an Augmentation-based Model Re-adaptation Framework (AMRF). This framework leverages data augmentation techniques during training to enhance the generalisation of segmentation models, allowing them to adapt to newly released datasets with temporal disparity. By observing segmentation masks from conventional models (FCN and U-Net) and a pre-trained SAM model, we determine a minimal augmentation set that optimally balances training efficiency and model performance. Our results demonstrate that the fine-tuned FCN surpasses its baseline by 3.29% and 3.02% in cropping accuracy, and 5.27% and 4.04% in classification accuracy on two temporally continuous datasets. Similarly, the fine-tuned U-Net improves upon its baseline by 7.34% and 4.94% in cropping, and 8.02% and 5.52% in classification. Both models outperform the top-performing SAM models (ViT-Large and ViT-Base) by an average of 11.75% and 9.01% in cropping accuracy, and 2.93% and 4.83% in classification accuracy, respectively.

[CV-108] Enhancing Skin Disease Diagnosis: Interpretable Visual Concept Discovery with SAM Empowerment

链接: https://arxiv.org/abs/2409.09520
作者: Xin Hu,Janet Wang,Jihun Hamm,Rie R Yotsu,Zhengming Ding
关键词-EN: deep learning architectures, Current AI-assisted skin, Current AI-assisted, achieved dermatologist-level performance, classifying skin cancer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current AI-assisted skin image diagnosis has achieved dermatologist-level performance in classifying skin cancer, driven by rapid advancements in deep learning architectures. However, unlike traditional vision tasks, skin images in general present unique challenges due to the limited availability of well-annotated datasets, complex variations in conditions, and the necessity for detailed interpretations to ensure patient safety. Previous segmentation methods have sought to reduce image noise and enhance diagnostic performance, but these techniques require fine-grained, pixel-level ground truth masks for training. In contrast, with the rise of foundation models, the Segment Anything Model (SAM) has been introduced to facilitate promptable segmentation, enabling the automation of the segmentation process with simple yet effective prompts. Efforts applying SAM predominantly focus on dermatoscopy images, which present more easily identifiable lesion boundaries than clinical photos taken with smartphones. This limitation constrains the practicality of these approaches to real-world applications. To overcome the challenges posed by noisy clinical photos acquired via non-standardized protocols and to improve diagnostic accessibility, we propose a novel Cross-Attentive Fusion framework for interpretable skin lesion diagnosis. Our method leverages SAM to generate visual concepts for skin diseases using prompts, integrating local visual concepts with global image features to enhance model performance. Extensive evaluation on two skin disease datasets demonstrates our proposed method’s effectiveness on lesion diagnosis and interpretability.

[CV-109] One missing piece in Vision and Language: A Survey on Comics Understanding

链接: https://arxiv.org/abs/2409.09502
作者: Emanuele Vivoli,Andrey Barsky,Mohamed Ali Souibgui,Artemis LLabres,Marco Bertini,Dimosthenis Karatzas
关键词-EN: versatile systems capable, visual question answering, Comics Understanding, Comics, question answering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: under review. project website: this https URL

点击查看摘要

Abstract:Vision-language models have recently evolved into versatile systems capable of high performance across a range of tasks, such as document understanding, visual question answering, and grounding, often in zero-shot settings. Comics Understanding, a complex and multifaceted field, stands to greatly benefit from these advances. Comics, as a medium, combine rich visual and textual narratives, challenging AI models with tasks that span image classification, object detection, instance segmentation, and deeper narrative comprehension through sequential panels. However, the unique structure of comics – characterized by creative variations in style, reading order, and non-linear storytelling – presents a set of challenges distinct from those in other visual-language domains. In this survey, we present a comprehensive review of Comics Understanding from both dataset and task perspectives. Our contributions are fivefold: (1) We analyze the structure of the comics medium, detailing its distinctive compositional elements; (2) We survey the widely used datasets and tasks in comics research, emphasizing their role in advancing the field; (3) We introduce the Layer of Comics Understanding (LoCU) framework, a novel taxonomy that redefines vision-language tasks within comics and lays the foundation for future work; (4) We provide a detailed review and categorization of existing methods following the LoCU framework; (5) Finally, we highlight current research challenges and propose directions for future exploration, particularly in the context of vision-language models applied to comics. This survey is the first to propose a task-oriented framework for comics intelligence and aims to guide future research by addressing critical gaps in data availability and task definition. A project associated with this survey is available at this https URL.

[CV-110] Multi-Scale Grouped Prototypes for Interpretable Semantic Segmentation

链接: https://arxiv.org/abs/2409.09497
作者: Hugo Porta,Emanuele Dalsasso,Diego Marcos,Devis Tuia
关键词-EN: Prototypical part learning, making semantic segmentation, promising approach, approach for making, Prototypical part
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity between parts of the test image and the prototypes. This improves interpretability since the user can inspect the link between the predicted output and the patterns learned by the model in terms of prototypical information. In this paper, we propose a method for interpretable semantic segmentation that leverages multi-scale image representation for prototypical part learning. First, we introduce a prototype layer that explicitly learns diverse prototypical parts at several scales, leading to multi-scale representations in the prototype activation output. Then, we propose a sparse grouping mechanism that produces multi-scale sparse groups of these scale-specific prototypical parts. This provides a deeper understanding of the interactions between multi-scale object representations while enhancing the interpretability of the segmentation model. The experiments conducted on Pascal VOC, Cityscapes, and ADE20K demonstrate that the proposed method increases model sparsity, improves interpretability over existing prototype-based methods, and narrows the performance gap with the non-interpretable counterpart models. Code is available at this http URL.

[CV-111] MAC-VO: Metrics-aware Covariance for Learning-based Stereo Visual Odometry

链接: https://arxiv.org/abs/2409.09479
作者: Yuheng Qiu,Yutian Chen,Zihao Zhang,Wenshan Wang,Sebastian Scherer
关键词-EN: learned metrics-aware matching, metrics-aware matching uncertainty, pose graph optimization, dual purposes, weighing the residual
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose the MAC-VO, a novel learning-based stereo VO that leverages the learned metrics-aware matching uncertainty for dual purposes: selecting keypoints and weighing the residuals in pose graph optimization. Compared to traditional geometric methods prioritizing texture-affluent features like edges, our keypoint selector employs the learned uncertainty to filter out the low-quality features based on global inconsistency. In contrast to the learning-based algorithms that model the scale-agnostic diagonal weight matrix for covariance, we design a metrics-aware covariance model to capture the spatial error during keypoint registration and the correlations between different axes. Integrating this covariance model into pose graph optimization enhances the robustness and reliability of pose estimation, particularly in challenging environments with varying illumination, feature density, and motion patterns. On public benchmark datasets, MAC-VO outperforms existing VO algorithms and even some SLAM algorithms in challenging environments. The covariance map also provides valuable information about the reliability of the estimated poses, which can benefit decision-making for autonomous systems.

[CV-112] Learning Keypoints for Multi-Agent Behavior Analysis using Self-Supervision

链接: https://arxiv.org/abs/2409.09455
作者: Daniel Khalil,Christina Liu,Pietro Perona,Jennifer J. Sun,Markus Marks
关键词-EN: crucial in biology, study of social, social interactions, interactions and collective, keypoint
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The study of social interactions and collective behaviors through multi-agent video analysis is crucial in biology. While self-supervised keypoint discovery has emerged as a promising solution to reduce the need for manual keypoint annotations, existing methods often struggle with videos containing multiple interacting agents, especially those of the same species and color. To address this, we introduce B-KinD-multi, a novel approach that leverages pre-trained video segmentation models to guide keypoint discovery in multi-agent scenarios. This eliminates the need for time-consuming manual annotations on new experimental settings and organisms. Extensive evaluations demonstrate improved keypoint regression and downstream behavioral classification in videos of flies, mice, and rats. Furthermore, our method generalizes well to other species, including ants, bees, and humans, highlighting its potential for broad applications in automated keypoint annotation for multi-agent behavior analysis. Code available under: this https URL

[CV-113] On the Generalizability of Foundation Models for Crop Type Mapping

链接: https://arxiv.org/abs/2409.09451
作者: Yi-Chia Chang,Adam J. Stewart,Favyen Bastani,Piper Wolters,Shreya Kannan,George R. Huber,Jingtong Wang,Arindam Banerjee
关键词-EN: including language understanding, Foundation models pre-trained, shown powerful transfer, Foundation models, text generation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models pre-trained using self-supervised and weakly-supervised learning have shown powerful transfer learning capabilities on various downstream tasks, including language understanding, text generation, and image recognition. Recently, the Earth observation (EO) field has produced several foundation models pre-trained directly on multispectral satellite imagery (e.g., Sentinel-2) for applications like precision agriculture, wildfire and drought monitoring, and natural disaster response. However, few studies have investigated the ability of these models to generalize to new geographic locations, and potential concerns of geospatial bias – models trained on data-rich developed countries not transferring well to data-scarce developing countries – remain. We investigate the ability of popular EO foundation models to transfer to new geographic regions in the agricultural domain, where differences in farming practices and class imbalance make transfer learning particularly challenging. We first select six crop classification datasets across five continents, normalizing for dataset size and harmonizing classes to focus on four major cereal grains: maize, soybean, rice, and wheat. We then compare three popular foundation models, pre-trained on SSL4EO-S12, SatlasPretrain, and ImageNet, using in-distribution (ID) and out-of-distribution (OOD) evaluation. Experiments show that pre-trained weights designed explicitly for Sentinel-2, such as SSL4EO-S12, outperform general pre-trained weights like ImageNet. Furthermore, the benefits of pre-training on OOD data are the most significant when only 10–100 ID training samples are used. Transfer learning and pre-training with OOD and limited ID data show promising applications, as many developing regions have scarce crop type labels. All harmonized datasets and experimental code are open-source and available for download.

[CV-114] MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction

链接: https://arxiv.org/abs/2409.09446
作者: Yan Feng,Alexander Carballo,Keisuke Fujii,Robin Karlsson,Ming Ding,Kazuya Takeda
关键词-EN: autonomous driving, great significance, Pedestrian action prediction, concepts, Pedestrian action
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pedestrian action prediction is of great significance for many applications such as autonomous driving. However, state-of-the-art methods lack explainability to make trustworthy predictions. In this paper, a novel framework called MulCPred is proposed that explains its predictions based on multi-modal concepts represented by training samples. Previous concept-based methods have limitations including: 1) they cannot directly apply to multi-modal cases; 2) they lack locality to attend to details in the inputs; 3) they suffer from mode collapse. These limitations are tackled accordingly through the following approaches: 1) a linear aggregator to integrate the activation results of the concepts into predictions, which associates concepts of different modalities and provides ante-hoc explanations of the relevance between the concepts and the predictions; 2) a channel-wise recalibration module that attends to local spatiotemporal regions, which enables the concepts with locality; 3) a feature regularization loss that encourages the concepts to learn diverse patterns. MulCPred is evaluated on multiple datasets and tasks. Both qualitative and quantitative results demonstrate that MulCPred is promising in improving the explainability of pedestrian action prediction without obvious performance degradation. Furthermore, by removing unrecognizable concepts from MulCPred, the cross-dataset prediction performance is improved, indicating the feasibility of further generalizability of MulCPred.

[CV-115] KAN-HyperpointNet for Point Cloud Sequence-Based 3D Human Action Recognition

链接: https://arxiv.org/abs/2409.09444
作者: Zhaoyu Chen,Xing Li,Qian Huang,Qiang Geng,Tianjin Yang,Shihao Han
关键词-EN: Point cloud sequence-based, existing point cloud, Point cloud, achieved impressive performance, point cloud sequence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Point cloud sequence-based 3D action recognition has achieved impressive performance and efficiency. However, existing point cloud sequence modeling methods cannot adequately balance the precision of limb micro-movements with the integrity of posture macro-structure, leading to the loss of crucial information cues in action inference. To overcome this limitation, we introduce D-Hyperpoint, a novel data type generated through a D-Hyperpoint Embedding module. D-Hyperpoint encapsulates both regional-momentary motion and global-static posture, effectively summarizing the unit human action at each moment. In addition, we present a D-Hyperpoint KANsMixer module, which is recursively applied to nested groupings of D-Hyperpoints to learn the action discrimination information and creatively integrates Kolmogorov-Arnold Networks (KAN) to enhance spatio-temporal interaction within D-Hyperpoints. Finally, we propose KAN-HyperpointNet, a spatio-temporal decoupled network architecture for 3D action recognition. Extensive experiments on two public datasets: MSR Action3D and NTU-RGB+D 60, demonstrate the state-of-the-art performance of our method.

[CV-116] Detecting Looted Archaeological Sites from Satellite Image Time Series

链接: https://arxiv.org/abs/2409.09432
作者: Elliot Vincent,Mehraïl Saroufim,Jonathan Chemla,Yves Ubelmann,Philippe Marquis,Jean Ponce,Mathieu Aubry
关键词-EN: past human activity, Afghan archaeological sites, DAFA Looted Sites, past societies, Archaeological sites
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Archaeological sites are the physical remains of past human activity and one of the main sources of information about past societies and cultures. However, they are also the target of malevolent human actions, especially in countries having experienced inner turmoil and conflicts. Because monitoring these sites from space is a key step towards their preservation, we introduce the DAFA Looted Sites dataset, a labeled multi-temporal remote sensing dataset containing 55,480 images acquired monthly over 8 years across 675 Afghan archaeological sites, including 135 sites looted during the acquisition period. The dataset is particularly challenging because of the limited number of training samples, the class imbalance, the weak binary annotations only available at the level of the time series, and the subtlety of relevant changes coupled with important irrelevant ones over a long time period. It is also an interesting playground to assess the performance of satellite image time series (SITS) classification methods on a real and important use case. We evaluate a large set of baselines, outline the substantial benefits of using foundation models and show the additional boost that can be provided by using complete time series instead of using a single image.

[CV-117] Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval

链接: https://arxiv.org/abs/2409.09430
作者: Amirreza Mahbod,Nematollah Saeidi,Sepideh Hatamikia,Ramona Woitek
关键词-EN: Medical image retrieval, inexperienced medical practitioners, image retrieval refers, CBMIR, Medical image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 29 pages

点击查看摘要

Abstract:Medical image retrieval refers to the task of finding similar images for given query images in a database, with applications such as diagnosis support, treatment planning, and educational tools for inexperienced medical practitioners. While traditional medical image retrieval was performed using clinical metadata, content-based medical image retrieval (CBMIR) relies on the characteristic features of the images, such as color, texture, shape, and spatial features. Many approaches have been proposed for CBMIR, and among them, using pre-trained convolutional neural networks (CNNs) is a widely utilized approach. However, considering the recent advances in the development of foundation models for various computer vision tasks, their application for CBMIR can be also investigated for its potentially superior performance. In this study, we used several pre-trained feature extractors from well-known pre-trained CNNs (VGG19, ResNet-50, DenseNet121, and EfficientNetV2M) and pre-trained foundation models (MedCLIP, BioMedCLIP, OpenCLIP, CONCH and UNI) and investigated the CBMIR performance on a subset of the MedMNIST V2 dataset, including eight types of 2D and 3D medical images. Furthermore, we also investigated the effect of image size on the CBMIR performance. Our results show that, overall, for the 2D datasets, foundation models deliver superior performance by a large margin compared to CNNs, with UNI providing the best overall performance across all datasets and image sizes. For 3D datasets, CNNs and foundation models deliver more competitive performance, with CONCH achieving the best overall performance. Moreover, our findings confirm that while using larger image sizes (especially for 2D datasets) yields slightly better performance, competitive CBMIR performance can still be achieved even with smaller image sizes. Our codes to generate and reproduce the results are available on GitHub.
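
In its simplest form, using a pre-trained CNN as a CBMIR feature extractor means embedding every image with the classification head removed and ranking the gallery by cosine similarity. The sketch below uses a torchvision ResNet-50 as an example backbone (requires a recent torchvision); the preprocessing and pooling choices are assumptions, not the exact setup evaluated in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Pre-trained ResNet-50 with the classification head removed -> 2048-d embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), already resized and normalized; returns L2-normalized features."""
    return F.normalize(backbone(images), dim=1)

@torch.no_grad()
def retrieve(query: torch.Tensor, gallery_feats: torch.Tensor, k: int = 5):
    """Rank gallery images (features from embed()) by cosine similarity to one query image."""
    sims = embed(query.unsqueeze(0)) @ gallery_feats.t()     # (1, N) cosine similarities
    return sims.topk(k, dim=1).indices.squeeze(0)            # indices of the k nearest images
```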

[CV-118] NBBOX: Noisy Bounding Box Improves Remote Sensing Object Detection

链接: https://arxiv.org/abs/2409.09424
作者: Yechan Kim,SooYeon Kim,Moongu Jeon
关键词-EN: insufficient data, bounding box, significant advancements, advancements in computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data augmentation has seen significant advancements in computer vision to improve model performance over the years, particularly in scenarios with limited and insufficient data. Currently, most studies focus on adjusting the image or its features to expand the size, quality, and variety of samples during training in various tasks including object detection. However, we argue that it is necessary to investigate bounding box transformations as a model regularization technique rather than image-level transformations, especially in aerial imagery due to potentially inconsistent bounding box annotations. Hence, this letter presents a thorough investigation of bounding box transformation in terms of scaling, rotation, and translation for remote sensing object detection. We call this augmentation strategy NBBOX (Noise Injection into Bounding Box). We conduct extensive experiments on DOTA and DIOR-R, both well-known datasets that include a variety of rotated generic objects in aerial images. Experimental results show that our approach significantly improves remote sensing object detection without bells and whistles, and it is more time-efficient than other state-of-the-art augmentation strategies.
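
As a rough illustration of the idea of perturbing the boxes rather than the image, the sketch below injects small random scaling, rotation, and translation noise into an oriented box; the box format (cx, cy, w, h, angle) and the noise ranges are assumptions for illustration, not the paper's settings.

```python
import random

def nbbox_augment(box, max_shift=2.0, max_scale=0.05, max_rot_deg=3.0):
    """Inject small translation, scaling, and rotation noise into one oriented box.

    `box` is assumed to be (cx, cy, w, h, angle_deg); the noise ranges here are
    illustrative placeholders rather than the values used in the NBBOX paper.
    """
    cx, cy, w, h, angle = box
    cx += random.uniform(-max_shift, max_shift)
    cy += random.uniform(-max_shift, max_shift)
    w *= 1.0 + random.uniform(-max_scale, max_scale)
    h *= 1.0 + random.uniform(-max_scale, max_scale)
    angle += random.uniform(-max_rot_deg, max_rot_deg)
    return (cx, cy, w, h, angle)

# Applied once per training iteration to the ground-truth boxes, leaving the image untouched.
gt_boxes = [(120.0, 80.0, 40.0, 20.0, 15.0)]
noisy_boxes = [nbbox_augment(b) for b in gt_boxes]
```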

[CV-119] Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations

链接: https://arxiv.org/abs/2409.09412
作者: David Tschirschwitz,Volker Rodehorst
关键词-EN: machine learning models, training of machine, machine learning, label convergence, Label
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Annotation errors are a challenge not only during training of machine learning models, but also during their evaluation. Label variations and inaccuracies in datasets often manifest as contradictory examples that deviate from established labeling conventions. Such inconsistencies, when significant, prevent models from achieving optimal performance on metrics such as mean Average Precision (mAP). We introduce the notion of “label convergence” to describe the highest achievable performance under the constraint of contradictory test annotations, essentially defining an upper bound on model accuracy. Recognizing that noise is an inherent characteristic of all data, our study analyzes five real-world datasets, including the LVIS dataset, to investigate the phenomenon of label convergence. We approximate that label convergence is between 62.63-67.52 mAP@[0.5:0.95:0.05] for LVIS with 95% confidence, attributing these bounds to the presence of real annotation errors. With current state-of-the-art (SOTA) models at the upper end of the label convergence interval for the well-studied LVIS dataset, we conclude that model capacity is sufficient to solve current object detection problems. Therefore, future efforts should focus on three key aspects: (1) updating the problem specification and adjusting evaluation practices to account for unavoidable label noise, (2) creating cleaner data, especially test data, and (3) including multi-annotated data to investigate annotation variation and make these issues visible from the outset.

[CV-120] Real-world Adversarial Defense against Patch Attacks based on Diffusion Model

链接: https://arxiv.org/abs/2409.09406
作者: Xingxing Wei,Caixin Kang,Yinpeng Dong,Zhengyi Wang,Shouwei Ruan,Yubo Chen,Hang Su
关键词-EN: deep learning models, diffusion model, Adversarial patches present, present significant challenges, Adversarial Anomaly Perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial patches present significant challenges to the robustness of deep learning models, making the development of effective defenses critical for real-world applications. This paper introduces DIFFender, a novel DIFfusion-based DeFender framework that leverages the power of a text-guided diffusion model to counter adversarial patch attacks. At the core of our approach is the discovery of the Adversarial Anomaly Perception (AAP) phenomenon, which enables the diffusion model to accurately detect and locate adversarial patches by analyzing distributional anomalies. DIFFender seamlessly integrates the tasks of patch localization and restoration within a unified diffusion model framework, enhancing defense efficacy through their close interaction. Additionally, DIFFender employs an efficient few-shot prompt-tuning algorithm, facilitating the adaptation of the pre-trained diffusion model to defense tasks without the need for extensive retraining. Our comprehensive evaluation, covering image classification and face recognition tasks, as well as real-world scenarios, demonstrates DIFFender's robust performance against adversarial attacks. The framework's versatility and generalizability across various settings, classifiers, and attack methodologies mark a significant advancement in adversarial patch defense strategies. Beyond the commonly studied visible domain, we have identified another advantage of DIFFender: its capability to extend easily into the infrared domain. Consequently, we demonstrate the flexibility of DIFFender, which can defend against both infrared and visible adversarial patch attacks within a single universal defense framework.

[CV-121] AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction

链接: https://arxiv.org/abs/2409.09403
作者: Tianlong Xu,Yi-Fan Zhang,Zhendong Chu,Shen Wang,Qingsong Wen
关键词-EN: solving mathematical problems, frequently make mistakes, Students frequently make, mathematical problems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Students frequently make mistakes while solving mathematical problems, and traditional error correction methods are both time-consuming and labor-intensive. This paper introduces an innovative Virtual AI Teacher system designed to autonomously analyze and correct student Errors (VATE). Leveraging advanced large language models (LLMs), the system uses student drafts as a primary source for error analysis, which enhances understanding of the student's learning process. It incorporates sophisticated prompt engineering and maintains an error pool to reduce computational overhead. The AI-driven system also features a real-time dialogue component for efficient student interaction. Our approach demonstrates significant advantages over traditional and machine learning-based error correction methods, including reduced educational costs, high scalability, and superior generalizability. The system has been deployed on the Squirrel AI learning platform for elementary mathematics education, where it achieves 78.3% accuracy in error analysis and shows a marked improvement in student learning efficiency. Satisfaction surveys indicate a strong positive reception, highlighting the system's potential to transform educational practices.

[CV-122] Tran-GCN: A Transformer-Enhanced Graph Convolutional Network for Person Re-Identification in Monitoring Videos

链接: https://arxiv.org/abs/2409.09391
作者: Xiaobin Hong,Tarmizi Adam,Masitah Ghazali
关键词-EN: cross-camera pedestrian recognition, enabling cross-camera pedestrian, Graph Convolutional Network, Transformer-enhanced Graph Convolutional, Graph Convolutional Module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Person Re-Identification (Re-ID) has gained popularity in computer vision, enabling cross-camera pedestrian recognition. Although the development of deep learning has provided a robust technical foundation for person Re-ID research, most existing person Re-ID methods overlook the potential relationships among local person features, failing to adequately address the impact of pedestrian pose variations and local body parts occlusion. Therefore, we propose a Transformer-enhanced Graph Convolutional Network (Tran-GCN) model to improve Person Re-Identification performance in monitoring videos. The model comprises four key components: (1) A Pose Estimation Learning branch is utilized to estimate pedestrian pose information and inherent skeletal structure data, extracting pedestrian key point information; (2) A Transformer learning branch learns the global dependencies between fine-grained and semantically meaningful local person features; (3) A Convolution learning branch uses the basic ResNet architecture to extract the person’s fine-grained local features; (4) A Graph Convolutional Module (GCM) integrates local feature information, global feature information, and body information for more effective person identification after fusion. Quantitative and qualitative analysis experiments conducted on three different datasets (Market-1501, DukeMTMC-ReID, and MSMT17) demonstrate that the Tran-GCN model can more accurately capture discriminative person features in monitoring videos, significantly improving identification accuracy.

[CV-123] AMBER – Advanced SegFormer for Multi-Band Image Segmentation: an application to Hyperspectral Imaging

链接: https://arxiv.org/abs/2409.09386
作者: Andrea Dosi,Massimo Brescia,Stefano Cavuoti,Mariarca D’Aniello,Michele Delli Veneri,Carlo Donadio,Adriano Ettari,Giuseppe Longo,Alvi Rownok,Luca Sannino,Maria Zampella
关键词-EN: Deep learning, enabling the extraction, learning has revolutionized, revolutionized the field, extraction of complex
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: submitted to Neural Computing Applications (Springer). Currently under review

点击查看摘要

Abstract:Deep learning has revolutionized the field of hyperspectral image (HSI) analysis, enabling the extraction of complex and hierarchical features. While convolutional neural networks (CNNs) have been the backbone of HSI classification, their limitations in capturing global contextual features have led to the exploration of Vision Transformers (ViTs). This paper introduces AMBER, an advanced SegFormer specifically designed for multi-band image segmentation. AMBER enhances the original SegFormer by incorporating three-dimensional convolutions to handle hyperspectral data. Our experiments, conducted on the Indian Pines, Pavia University, and PRISMA datasets, show that AMBER outperforms traditional CNN-based methods in terms of Overall Accuracy, Kappa coefficient, and Average Accuracy on the first two datasets, and achieves state-of-the-art performance on the PRISMA dataset.

[CV-124] Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology

链接: https://arxiv.org/abs/2409.09369
作者: Pei Liu,Luping Ji,Jiaxiang Gou,Bo Fu,Mao Ye
关键词-EN: Histopathology Whole-Slide Images, assess cancer prognosis, Histopathology Whole-Slide, Whole-Slide Images, tool to assess
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 pages, 11 tables, 6 figures

点击查看摘要

Abstract:Histopathology Whole-Slide Images (WSIs) provide an important tool to assess cancer prognosis in computational pathology (CPATH). While existing survival analysis (SA) approaches have made exciting progress, they are generally limited to adopting highly-expressive architectures and only coarse-grained patient-level labels to learn prognostic visual representations from gigapixel WSIs. Such a learning paradigm suffers from important performance bottlenecks when facing the scarce training data and standard multi-instance learning (MIL) framework currently present in CPATH. To break through this bottleneck, this paper proposes, for the first time, a new Vision-Language-based SA (VLSA) paradigm. Concretely, (1) VLSA is driven by pathology VL foundation models. It no longer relies on high-capability networks and shows the advantage of data efficiency. (2) On the vision end, VLSA encodes a prognostic language prior and then employs it as an auxiliary signal to guide the aggregation of prognostic visual features at the instance level, thereby compensating for the weak supervision in MIL. Moreover, given the characteristics of SA, we propose i) ordinal survival prompt learning to transform continuous survival labels into textual prompts; and ii) an ordinal incidence function as the prediction target to make SA compatible with VL-based prediction. VLSA's predictions can be interpreted intuitively by our Shapley values-based method. The extensive experiments on five datasets confirm the effectiveness of our scheme. Our VLSA could pave a new way for SA in CPATH by offering weakly-supervised MIL an effective means to learn valuable prognostic clues from gigapixel WSIs. Our source code is available at this https URL.

[CV-125] MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals

链接: https://arxiv.org/abs/2409.09366
作者: Lei Yu,Jintao Fei,Xinyi Liu,Yang Yao,Jun Zhao,Guoxin Wang,Xin Li
关键词-EN: exemplified by remote, remote photoplethysmography, pulse and respiration, respiration by analyzing, analyzing subtle
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Video-based physiology, exemplified by remote photoplethysmography (rPPG), extracts physiological signals such as pulse and respiration by analyzing subtle changes in video recordings. This non-contact, real-time monitoring method holds great potential for home settings. Despite the valuable contributions of public benchmark datasets to this technology, there is currently no dataset specifically designed for passive home monitoring. Existing datasets are often limited to close-up, static, frontal recordings and typically include only 1-2 physiological signals. To advance video-based physiology in real home settings, we introduce the MHAD dataset. It comprises 1,440 videos from 40 subjects, capturing 6 typical activities from 3 angles in a real home environment. Additionally, 5 physiological signals were recorded, making it a comprehensive video-based physiology dataset. MHAD is compatible with the rPPG-toolbox and has been validated using several unsupervised and supervised methods. Our dataset is publicly available at this https URL.

[CV-126] Beta-Sigma VAE: Separating beta and decoder variance in Gaussian variational autoencoder ICPR2024

链接: https://arxiv.org/abs/2409.09361
作者: Seunghwan Kim,Seungkyu Lee
关键词-EN: Variational autoencoder, established generative model, established generative, beta, VAE
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Accepted for ICPR 2024

点击查看摘要

Abstract:Variational autoencoder (VAE) is an established generative model but is notorious for its blurriness. In this work, we investigate the blurry output problem of VAE and resolve it, exploiting the variance of the Gaussian decoder and the β of beta-VAE. Specifically, we reveal that the indistinguishability of the decoder variance and β hinders appropriate analysis of the model by random likelihood value, and limits performance improvement by omitting the gain from β. To address the problem, we propose Beta-Sigma VAE (BS-VAE) that explicitly separates β and the decoder variance σ_x² in the model. Our method demonstrates not only superior performance in natural image synthesis but also controllable parameters and predictable analysis compared to conventional VAE. In our experimental evaluation, we employ the analysis of rate-distortion curve and proxy metrics on computer vision datasets. The code is available on this https URL
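
Below is a hedged sketch of the loss decomposition the abstract describes: a Gaussian decoder whose variance σ_x² is an explicit learnable parameter, kept separate from the β that weights the KL term. The scalar variance, tensor shapes, and dropped constant terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def bs_vae_loss(x, x_recon, mu, logvar, log_sigma_x, beta=1.0):
    """ELBO for a Gaussian-decoder VAE with decoder variance kept separate from beta.

    `log_sigma_x` parameterises the decoder variance sigma_x^2 = exp(2*log_sigma_x)
    (a learnable scalar here for simplicity), while `beta` independently weights the
    KL term; constant log(2*pi) terms are dropped. Shapes assume image batches
    (B, C, H, W) and diagonal-Gaussian latents.
    """
    sigma2 = torch.exp(2.0 * log_sigma_x)
    # Gaussian negative log-likelihood of the reconstruction, summed per sample.
    nll = 0.5 * (((x - x_recon) ** 2) / sigma2 + 2.0 * log_sigma_x).sum(dim=[1, 2, 3])
    # KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dimensions.
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (nll + beta * kl).mean()

# Toy usage with a learnable decoder log-std shared across pixels.
log_sigma_x = torch.nn.Parameter(torch.zeros(()))
x = torch.rand(4, 3, 32, 32)
loss = bs_vae_loss(x, x_recon=torch.rand(4, 3, 32, 32), mu=torch.zeros(4, 16),
                   logvar=torch.zeros(4, 16), log_sigma_x=log_sigma_x, beta=0.5)
```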

[CV-127] LACOSTE: Exploiting stereo and temporal contexts for surgical instrument segmentation

链接: https://arxiv.org/abs/2409.09360
作者: Qiyuan Wang,Shang Zhao,Zikang Xu,S Kevin Zhou
关键词-EN: minimally invasive surgeries, Surgical instrument segmentation, related applications, instrumental to minimally, minimally invasive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Preprint submitted to Medical Image Analysis

点击查看摘要

Abstract:Surgical instrument segmentation is instrumental to minimally invasive surgeries and related applications. Most previous methods formulate this task as single-frame-based instance segmentation while ignoring the natural temporal and stereo attributes of a surgical video. As a result, these methods are less robust against appearance variation caused by temporal motion and view change. In this work, we propose a novel LACOSTE model that exploits Location-Agnostic COntexts in Stereo and TEmporal images for improved surgical instrument segmentation. Leveraging a query-based segmentation model as the core, we design three performance-enhancing modules. Firstly, we design a disparity-guided feature propagation module to enhance depth-aware features explicitly. To generalize well even to monocular-only videos, we apply a pseudo-stereo scheme to generate complementary right images. Secondly, we propose a stereo-temporal set classifier, which aggregates stereo-temporal contexts in a universal way for making a consolidated prediction and mitigates transient failures. Finally, we propose a location-agnostic classifier to decouple the location bias from mask prediction and enhance the feature semantics. We extensively validate our approach on three public surgical video datasets, including two benchmarks from EndoVis Challenges and one real radical prostatectomy surgery dataset GraSP. Experimental results demonstrate the promising performance of our method, which consistently achieves comparable or favorable results relative to previous state-of-the-art approaches.

[CV-128] OPUS: Occupancy Prediction Using a Sparse Set

链接: https://arxiv.org/abs/2409.09350
作者: Jiabao Wang,Zhaojiang Liu,Qiang Meng,Liujiang Yan,Ke Wang,Jie Yang,Wei Liu,Qibin Hou,Ming-Ming Cheng
关键词-EN: autonomous driving community, quickly gaining momentum, Occupancy prediction, Mainstream occupancy prediction, occupancy prediction works
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Occupancy prediction, aiming at predicting the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection of sample data reveals that the vast majority of voxels are unoccupied. Performing classification on these empty voxels leads to suboptimal allocation of computational resources, and reducing such empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. Firstly, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making end-to-end training of such a model a reality. Subsequently, semantic classes are adaptively assigned using nearest neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
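
The set-prediction view above hinges on a set-to-set loss and a nearest-neighbour class assignment. The sketch below shows a generic Chamfer distance between predicted and ground-truth occupied locations plus the nearest-neighbour labelling step; OPUS's exact loss weighting, sampling, and coarse-to-fine schedule are not reproduced here.

```python
import torch

def chamfer_loss(pred_pts, gt_pts):
    """Symmetric Chamfer distance between predicted (P, 3) and ground-truth (G, 3) locations.

    A generic set-to-set loss of the kind the abstract describes, shown only for
    illustration; the paper's weighting and sampling strategy may differ.
    """
    d = torch.cdist(pred_pts, gt_pts)                       # (P, G) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def assign_classes(pred_pts, gt_pts, gt_labels):
    """Give each predicted location the semantic class of its nearest ground-truth voxel."""
    nn_idx = torch.cdist(pred_pts, gt_pts).argmin(dim=1)    # (P,)
    return gt_labels[nn_idx]

pred = torch.rand(128, 3, requires_grad=True)               # locations decoded from learnable queries
gt = torch.rand(200, 3)
gt_labels = torch.randint(0, 17, (200,))
chamfer_loss(pred, gt).backward()
classes = assign_classes(pred.detach(), gt, gt_labels)
```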

[CV-129] QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems

链接: https://arxiv.org/abs/2409.09348
作者: Zhixian He,Pengcheng Zhao,Fuwei Zhang,Shujin Lin
关键词-EN: video question answering, question types, VQA systems, critical importance, under-explored to date
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the domain of video question answering (VideoQA), the impact of question types on VQA systems, despite its critical importance, has been relatively under-explored to date. However, the richness of question types directly determines the range of concepts a model needs to learn, thereby affecting the upper limit of its learning capability. This paper focuses on exploring the significance of different question types for VQA systems and their impact on performance, revealing a series of issues such as insufficient learning and model degradation due to uneven distribution of question types. In particular, the dependency on temporal information varies significantly across question types, and representing such information is a principal challenge for VideoQA as opposed to ImageQA. To address these challenges, we propose QTG-VQA, a novel architecture that incorporates question-type-guided attention and an adaptive learning mechanism. Specifically, for temporal-type questions, we design a Masking Frame Modeling technique to enhance temporal modeling, aimed at encouraging the model to grasp richer visual-language relationships and manage more intricate temporal dependencies. Furthermore, a novel evaluation metric tailored to question types is introduced. Experimental results confirm the effectiveness of our approach.

[CV-130] VOMTC: Vision Objects for Millimeter and Terahertz Communications

链接: https://arxiv.org/abs/2409.09330
作者: Sunwoo Kim,Yongjun Ahn,Daeyoung Park,Byonghyo Shim
关键词-EN: Recent advances, deep learning, advances in sensing, sensing and computer, opened the door
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in sensing and computer vision (CV) technologies have opened the door for the application of deep learning (DL)-based CV technologies in the realm of 6G wireless communications. For the successful application of this emerging technology, it is crucial to have a qualified vision dataset tailored for wireless applications (e.g., RGB images containing wireless devices such as laptops and cell phones). The aim of this paper is to propose a large-scale vision dataset referred to as Vision Objects for Millimeter and Terahertz Communications (VOMTC). The VOMTC dataset consists of 20,232 pairs of RGB and depth images obtained from a camera attached to the base station (BS), with each pair labeled with three representative object categories (person, cell phone, and laptop) and bounding boxes of the objects. Through experimental studies of the VOMTC dataset, we show that the beamforming technique exploiting the VOMTC-trained object detector outperforms conventional beamforming techniques.

[CV-131] LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

链接: https://arxiv.org/abs/2409.09326
作者: Deng Junli,Luo Yihao,Yang Xueting,Li Siyou,Wang Wei,Guo Jinyang,Shi Ping
关键词-EN: photorealistic avatar generation, realistic virtual interactions, audio-driven lip motion, avatar generation, virtual interactions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the domain of photorealistic avatar generation, the fidelity of audio-driven lip motion synthesis is essential for realistic virtual interactions. Existing methods face two key challenges: a lack of vivacity due to limited diversity in generated lip poses and noticeable anamorphic motions caused by poor temporal coherence. To address these issues, we propose LawDNet, a novel deep-learning architecture enhancing lip synthesis through a Local Affine Warping Deformation mechanism. This mechanism models the intricate lip movements in response to the audio input by controllable non-linear warping fields. These fields consist of local affine transformations focused on abstract keypoints within deep feature maps, offering a novel universal paradigm for feature warping in networks. Additionally, LawDNet incorporates a dual-stream discriminator for improved frame-to-frame continuity and employs face normalization techniques to handle pose and scene variations. Extensive evaluations demonstrate LawDNet's superior robustness and lip movement dynamism performance compared to previous methods. The advancements presented in this paper, including the methodologies, training data, source codes, and pre-trained models, will be made accessible to the research community.

[CV-132] Implicit Neural Representations with Fourier Kolmogorov-Arnold Networks ICASSP2025

链接: https://arxiv.org/abs/2409.09323
作者: Ali Mehrabian,Parsa Mojarad Adi,Moein Heidari,Ilker Hacihaliloglu
关键词-EN: Implicit neural representations, Implicit neural, Fourier Kolmogorov Arnold, number of parameters, provide continuous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Implicit neural representations (INRs) use neural networks to provide continuous and resolution-independent representations of complex signals with a small number of parameters. However, existing INR models often fail to capture important frequency components specific to each task. To address this issue, in this paper, we propose a Fourier Kolmogorov Arnold network (FKAN) for INRs. The proposed FKAN utilizes learnable activation functions modeled as Fourier series in the first layer to effectively control and learn the task-specific frequency components. In addition, the activation functions with learnable Fourier coefficients improve the ability of the network to capture complex patterns and details, which is beneficial for high-resolution and high-dimensional data. Experimental results show that our proposed FKAN model outperforms three state-of-the-art baseline schemes, and improves the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) for the image representation task and intersection over union (IoU) for the 3D occupancy volume representation task, respectively.
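
As an illustration of a first layer whose activations are learnable Fourier series, the toy module below (one generic reading of the idea, not the paper's implementation) maps each input coordinate through a truncated Fourier series with learnable coefficients; the number of harmonics and the MLP head are placeholder choices.

```python
import torch
import torch.nn as nn

class FourierSeriesLayer(nn.Module):
    """First layer of an INR whose activations are learnable truncated Fourier series.

    Each output feature is sum_k a_k*sin(k*x_i) + b_k*cos(k*x_i), summed over input
    dims i and harmonics k, with learnable coefficients a, b. This is an assumed
    parameterisation for illustration; FKAN's exact design may differ.
    """
    def __init__(self, in_dim, out_dim, num_harmonics=8):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, num_harmonics + 1).float())
        self.a = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_harmonics))
        self.b = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_harmonics))

    def forward(self, x):                          # x: (batch, in_dim)
        arg = x.unsqueeze(-1) * self.freqs         # (batch, in_dim, K)
        return (torch.einsum("bik,oik->bo", torch.sin(arg), self.a)
                + torch.einsum("bik,oik->bo", torch.cos(arg), self.b))

# A tiny coordinate network mapping 2D pixel coordinates to RGB values.
inr = nn.Sequential(FourierSeriesLayer(2, 64), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 3))
rgb = inr(torch.rand(16, 2))
```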

[CV-133] ChildPlay-Hand: A Dataset of Hand Manipulations in the Wild

链接: https://arxiv.org/abs/2409.09319
作者: Arya Farkhondeh,Samy Tafasca,Jean-Marc Odobez
关键词-EN: gaining significant attention, numerous egocentric datasets, egocentric datasets driven, gaining significant, creation of numerous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hand-Object Interaction (HOI) is gaining significant attention, particularly with the creation of numerous egocentric datasets driven by AR/VR applications. However, third-person view HOI has received less attention, especially in terms of datasets. Most third-person view datasets are curated for action recognition tasks and feature pre-segmented clips of high-level daily activities, leaving a gap for in-the-wild datasets. To address this gap, we propose ChildPlay-Hand, a novel dataset that includes person and object bounding boxes, as well as manipulation actions. ChildPlay-Hand is unique in: (1) providing per-hand annotations; (2) featuring videos in uncontrolled settings with natural interactions, involving both adults and children; (3) including gaze labels from the ChildPlay-Gaze dataset for joint modeling of manipulations and gaze. The manipulation actions cover the main stages of an HOI cycle, such as grasping, holding or operating, and different types of releasing. To illustrate the interest of the dataset, we study two tasks: object in hand detection (OiH), i.e. if a person has an object in their hand, and manipulation stages (ManiS), which is more fine-grained and targets the main stages of manipulation. We benchmark various spatio-temporal and segmentation networks, exploring body vs. hand-region information and comparing pose and RGB modalities. Our findings suggest that ChildPlay-Hand is a challenging new benchmark for modeling HOI in the wild.

[CV-134] ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

链接: https://arxiv.org/abs/2409.09318
作者: Yahan Tu,Rui Hu,Jitao Sang
关键词-EN: multimodal large language, large language models, poses a significant, significant challenge, challenge for multimodal
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hallucination poses a significant challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are static, which can lead to potential data contamination. This paper introduces ODE, an open-set, dynamic protocol for evaluating object existence hallucinations in MLLMs. Our framework employs graph structures to model associations between real-world concepts and generates novel samples for both general and domain-specific scenarios. The dynamic combination of concepts, along with various combination principles, ensures a broad sample distribution. Experimental results show that MLLMs exhibit higher hallucination rates with ODE-generated samples, effectively avoiding data contamination. Moreover, these samples can also be used for fine-tuning to improve MLLM performance on existing benchmarks.

[CV-135] Tensor-Based Synchronization and the Low-Rankness of the Block Trifocal Tensor

链接: https://arxiv.org/abs/2409.09313
作者: Daniel Miao,Gilad Lerman,Joe Kileel
关键词-EN: block trifocal tensor, crucial geometric information, block trifocal, crucial geometric, three-view geometry
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 31 pages, 4 figures

点击查看摘要

Abstract:The block tensor of trifocal tensors provides crucial geometric information on the three-view geometry of a scene. The underlying synchronization problem seeks to recover camera poses (locations and orientations up to a global transformation) from the block trifocal tensor. We establish an explicit Tucker factorization of this tensor, revealing a low multilinear rank of (6,4,4) independent of the number of cameras under appropriate scaling conditions. We prove that this rank constraint provides sufficient information for camera recovery in the noiseless case. The constraint motivates a synchronization algorithm based on the higher-order singular value decomposition of the block trifocal tensor. Experimental comparisons with state-of-the-art global synchronization methods on real datasets demonstrate the potential of this algorithm for significantly improving location estimation accuracy. Overall this work suggests that higher-order interactions in synchronization problems can be exploited to improve performance, beyond the usual pairwise-based approaches.
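
The multilinear rank claim can be checked numerically with mode unfoldings, as in the generic helper below; the (6, 4, 4) core here is only a synthetic stand-in used to exercise the function, not the block trifocal tensor itself.

```python
import numpy as np

def multilinear_rank(T, tol=1e-8):
    """Numerical multilinear rank of a 3-way tensor via its three mode unfoldings.

    For the block trifocal tensor, the paper proves a rank of at most (6, 4, 4)
    under appropriate scaling; this generic helper only checks ranks numerically.
    """
    ranks = []
    for mode in range(3):
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        s = np.linalg.svd(unfolding, compute_uv=False)
        ranks.append(int((s > tol * s[0]).sum()))
    return tuple(ranks)

# Example on a random tensor built from a (6, 4, 4) Tucker core and factor matrices.
core = np.random.randn(6, 4, 4)
U = [np.random.randn(30, r) for r in (6, 4, 4)]
T = np.einsum("abc,ia,jb,kc->ijk", core, *U)
print(multilinear_rank(T))   # -> (6, 4, 4)
```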

[CV-136] Registration between Point Cloud Streams and Sequential Bounding Boxes via Gradient Descent

链接: https://arxiv.org/abs/2409.09312
作者: Xuesong Li,Xinge Zhu,Yuexin Ma,Subhan Khan,Jose Guivant
关键词-EN: point cloud, point cloud streams, registering sequential bounding, sequential bounding boxes, popular point cloud
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In this paper, we propose an algorithm for registering sequential bounding boxes with point cloud streams. Unlike popular point cloud registration techniques, the alignment of the point cloud and the bounding box can rely on the properties of the bounding box, such as size, shape, and temporal information, which provides substantial support and performance gains. Motivated by this, we propose a new approach to tackle this problem. Specifically, we model the registration process through an overall objective function that includes the final goal and all constraints. We then optimize the function using gradient descent. Our experiments show that the proposed method performs remarkably well with a 40% improvement in IoU and demonstrates more robust registration between point cloud streams and sequential bounding boxes.
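
A minimal sketch of the gradient-descent formulation, assuming a bird's-eye-view point cloud and a single box parameterised by its size: a translation and yaw are optimised with torch autograd so that the points fall inside the box. The objective here is a simplified placeholder; the paper's objective also incorporates box shape and temporal constraints across the sequence.

```python
import torch

def fit_pose(points, box_size, steps=200, lr=0.05):
    """Gradient-descent alignment of a BEV point cloud to a box of known size.

    `points` is (N, 2) in the sensor frame, `box_size` is (length, width). We
    optimise a translation t and yaw so that the points, expressed in the box
    frame, stay within the box extents. Illustrative objective only.
    """
    t = torch.zeros(2, requires_grad=True)
    yaw = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([t, yaw], lr=lr)
    half = torch.tensor(box_size) / 2.0
    for _ in range(steps):
        c, s = torch.cos(yaw), torch.sin(yaw)
        rot = torch.stack([torch.cat([c, -s]), torch.cat([s, c])])   # (2, 2)
        local = (points - t) @ rot                                   # points in the box frame
        overflow = torch.clamp(local.abs() - half, min=0.0)          # distance outside the extents
        loss = overflow.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return t.detach(), yaw.detach()

points = torch.randn(100, 2) * 0.5 + torch.tensor([3.0, 1.0])
t, yaw = fit_pose(points, box_size=(4.0, 2.0))
```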

[CV-137] Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models

链接: https://arxiv.org/abs/2409.09306
作者: Dewen Zhang,Wangpeng An,Hayaru Shouno
关键词-EN: general visual understanding, Current multimodal models, Current multimodal, well-suited for general, visual understanding tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current multimodal models are well-suited for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions, primarily due to the lack of specialized instruction-following data. We introduce a new method for generating such data by integrating human keypoints with traditional visual features like captions and bounding boxes. Our approach produces datasets designed for fine-tuning models to excel in human-centric activities, focusing on three specific types: conversation, detailed description, and complex reasoning. We fine-tuned the LLaVA-7B model with this novel dataset, achieving significant improvements across various human pose-related tasks. Experimental results show an overall improvement of 21.18% compared to the original LLaVA-7B model. These findings demonstrate the effectiveness of keypoints-assisted data in enhancing multimodal models.

[CV-138] ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion

链接: https://arxiv.org/abs/2409.09300
作者: Jiajun Zhang,Yuxiang Zhang,Liang An,Mengcheng Li,Hongwen Zhang,Zonghai Hu,Yebin Liu
关键词-EN: Dynamic and dexterous, hand, complex challenge, requiring the synchronization, presents a complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Dynamic and dexterous manipulation of objects presents a complex challenge, requiring the synchronization of hand motions with the trajectories of objects to achieve seamless and physically plausible interactions. In this work, we introduce ManiDext, a unified hierarchical diffusion-based framework for generating hand manipulation and grasp poses based on 3D object trajectories. Our key insight is that accurately modeling the contact correspondences between objects and hands during interactions is crucial. Therefore, we propose a continuous correspondence embedding representation that specifies detailed hand correspondences at the vertex level between the object and the hand. This embedding is optimized directly on the hand mesh in a self-supervised manner, with the distance between embeddings reflecting the geodesic distance. Our framework first generates contact maps and correspondence embeddings on the object’s surface. Based on these fine-grained correspondences, we introduce a novel approach that integrates the iterative refinement process into the diffusion process during the second stage of hand pose generation. At each step of the denoising process, we incorporate the current hand pose residual as a refinement target into the network, guiding the network to correct inaccurate hand poses. Introducing residuals into each denoising step inherently aligns with traditional optimization process, effectively merging generation and refinement into a single unified framework. Extensive experiments demonstrate that our approach can generate physically plausible and highly realistic motions for various tasks, including single and bimanual hand grasping as well as manipulating both rigid and articulated objects. Code will be available for research purposes.

[CV-139] Associate Everything Detected: Facilitating Tracking-by-Detection to the Unknown

链接: https://arxiv.org/abs/2409.09293
作者: Zimeng Fang,Chao Liang,Xue Zhou,Shuyuan Zhu,Xi Li
关键词-EN: highly promising branch, Multi-object tracking, computer vision, promising branch, field of computer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) emerges as a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Different from existing tracking-by-detection MOT methods, AED gets rid of prior knowledge (e.g. motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while keeping excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at this https URL.

[CV-140] StyleTalk: A Unified Framework for Controlling the Speaking Styles of Talking Heads

链接: https://arxiv.org/abs/2409.09292
作者: Suzhen Wang,Yifeng Ma,Yu Ding,Zhipeng Hu,Changjie Fan,Tangjie Lv,Zhidong Deng,Xin Yu
关键词-EN: Individuals have unique, speaking styles, speaking, head pose styles, personalized speaking styles
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: TPAMI 2024

点击查看摘要

Abstract:Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person’s talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.

[CV-141] Infrared and Visible Image Fusion with Hierarchical Human Perception

链接: https://arxiv.org/abs/2409.09291
作者: Guang Yang,Jie Li,Xin Liu,Zhusi Zhong,Xinbo Gao
关键词-EN: Large Vision-Language Model, Large Vision-Language, Vision-Language Model, Image fusion combines, fusion combines images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image fusion combines images from multiple domains into one image, containing complementary information from source domains. Existing methods take pixel intensity, texture and high-level vision task information as the standards to determine preservation of information, lacking enhancement for human perception. We introduce an image fusion method, Hierarchical Perception Fusion (HPFusion), which leverages a Large Vision-Language Model to incorporate hierarchical human semantic priors, preserving complementary information that satisfies the human visual system. We propose multiple questions that humans focus on when viewing an image pair, and answers are generated via the Large Vision-Language Model according to the images. The texts of answers are encoded into the fusion network, and the optimization also aims to guide the human semantic distribution of the fused image to be more similar to that of the source images, exploring complementary information within the human perception domain. Extensive experiments demonstrate that our HPFusion can achieve high-quality fusion results in terms of both information preservation and human visual enhancement.

[CV-142] SAM-OCTA2: Layer Sequence OCTA Segmentation with Fine-tuned Segment Anything Model 2

链接: https://arxiv.org/abs/2409.09286
作者: Xinrun Chen,Chengliang Wang,Haojian Ning,Mengzhan Zhang,Mei Shen,Shiying Li
关键词-EN: coherence tomography angiography, optical coherence tomography, tomography angiography, precise analysis, analysis of optical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Segmentation of indicated targets aids in the precise analysis of optical coherence tomography angiography (OCTA) samples. Existing segmentation methods typically perform on 2D projection targets, making it challenging to capture the variance of segmented objects through the 3D volume. To address this limitation, the low-rank adaptation technique is adopted to fine-tune the Segment Anything Model (SAM) version 2, enabling the tracking and segmentation of specified objects across the OCTA scanning layer sequence. To further this work, a prompt point generation strategy in frame sequence and a sparse annotation method to acquire retinal vessel (RV) layer masks are proposed. This method is named SAM-OCTA2 and has been experimented on the OCTA-500 dataset. It achieves state-of-the-art performance in segmenting the foveal avascular zone (FAZ) on regular 2D en-face and effectively tracks local vessels across scanning layer sequences. The code is available at: this https URL.
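
For readers unfamiliar with the low-rank adaptation step, here is a generic LoRA wrapper around a frozen linear layer; this sketches the technique in general, not how SAM-OCTA2 wires adapters into SAM 2, and the rank and alpha values are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W*x + (alpha/r) * B(A(x)).

    Generic LoRA sketch for illustration; where these adapters are injected into
    SAM 2's attention blocks is described in the paper, not reproduced here.
    """
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(256, 256))
out = layer(torch.randn(4, 256))                        # only A and B receive gradients
```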

[CV-143] LabellessFace: Fair Metric Learning for Face Recognition without Attribute Labels

链接: https://arxiv.org/abs/2409.09274
作者: Tetsushi Ohki,Yuya Sato,Masakatsu Nishigaki,Koichi Ito
关键词-EN: major challenges, Demographic, Demographic bias, face recognition, recognition systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Demographic bias is one of the major challenges for face recognition systems. The majority of existing studies on demographic biases are heavily dependent on specific demographic groups or a demographic classifier, making it difficult to address performance for unrecognised groups. This paper introduces "LabellessFace", a novel framework that mitigates demographic bias in face recognition without requiring the demographic group labeling typically required for fairness considerations. We propose a novel fairness enhancement metric called the class favoritism level, which assesses the extent of favoritism towards specific classes across the dataset. Leveraging this metric, we introduce the fair class margin penalty, an extension of existing margin-based metric learning. This method dynamically adjusts learning parameters based on class favoritism levels, promoting fairness across all attributes. By treating each class as an individual in facial recognition systems, we facilitate learning that minimizes biases in authentication accuracy among individuals. Comprehensive experiments have demonstrated that our proposed method is effective for enhancing fairness while maintaining authentication accuracy.
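
One plausible reading of the fair class margin penalty is an ArcFace-style head whose per-class margin grows with an estimated favoritism level; the sketch below implements that reading with an exponential moving average of per-class target cosine similarity. The favoritism definition and margin rule are assumptions, not the paper's formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FairMarginHead(nn.Module):
    """ArcFace-style head whose per-class margin is scaled by a favoritism estimate.

    `favoritism[c]` is an EMA of the mean cosine similarity between class-c samples
    and their class weight; classes that are already "easy" receive a larger margin.
    Hedged interpretation of the fair class margin penalty, not the paper's exact rule.
    """
    def __init__(self, feat_dim, num_classes, scale=64.0, base_margin=0.3, momentum=0.99):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale, self.base_margin, self.momentum = scale, base_margin, momentum
        self.register_buffer("favoritism", torch.zeros(num_classes))

    def forward(self, feats, labels):
        cos = F.normalize(feats) @ F.normalize(self.weight).T            # (B, C)
        target_cos = cos.gather(1, labels[:, None]).squeeze(1)           # (B,)
        with torch.no_grad():                                            # update favoritism EMA
            for c in labels.unique():
                m = target_cos[labels == c].mean()
                self.favoritism[c] = self.momentum * self.favoritism[c] + (1 - self.momentum) * m
        margin = self.base_margin * (1.0 + self.favoritism[labels])      # easy classes get a bigger margin
        theta = torch.acos(target_cos.clamp(-1 + 1e-7, 1 - 1e-7))
        margined = torch.cos(theta + margin)
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).bool()
        logits = torch.where(one_hot, margined[:, None].expand_as(cos), cos)
        return F.cross_entropy(self.scale * logits, labels)

head = FairMarginHead(feat_dim=512, num_classes=1000)
loss = head(torch.randn(8, 512), torch.randint(0, 1000, (8,)))
```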

[CV-144] Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks Domains and Knowledge Types

链接: https://arxiv.org/abs/2409.09269
作者: Neelabh Sinha,Vinija Jain,Aman Chadha
关键词-EN: aid user experience, Visual Question-Answering, achieving good results, user experience, zero-shot inference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 8 pages + references + 6 pages of Appendix

点击查看摘要

Abstract:Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieved good results in zero-shot inference. But evaluating different VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts, while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.

[CV-145] VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

链接: https://arxiv.org/abs/2409.09254
作者: Hongyu Sun,Yongcai Wang,Peng Wang,Haoran Deng,Xudong Cai,Deying Li
关键词-EN: View-based methods, demonstrated promising performance, methods have demonstrated, demonstrated promising, shape understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by TVCG 2024

点击查看摘要

Abstract:View-based methods have demonstrated promising performance in 3D shape understanding. However, they tend to make strong assumptions about the relations between views or learn the multi-view correlations indirectly, which limits the flexibility of exploring inter-view correlations and the effectiveness of target tasks. To overcome the above problems, this paper investigates flexible organization and explicit correlation learning for multiple views. In particular, we propose to incorporate different views of a 3D shape into a permutation-invariant set, referred to as the View Set, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views. Based on that, we devise a nimble Transformer model, named VSFormer, to explicitly capture pairwise and higher-order correlations of all elements in the set. Meanwhile, we theoretically reveal a natural correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism, which supports our model design. Comprehensive experiments suggest that VSFormer has better flexibility, higher inference efficiency, and superior performance. Notably, VSFormer reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD. It also establishes new records on the SHREC'17 retrieval benchmark. The code and datasets are available at this https URL.
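
The view-set idea can be approximated with a Transformer encoder that uses no positional encodings, so the per-view tokens form an unordered set; the sketch below (sizes and the mean-pooling readout are placeholder choices, not the paper's configuration) shows that structure on top of pre-computed view features.

```python
import torch
import torch.nn as nn

class ViewSetEncoder(nn.Module):
    """Treats the multi-view features of one shape as an unordered set.

    A Transformer encoder without positional encodings is permutation-invariant
    up to the final pooling, so the views can be organised flexibly; self-attention
    captures pairwise correlations between all views in the set.
    """
    def __init__(self, dim=256, heads=4, layers=2, num_classes=40):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, view_feats):             # (batch, num_views, dim), any number of views
        tokens = self.encoder(view_feats)      # inter-view correlations via self-attention
        return self.head(tokens.mean(dim=1))   # order-invariant pooled prediction

model = ViewSetEncoder()
logits = model(torch.randn(2, 12, 256))        # e.g. 12 views per shape
```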

[CV-146] Robust Training of Neural Networks at Arbitrary Precision and Sparsity

链接: https://arxiv.org/abs/2409.09245
作者: Chengxi Ye,Grace Chu,Yanfeng Liu,Yichi Zhang,Lukasz Lew,Andrew Howard
关键词-EN: sparsification introduce obstacles, discontinuous operations inherent, obstacles to backpropagation, discontinuous operations, introduce obstacles
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.

[CV-147] Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery

链接: https://arxiv.org/abs/2409.09244
作者: Wei Liu,Saurabh Prasad,Melba Crawford
关键词-EN: remotely sensed data, vision Transformer architecture, vision Transformers, HSI classification, Transformers out-performing CNN
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:In the past three years, there has been significant interest in hyperspectral imagery (HSI) classification using vision Transformers for analysis of remotely sensed data. Previous research predominantly focused on the empirical integration of convolutional neural networks (CNNs) to augment the network’s capability to extract local feature information. Yet, the theoretical justification for vision Transformers out-performing CNN architectures in HSI classification remains a question. To address this issue, a unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated. In this streamlined yet effective vision Transformer architecture, multiple mixer modules are strategically integrated separately. These include the CNN-mixer, which executes convolution operations; the spatial self-attention (SSA)-mixer and channel self-attention (CSA)-mixer, both of which are adaptations of classical self-attention blocks; and hybrid models such as the SSA+CNN-mixer and CSA+CNN-mixer, which merge convolution with self-attention operations. This integration facilitates the development of a broad spectrum of vision Transformer-based models tailored for HSI classification. In terms of the training process, a comprehensive analysis is performed, contrasting classical CNN models and vision Transformer-based counterparts, with particular attention to disturbance robustness and the distribution of the largest eigenvalue of the Hessian. From the evaluations conducted on various mixer models rooted in the unified architecture, it is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture, rather than being exclusively reliant on individual multi-head self-attention (MSA) components.

[CV-148] Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?

链接: https://arxiv.org/abs/2409.09221
作者: Yiwen Guan,Viet Anh Trinh,Vivek Voleti,Jacob Whitehill
关键词-EN: Decoder-only discrete-token language, discrete-token language models, recently achieved significant, achieved significant success, Decoder-only discrete-token
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Decoder-only discrete-token language models have recently achieved significant success in automatic speech recognition. However, systematic analyses of how different modalities impact performance in specific scenarios remain limited. In this paper, we investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Our experiments suggest that: (1) Integrating more modalities can increase accuracy; in particular, our paper is, to our best knowledge, the first to show the benefit of combining audio, image context, and lip information; (2) Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels, moreover, they exhibit a different trend compared to inherently synchronized modalities like lip movements; (3) Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.

[CV-149] Are Sparse Neural Networks Better Hard Sample Learners? BMVC2024

链接: https://arxiv.org/abs/2409.09196
作者: Qiao Xiao,Boqian Wu,Lu Yin,Christopher Neil Gadzinski,Tianjin Huang,Mykola Pechenizkiy,Decebal Constantin Mocanu
关键词-EN: demonstrated impressive progress, Sparse Neural Networks, deep neural networks, impressive progress, noisy and intricate
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at British Machine Vision Conference (BMVC 2024)

点击查看摘要

Abstract:While deep learning has demonstrated impressive progress, it remains a daunting challenge to learn from hard samples as these samples are usually noisy and intricate. These hard samples play a crucial role in the optimal performance of deep neural networks. Most research on Sparse Neural Networks (SNNs) has focused on standard training data, leaving gaps in understanding their effectiveness on complex and challenging data. This paper's extensive investigation across scenarios reveals that most SNNs trained on challenging samples can often match or surpass dense models in accuracy at certain sparsity levels, especially with limited data. We observe that layer-wise density ratios tend to play an important role in SNN performance, particularly for methods that train from scratch without pre-trained initialization. These insights enhance our understanding of SNNs' behavior and potential for efficient learning approaches in data-centric AI. Our code is publicly available at: this https URL.

[CV-150] Hierarchical Hypercomplex Network for Multimodal Emotion Recognition

链接: https://arxiv.org/abs/2409.09194
作者: Eleonora Lopez,Aurelio Uncini,Danilo Comminiello
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper has been accepted at MLSP 2024

点击查看摘要

[CV-151] Transformer with Controlled Attention for Synchronous Motion Captioning

链接: https://arxiv.org/abs/2409.09177
作者: Karim Radouane,Sylvie Ranwez,Julien Lagarde,Andon Tchechmedjiev
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-152] Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss ECCV2024

链接: https://arxiv.org/abs/2409.09149
作者: Qifan Fu,Xiaohang Yang,Muhammad Asad,Changjae Oh,Shanxin Yuan,Gregory Slabaugh
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted by the ECCV 2024 HANDS workshop

点击查看摘要

[CV-153] PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage

链接: https://arxiv.org/abs/2409.09144
作者: Denis Zavadski,Damjan Kalšan,Carsten Rother
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-154] Trimming the Risk: Towards Reliable Continuous Training for Deep Learning Inspection Systems

链接: https://arxiv.org/abs/2409.09108
作者: Altaf Allah Abbassi,Houssem Ben Braiek,Foutse Khomh,Thomas Reid
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
*备注:

点击查看摘要

[CV-155] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

链接: https://arxiv.org/abs/2409.09086
作者: Zhenyu Ning,Jieru Zhao,Qihao Jin,Wenchao Ding,Minyi Guo
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注:

点击查看摘要

[CV-156] HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning

链接: https://arxiv.org/abs/2409.09085
作者: Tianyi Chen,Xiaoyi Qu,David Aponte,Colby Banbury,Jongwoo Ko,Tianyu Ding,Yong Ma,Vladimir Lyapunov,Ilya Zharkov,Luming Liang
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: preprint

点击查看摘要

[CV-157] AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding

链接: https://arxiv.org/abs/2409.09039
作者: Zihan Huang,Tao Wu,Wang Lin,Shengyu Zhang,Jingyuan Chen,Fei Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-158] AdR-Gaussian: Accelerating Gaussian Splatting with Adaptive Radius SIGGRAPH

链接: https://arxiv.org/abs/2409.08669
作者: Xinzhe Wang,Ran Yi,Lizhuang Ma
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24), December 03-06, 2024, Tokyo, Japan

点击查看摘要

[CV-159] Evaluating Modern Approaches in 3D Scene Reconstruction: NeRF vs Gaussian-Based Methods

链接: https://arxiv.org/abs/2408.04268
作者: Yiming Zhou,Zixuan Zeng,Andi Chen,Xiaofan Zhou,Haowei Ni,Shiyao Zhang,Panfeng Li,Liangxi Liu,Mengyao Zheng,Xupeng Chen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

[CV-160] Regional Style and Color Transfer

链接: https://arxiv.org/abs/2404.13880
作者: Zhicheng Ding,Panfeng Li,Qikai Yang,Siyang Li,Qingtian Gong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Computer Vision, Image and Deep Learning

点击查看摘要

[CV-161] Exploring Diverse Methods in Visual Question Answering

链接: https://arxiv.org/abs/2404.13565
作者: Panfeng Li,Qikai Yang,Xieming Geng,Wenjing Zhou,Zhicheng Ding,Yi Nian
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[CV-162] Confidence Trigger Detection: Accelerating Real-time Tracking-by-detection Systems

链接: https://arxiv.org/abs/1902.00615
作者: Zhicheng Ding,Zhixin Lai,Siyang Li,Panfeng Li,Qikai Yang,Edward Wong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[CV-163] Contextual Hourglass Network for Semantic Segmentation of High Resolution Aerial Imagery

链接: https://arxiv.org/abs/1810.12813
作者: Panfeng Li,Youzuo Lin,Emily Schultz-Fellenz
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[CV-164] VAE-QWGAN: Improving Quantum GANs for High Resolution Image Generation

链接: https://arxiv.org/abs/2409.10339
作者: Aaron Mark Thomas,Sharu Theresa Jose
关键词-EN:
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 8 figures

点击查看摘要

[CV-165] SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds

链接: https://arxiv.org/abs/2409.10293
作者: Xiaolong Mao,Hui Yuan,Tian Guo,Shiqi Jiang,Raouf Hamzaoui,Sam Kwong
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 136 pages, 13 figures

点击查看摘要

[CV-166] Self-Updating Vehicle Monitoring Framework Employing Distributed Acoustic Sensing towards Real-World Settings

链接: https://arxiv.org/abs/2409.10259
作者: Xi Wang,Xin Liu,Songming Zhu,Zhanwen Li,Lina Gao
关键词-EN:
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

[CV-167] FGR-Net: Interpretable fundus image gradeability classification based on deep reconstruction learning

链接: https://arxiv.org/abs/2409.10246
作者: Saif Khalid,Hatem A. Rashwan,Saddam Abdulwahab,Mohamed Abdel-Nasser,Facundo Manuel Quiroga,Domenec Puig
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-168] Data-Centric Strategies for Overcoming PET/CT Heterogeneity: Insights from the AutoPET III Lesion Segmentation Challenge

链接: https://arxiv.org/abs/2409.10120
作者: Balint Kovacs,Shuhan Xiao,Maximilian Rokuss,Constantin Ulrich,Fabian Isensee,Klaus H. Maier-Hein
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Contribution to the data-centric task of the autoPET III Challenge 2024

点击查看摘要

[CV-169] Cross-modality image synthesis from TOF-MRA to CTA using diffusion-based models

链接: https://arxiv.org/abs/2409.10089
作者: Alexander Koch,Orhun Utku Aydin,Adam Hilbert,Jana Rieger,Satoru Tanioka,Fujimaro Ishida,Dietmar Frey
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-170] Domain and Content Adaptive Convolutions for Cross-Domain Adenocarcinoma Segmentation

链接: https://arxiv.org/abs/2409.09797
作者: Frauke Wilm,Mathias Öttl,Marc Aubreville,Katharina Breininger
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 1 figure, 1 table

点击查看摘要

[CV-171] Universal Topology Refinement for Medical Image Segmentation with Polynomial Feature Synthesis MICCAI2024

链接: https://arxiv.org/abs/2409.09796
作者: Liu Li,Hanchun Wang,Matthew Baugh,Qiang Ma,Weitong Zhang,Cheng Ouyang,Daniel Rueckert,Bernhard Kainz
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024)

点击查看摘要

[CV-172] Learning Two-factor Representation for Magnetic Resonance Image Super-resolution

链接: https://arxiv.org/abs/2409.09731
作者: Weifeng Wei,Heng Chen,Pengxiang Su
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-173] Reliable Multi-View Learning with Conformal Prediction for Aortic Stenosis Classification in Echocardiography MICCAI

链接: https://arxiv.org/abs/2409.09680
作者: Ang Nan Gu,Michael Tsang,Hooman Vaseli,Teresa Tsang,Purang Abolmaesumi
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer (2024) under the same title

点击查看摘要

[CV-174] MANGO: Disentangled Image Transformation Manifolds with Grouped Operators ICASSP2025

链接: https://arxiv.org/abs/2409.09542
作者: Brighton Ancelin,Yenho Chen,Peimeng Guan,Chiraag Kaushik,Belen Martin-Urcelay,Alex Saad-Falcon,Nakul Singh
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Submitted to IEEE ICASSP 2025. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

[CV-175] Self-Prompting Polyp Segmentation in Colonoscopy using Hybrid Yolo-SAM 2 Model

链接: https://arxiv.org/abs/2409.09484
作者: Mobina Mansoori,Sajjad Shahabodini,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-176] From FDG to PSMA: A Hitchhiker's Guide to Multitracer Multicenter Lesion Segmentation in PET/CT Imaging

链接: https://arxiv.org/abs/2409.09478
作者: Maximilian Rokuss,Balint Kovacs,Yannick Kirchhoff,Shuhan Xiao,Constantin Ulrich,Klaus H. Maier-Hein,Fabian Isensee
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-177] Estimating Neural Orientation Distribution Fields on High Resolution Diffusion MRI Scans MICCAI

链接: https://arxiv.org/abs/2409.09387
作者: Mohammed Munzer Dwedari,William Consagra,Philip Müller,Özgün Turgut,Daniel Rueckert,Yogesh Rathi
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 8 figures, conference: Medical Image Computing and Computer-Assisted Intervention (MICCAI)

点击查看摘要

[CV-178] MotionTTT: 2D Test-Time-Training Motion Estimation for 3D Motion Corrected MRI

链接: https://arxiv.org/abs/2409.09370
作者: Tobit Klug,Kun Wang,Stefan Ruschke,Reinhard Heckel
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-179] Real-Time Stochastic Terrain Mapping and Processing for Autonomous Safe Landing

链接: https://arxiv.org/abs/2409.09309
作者: Kento Tomita,Koki Ho
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-180] Spectral U-Net: Enhancing Medical Image Segmentation via Spectral Decomposition

链接: https://arxiv.org/abs/2409.09216
作者: Yaopeng Peng,Milan Sonka,Danny Z. Chen
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-181] FiAt-Net: Detecting Fibroatheroma Plaque Cap in 3D Intravascular OCT Images

链接: https://arxiv.org/abs/2409.09188
作者: Yaopeng Peng,Zhi Chen,Andreas Wahle,Tomas Kovarnik,Milan Sonka,Danny Z. Chen
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-182] Phikon-v2, A large and public feature extractor for biomarker prediction

链接: https://arxiv.org/abs/2409.09173
作者: Alexandre Filiot,Paul Jacob,Alice Mac Kain,Charlie Saillard
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-183] Deep learning-based classification of breast cancer molecular subtypes from H&E whole-slide images

链接: https://arxiv.org/abs/2409.09053
作者: Masoud Tafavvoghi,Anders Sildnes,Mehrdad Rakaee,Nikita Shvetsov,Lars Ailo Bongo,Lill-Tove Rasmussen Busund,Kajsa Møllersen
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 5 figures (+4 supplementary figures), 4 tables

点击查看摘要

[CV-184] OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography

链接: https://arxiv.org/abs/2409.09052
作者: Youzhu Jin,Yichen Zhang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 1 figure

点击查看摘要

机器学习

[LG-0] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

链接: https://arxiv.org/abs/2409.10516
作者: Di Liu,Meng Chen,Baotong Lu,Huiqiang Jiang,Zhenhua Han,Qianxi Zhang,Qi Chen,Chengruidong Zhang,Bailu Ding,Kai Zhang,Chen Chen,Fan Yang,Yuqing Yang,Lili Qiu
关键词-EN: Transformer-based large Language, large Language Models, Transformer-based large, large Language, Language Models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 16 pages

点击查看摘要

Abstract:Transformer-based large Language Models (LLMs) become increasingly important in various domains. However, the quadratic time complexity of attention operation poses a significant challenge for scaling to longer contexts due to the extremely high inference latency and GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to accelerate attention computation. To leverage the dynamic sparse property of attention, RetrievalAttention builds approximate nearest neighbor search (ANNS) indexes upon KV vectors in CPU memory and retrieves the most relevant ones via vector search during generation. Due to the out-of-distribution (OOD) between query vectors and key vectors, off-the-shelf ANNS indexes still need to scan O(N) (usually 30% of all keys) data for accurate retrieval, which fails to exploit the high sparsity. RetrievalAttention first identifies the OOD challenge of ANNS-based attention, and addresses it via an attention-aware vector search algorithm that can adapt to queries and only access 1–3% of data, thus achieving a sub-linear time complexity. RetrievalAttention greatly reduces the inference cost of long-context LLM with much lower GPU memory requirements while maintaining the model accuracy. Especially, RetrievalAttention only needs 16GB GPU memory for serving 128K tokens in LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds on a single NVIDIA RTX4090 (24GB).
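The core idea above can be made concrete with a small sketch: instead of attending over the entire KV cache, each query retrieves only its top-k most relevant keys and computes attention over that subset. The NumPy sketch below is a minimal illustration under that assumption; it uses an exact top-k search as a stand-in for the paper's attention-aware ANNS index, and the function name and parameters are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def retrieval_attention(q, K, V, k=32):
    """Attend over only the top-k keys most relevant to the query.

    q: (d,) query vector; K: (N, d) cached keys; V: (N, d) cached values.
    A real system would replace the exact dot-product search below with an
    attention-aware ANNS index held in CPU memory, as described in the paper.
    """
    scores = K @ q                                   # exact relevance scores (stand-in for ANNS)
    topk = np.argpartition(-scores, k)[:k]           # indices of the k largest scores
    weights = softmax(scores[topk] / np.sqrt(q.shape[0]))
    return weights @ V[topk]                         # sparse attention output

# toy usage
rng = np.random.default_rng(0)
K, V = rng.normal(size=(10_000, 64)), rng.normal(size=(10_000, 64))
q = rng.normal(size=64)
print(retrieval_attention(q, K, V, k=32).shape)      # (64,)
```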

[LG-1] Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles

链接: https://arxiv.org/abs/2409.10502
作者: Kulin Shah,Nishanth Dikkala,Xin Wang,Rina Panigrahy
关键词-EN: Large Language Models, Causal language modeling, yielded remarkable capabilities, Large Language, Causal language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 26 pages

点击查看摘要

Abstract:Causal language modeling using the Transformer architecture has yielded remarkable capabilities in Large Language Models (LLMs) over the last few years. However, the extent to which fundamental search and reasoning capabilities emerged within LLMs remains a topic of ongoing debate. In this work, we study if causal language modeling can learn a complex task such as solving Sudoku puzzles. To solve a Sudoku, the model is first required to search over all empty cells of the puzzle to decide on a cell to fill and then apply an appropriate strategy to fill the decided cell. Sometimes, the application of a strategy only results in thinning down the possible values in a cell rather than concluding the exact value of the cell. In such cases, multiple strategies are applied one after the other to fill a single cell. We observe that Transformer models trained on this synthetic task can indeed learn to solve Sudokus (our model solves 94.21% of the puzzles fully correctly) when trained on a logical sequence of steps taken by a solver. We find that training Transformers with the logical sequence of steps is necessary and without such training, they fail to learn Sudoku. We also extend our analysis to Zebra puzzles (known as Einstein puzzles) and show that the model solves 92.04 % of the puzzles fully correctly. In addition, we study the internal representations of the trained Transformer and find that through linear probing, we can decode information about the set of possible values in any given cell from them, pointing to the presence of a strong reasoning engine implicit in the Transformer weights.
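The linear-probing step mentioned at the end of the abstract can be illustrated with a minimal sketch: freeze the model, collect hidden states, and fit a plain linear classifier to test whether a property (such as whether a digit is still possible in a cell) is linearly decodable. The data below is synthetic and the setup hypothetical; it only shows the probing recipe, not the paper's experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: hidden_states[i] is the transformer's representation of a
# cell, and labels[i] says whether a particular digit is still possible there.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5000, 256))            # placeholder activations
labels = (hidden_states[:, 0] + 0.1 * rng.normal(size=5000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # linear probe only
print("probe accuracy:", probe.score(X_te, y_te))
```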

[LG-2] Partial Distribution Matching via Partial Wasserstein Adversarial Networks

链接: https://arxiv.org/abs/2409.10499
作者: Zi-Ming Wang,Nan Xue,Ling Lei,Rebecka Jörnsten,Gui-Song Xia
关键词-EN: learning problem seeking, machine learning problem, fundamental machine learning, learning problem, problem seeking
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper studies the problem of distribution matching (DM), which is a fundamental machine learning problem seeking to robustly align two probability distributions. Our approach is established on a relaxed formulation, called partial distribution matching (PDM), which seeks to match a fraction of the distributions instead of matching them completely. We theoretically derive the Kantorovich-Rubinstein duality for the partial Wasserstein-1 (PW) discrepancy, and develop a partial Wasserstein adversarial network (PWAN) that efficiently approximates the PW discrepancy based on this dual form. Partial matching can then be achieved by optimizing the network using gradient descent. Two practical tasks, point set registration and partial domain adaptation, are investigated, where the goals are to partially match distributions in 3D space and high-dimensional feature space respectively. The experiment results confirm that the proposed PWAN effectively produces highly robust matching results, performing better or on par with the state-of-the-art methods.

[LG-3] MusicLIME: Explainable Multimodal Music Understanding

链接: https://arxiv.org/abs/2409.10496
作者: Theodoros Sotirou,Vassilis Lyberatos,Orfeas Menis Mastromichalakis,Giorgos Stamou
关键词-EN: capture the complex, complex interplay, music understanding tasks, multimodal music models, audio and lyrics
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: GitHub repository: this https URL

点击查看摘要

Abstract:Multimodal models are critical for music understanding tasks, as they capture the complex interplay between audio and lyrics. However, as these models become more prevalent, the need for explainability grows: understanding how these systems make decisions is vital for ensuring fairness, reducing bias, and fostering trust. In this paper, we introduce MusicLIME, a model-agnostic feature importance explanation method designed for multimodal music models. Unlike traditional unimodal methods, which analyze each modality separately without considering the interaction between them, often leading to incomplete or misleading explanations, MusicLIME reveals how audio and lyrical features interact and contribute to predictions, providing a holistic view of the model’s decision-making. Additionally, we enhance local explanations by aggregating them into global explanations, giving users a broader perspective of model behavior. Through this work, we contribute to improving the interpretability of multimodal music models, empowering users to make informed choices, and fostering more equitable, fair, and transparent music understanding systems.
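A LIME-style explanation of this kind boils down to perturbing interpretable features, querying the black-box model, and fitting a weighted linear surrogate whose coefficients serve as importances. The sketch below illustrates that recipe on lyric tokens only; `predict_proba`, the masking scheme, and the proximity kernel are assumptions for illustration, not MusicLIME's actual interface, which also covers audio features.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_importance(predict_proba, words, n_samples=500, rng=None):
    """LIME-style importances for the lyric tokens of one input.

    predict_proba(list_of_word_lists) -> (n,) probabilities for the class of
    interest; a multimodal explainer would perturb audio features as well.
    """
    rng = rng or np.random.default_rng(0)
    masks = rng.integers(0, 2, size=(n_samples, len(words)))    # random word masks
    masks[0] = 1                                                # keep the original input
    texts = [[w for w, m in zip(words, row) if m] for row in masks]
    preds = predict_proba(texts)
    weights = np.exp(-np.sum(1 - masks, axis=1) / len(words))   # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return dict(zip(words, surrogate.coef_))                    # per-word importance

# toy black box that "likes" the word 'love'
toy_model = lambda texts: np.array([float('love' in t) for t in texts])
print(lime_importance(toy_model, ["love", "is", "loud"]))
```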

[LG-4] Flash STU: Fast Spectral Transform Units

链接: https://arxiv.org/abs/2409.10489
作者: Y. Isabel Liu,Windsor Nguyen,Yagiz Devre,Evan Dogariu,Anirudha Majumdar,Elad Hazan
关键词-EN: Spectral Transform Unit, open source PyTorch, source PyTorch implementation, Transform Unit, Spectral Transform
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper describes an efficient, open source PyTorch implementation of the Spectral Transform Unit. We investigate sequence prediction tasks over several modalities including language, robotics, and simulated dynamical systems. We find that for the same parameter count, the STU and its variants outperform the Transformer as well as other leading state space models across various modalities.

[LG-5] Kolmogorov-Arnold Networks in Low-Data Regimes: A Comparative Study with Multilayer Perceptrons

链接: https://arxiv.org/abs/2409.10463
作者: Farhad Pourkamali-Anaraki
关键词-EN: Multilayer Perceptrons, model complex relationships, deep learning, complex relationships, cornerstone in deep
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:Multilayer Perceptrons (MLPs) have long been a cornerstone in deep learning, known for their capacity to model complex relationships. Recently, Kolmogorov-Arnold Networks (KANs) have emerged as a compelling alternative, utilizing highly flexible learnable activation functions directly on network edges, a departure from the neuron-centric approach of MLPs. However, KANs significantly increase the number of learnable parameters, raising concerns about their effectiveness in data-scarce environments. This paper presents a comprehensive comparative study of MLPs and KANs from both algorithmic and experimental perspectives, with a focus on low-data regimes. We introduce an effective technique for designing MLPs with unique, parameterized activation functions for each neuron, enabling a more balanced comparison with KANs. Using empirical evaluations on simulated data and two real-world data sets from medicine and engineering, we explore the trade-offs between model complexity and accuracy, with particular attention to the role of network depth. Our findings show that MLPs with individualized activation functions achieve significantly higher predictive accuracy with only a modest increase in parameters, especially when the sample size is limited to around one hundred. For example, in a three-class classification problem within additive manufacturing, MLPs achieve a median accuracy of 0.91, significantly outperforming KANs, which only reach a median accuracy of 0.53 with default hyperparameters. These results offer valuable insights into the impact of activation function selection in neural networks.
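The idea of giving each hidden neuron its own parameterized activation can be sketched in a few lines of PyTorch; here a per-neuron PReLU slope stands in for whatever parameterization the paper actually uses, so this is an illustrative baseline rather than the authors' model.

```python
import torch
import torch.nn as nn

class PerNeuronActMLP(nn.Module):
    """MLP whose hidden neurons each learn their own activation parameter.

    nn.PReLU(num_parameters=width) provides one learnable slope per neuron;
    the paper's parameterization may differ, this is only an illustrative choice.
    """
    def __init__(self, in_dim, width, out_dim, depth=2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.PReLU(num_parameters=width)]
            d = width
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = PerNeuronActMLP(in_dim=10, width=32, out_dim=3)
print(model(torch.randn(4, 10)).shape)   # torch.Size([4, 3])
```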

[LG-6] Signed Graph Autoencoder for Explainable and Polarization-Aware Network Embeddings

链接: https://arxiv.org/abs/2409.10452
作者: Nikolaos Nakis,Chrysoula Kosma,Giannis Nikolentzos,Michalis Chatzianastasis,Iakovos Evdaimon,Michalis Vazirgiannis
关键词-EN: Graph Neural Networks, garnered significant attention, Graph Archetypal Autoencoder, Graph Neural, Neural Networks
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Preprint

点击查看摘要

Abstract:Autoencoders based on Graph Neural Networks (GNNs) have garnered significant attention in recent years for their ability to extract informative latent representations, characterizing the structure of complex topologies, such as graphs. Despite the prevalence of Graph Autoencoders, there has been limited focus on developing and evaluating explainable neural-based graph generative models specifically designed for signed networks. To address this gap, we propose the Signed Graph Archetypal Autoencoder (SGAAE) framework. SGAAE extracts node-level representations that express node memberships over distinct extreme profiles, referred to as archetypes, within the network. This is achieved by projecting the graph onto a learned polytope, which governs its polarization. The framework employs a recently proposed likelihood for analyzing signed networks based on the Skellam distribution, combined with relational archetypal analysis and GNNs. Our experimental evaluation demonstrates the SGAAEs’ capability to successfully infer node memberships over the different underlying latent structures while extracting competing communities formed through the participation of the opposing views in the network. Additionally, we introduce the 2-level network polarization problem and show how SGAAE is able to characterize such a setting. The proposed model achieves high performance in different tasks of signed link prediction across four real-world datasets, outperforming several baseline models.

[LG-7] Structure-preserving learning for multi-symplectic PDEs

链接: https://arxiv.org/abs/2409.10432
作者: Süleyman Yıldız,Pawan Goyal,Peter Benner
关键词-EN: inferring reduced-order models, energy-preserving machine learning, reduced-order Hamiltonian models, energy-preserving reduced-order methods, machine learning method
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper presents an energy-preserving machine learning method for inferring reduced-order models (ROMs) by exploiting the multi-symplectic form of partial differential equations (PDEs). The vast majority of energy-preserving reduced-order methods use symplectic Galerkin projection to construct reduced-order Hamiltonian models by projecting the full models onto a symplectic subspace. However, symplectic projection requires the existence of fully discrete operators, and in many cases, such as black-box PDE solvers, these operators are inaccessible. In this work, we propose an energy-preserving machine learning method that can infer the dynamics of the given PDE using data only, so that the proposed framework does not depend on the fully discrete operators. In this context, the proposed method is non-intrusive. The proposed method is grey box in the sense that it requires only some basic knowledge of the multi-symplectic model at the partial differential equation level. We prove that the proposed method satisfies spatially discrete local energy conservation and preserves the multi-symplectic conservation laws. We test our method on the linear wave equation, the Korteweg-de Vries equation, and the Zakharov-Kuznetsov equation. We test the generalization of our learned models by testing them far outside the training time interval.

[LG-8] TPFL: Tsetlin-Personalized Federated Learning with Confidence-Based Clustering

链接: https://arxiv.org/abs/2409.10392
作者: Rasoul Jafari Gohari,Laya Aliahmadipour,Ezat Valipour
关键词-EN: Federated Learning, Machine Learning, Deep Learning, Tsetlin-Personalized Federated Learning, witnessed rapid
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The world of Machine Learning (ML) has witnessed rapid changes in terms of new models and ways to process users' data. The majority of work that has been done is focused on Deep Learning (DL) based approaches. However, with the emergence of new algorithms such as the Tsetlin Machine (TM) algorithm, there is growing interest in exploring alternative approaches that may offer unique advantages in certain domains or applications. One of these domains is Federated Learning (FL), in which users' privacy is of utmost importance. Due to its novelty, FL has seen a surge in the incorporation of personalization techniques to enhance model accuracy while maintaining user privacy under personalized conditions. In this work, we propose a novel approach dubbed TPFL: Tsetlin-Personalized Federated Learning, in which models are grouped into clusters based on their confidence towards a specific class. In this way, clustering can benefit from two key advantages. Firstly, clients share only what they are confident about, resulting in the elimination of wrongful weight aggregation among clients whose data for a specific class may not have been sufficient during the training. This phenomenon is prevalent when the data are non-Independent and Identically Distributed (non-IID). Secondly, by sharing only weights towards a specific class, communication cost is substantially reduced, making TPFL efficient in terms of both accuracy and communication cost. The results of TPFL demonstrated the highest accuracy on three different datasets, namely MNIST, FashionMNIST and FEMNIST.
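The clustering step described above can be sketched as follows: each client reports the class it is most confident about, clients are grouped by that class, and weight aggregation happens only within a group. The code below is a toy NumPy illustration under those assumptions, with flat weight vectors standing in for real model parameters.

```python
import numpy as np
from collections import defaultdict

def cluster_by_confidence(client_confidences):
    """Group client ids by the class index they are most confident about."""
    clusters = defaultdict(list)
    for cid, conf in client_confidences.items():
        clusters[int(np.argmax(conf))].append(cid)
    return clusters

def aggregate_within_clusters(clusters, client_weights):
    """Average model weights per cluster (FedAvg restricted to each cluster)."""
    return {c: np.mean([client_weights[cid] for cid in members], axis=0)
            for c, members in clusters.items()}

# toy example: 4 clients, 3 classes, flat weight vectors of length 5
confs = {0: [0.9, 0.05, 0.05], 1: [0.8, 0.1, 0.1],
         2: [0.1, 0.2, 0.7],   3: [0.2, 0.1, 0.7]}
weights = {cid: np.random.default_rng(cid).normal(size=5) for cid in confs}
clusters = cluster_by_confidence(confs)
print(dict(clusters))                                 # e.g. {0: [0, 1], 2: [2, 3]}
print(aggregate_within_clusters(clusters, weights))
```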

[LG-9] Revising the Structure of Recurrent Neural Networks to Eliminate Numerical Derivatives in Forming Physics Informed Loss Terms with Respect to Time

链接: https://arxiv.org/abs/2409.10388
作者: Mahyar Jahani-nasab,Mohamad Ali Bijarchi
关键词-EN: Solving unsteady partial, recurrent neural networks, typically requires numerical, requires numerical derivatives, physics informed loss
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving unsteady partial differential equations (PDEs) using recurrent neural networks (RNNs) typically requires numerical derivatives between each block of the RNN to form the physics informed loss function. However, this introduces the complexities of numerical derivatives into the training process of these models. In this study, we propose modifying the structure of the traditional RNN to enable the prediction of each block over a time interval, making it possible to calculate the derivative of the output with respect to time using the backpropagation algorithm. To achieve this, the time intervals of these blocks are overlapped, defining a mutual loss function between them. Additionally, the employment of conditional hidden states enables us to achieve a unique solution for each block. The forget factor is utilized to control the influence of the conditional hidden state on the prediction of the subsequent block. This new model, termed the Mutual Interval RNN (MI-RNN), is applied to solve three different benchmarks: the Burgers equation, unsteady heat conduction in an irregular domain, and the Green vortex problem. Our results demonstrate that MI-RNN can find the exact solution more accurately compared to existing RNN models. For instance, in the second problem, MI-RNN achieved one order of magnitude less relative error compared to the RNN model with numerical derivatives.

[LG-10] Learning Gentle Grasping from Human-Free Force Control Demonstration

链接: https://arxiv.org/abs/2409.10371
作者: Mingxuan Li,Lunwei Zhang,Tiemin Li,Yao Jiang
关键词-EN: gently grasp unfamiliar, unfamiliar objects based, grasp unfamiliar objects, steadily and gently, gently grasp
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Humans can steadily and gently grasp unfamiliar objects based on tactile perception. Robots still face challenges in achieving similar performance due to the difficulty of learning accurate grasp-force predictions and force control strategies that can be generalized from limited data. In this article, we propose an approach for learning grasping from ideal force control demonstrations, to achieve performance similar to that of human hands with a limited data size. Our approach utilizes objects with known contact characteristics to automatically generate reference force curves without human demonstrations. In addition, we design a dual convolutional neural network (Dual-CNN) architecture which incorporates a physics-based mechanics module for learning target grasping force predictions from demonstrations. The described method can be effectively applied in vision-based tactile sensors and enables gentle and stable grasping of objects from the ground. The described prediction model and grasping strategy were validated in offline evaluations and online experiments, and the accuracy and generalizability were demonstrated.

[LG-11] Uncovering the Mechanism of Hepatotoxiciy of PFAS Targeting L-FABP Using GCN and Computational Modeling

链接: https://arxiv.org/abs/2409.10370
作者: Lucas Jividen,Tibo Duran,Xi-Zhi Niu,Jun Bai
关键词-EN: persistent environmental pollutants, PFAS, polyfluoroalkyl substances, bioaccumulation issues, persistent environmental
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 8 pages, 9 figures, submitted to IEEE BIBM 2024

点击查看摘要

Abstract:Per- and polyfluoroalkyl substances (PFAS) are persistent environmental pollutants with known toxicity and bioaccumulation issues. Their widespread industrial use and resistance to degradation have led to global environmental contamination and significant health concerns. While a minority of PFAS have been extensively studied, the toxicity of many PFAS remains poorly understood due to limited direct toxicological data. This study advances the predictive modeling of PFAS toxicity by combining semi-supervised graph convolutional networks (GCNs) with molecular descriptors and fingerprints. We propose a novel approach to enhance the prediction of PFAS binding affinities by isolating molecular fingerprints to construct graphs where then descriptors are set as the node features. This approach specifically captures the structural, physicochemical, and topological features of PFAS without overfitting due to an abundance of features. Unsupervised clustering then identifies representative compounds for detailed binding studies. Our results provide a more accurate ability to estimate PFAS hepatotoxicity to provide guidance in chemical discovery of new PFAS and the development of new safety regulations.

[LG-12] 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?

链接: https://arxiv.org/abs/2409.10357
作者: Téo Guichoux,Laure Soulier,Nicolas Obin,Catherine Pelachaud
关键词-EN: Embodied Conversational Agents, fundamental for communication, synchronous co-speech gestures, Co-speech gestures, Conversational Agents
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. “In-the-wild” datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with speech. Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, it is important to note that the 3D poses estimated from the 2D extracted poses are, in essence, approximations of the ground-truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions - a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D. We perform an objective evaluation using widely used metrics in the gesture generation field as well as a user study to qualitatively evaluate the different approaches.

[LG-13] Hyperedge Modeling in Hypergraph Neural Networks by using Densest Overlapping Subgraphs

链接: https://arxiv.org/abs/2409.10340
作者: Mehrad Soltani,Luis Rueda
关键词-EN: Hypergraph Neural Networks, Graph Neural Networks, traditional Graph Neural, Neural Networks, tackle the limitations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Hypergraphs tackle the limitations of traditional graphs by introducing hyperedges. While graph edges connect only two nodes, hyperedges connect an arbitrary number of nodes along their edges. Also, the underlying message-passing mechanisms in Hypergraph Neural Networks (HGNNs) are in the form of vertex-hyperedge-vertex, which let HGNNs capture and utilize richer and more complex structural information than traditional Graph Neural Networks (GNNs). More recently, the idea of overlapping subgraphs has emerged. These subgraphs can capture more information about subgroups of vertices without limiting one vertex belonging to just one group, allowing vertices to belong to multiple groups or subgraphs. In addition, one of the most important problems in graph clustering is to find the densest overlapping subgraphs (DOS). In this paper, we propose a solution to the DOS problem via the Agglomerative Greedy Enumeration (DOSAGE) algorithm as a novel approach to enhance the process of generating the densest overlapping subgraphs and, hence, a robust construction of the hypergraphs. Experiments on standard benchmarks show that the DOSAGE algorithm significantly outperforms the HGNNs and six other methods on the node classification task.

[LG-14] SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation

链接: https://arxiv.org/abs/2409.10320
作者: Benjamin Stoler,Ingrid Navarro,Jonathan Francis,Jean Oh
关键词-EN: Verification and validation, autonomous driving, systems and components, increasing importance, validation of autonomous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Verification and validation of autonomous driving (AD) systems and components is of increasing importance, as such technology increases in real-world prevalence. Safety-critical scenario generation is a key approach to robustify AD policies through closed-loop training. However, existing approaches for scenario generation rely on simplistic objectives, resulting in overly-aggressive or non-reactive adversarial behaviors. To generate diverse adversarial yet realistic scenarios, we propose SEAL, a scenario perturbation approach which leverages learned scoring functions and adversarial, human-like skills. SEAL-perturbed scenarios are more realistic than SOTA baselines, leading to improved ego task success across real-world, in-distribution, and out-of-distribution scenarios, of more than 20%. To facilitate future research, we release our code and tools: this https URL

[LG-15] How to do impactful research in artificial intelligence for chemistry and materials science

链接: https://arxiv.org/abs/2409.10304
作者: Austin Cheng,Cher Tian Ser,Marta Skreta,Andrés Guzmán-Cordero,Luca Thiede,Andreas Burger,Abdulrahman Aldossary,Shi Xuan Leong,Sergio Pablo-García,Felix Strieth-Kalthoff,Alán Aspuru-Guzik
关键词-EN: pervasively touching, Machine learning, Machine, learning, Abstract
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Machine learning has been pervasively touching many fields of science. Chemistry and materials science are no exception. While machine learning has been making a great impact, it is still not reaching its full potential or maturity. In this perspective, we first outline current applications across a diversity of problems in chemistry. Then, we discuss how machine learning researchers view and approach problems in the field. Finally, we provide our considerations for maximizing impact when researching machine learning for chemistry.

[LG-16] ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework

链接: https://arxiv.org/abs/2409.10289
作者: Jiahao Yuan,Zixiang Di,Zhiqing Cui,Guisong Yang,Usman Naseem
关键词-EN: foster meaningful interactions, Empathetic response generation, meaningful interactions, response generation necessitates, necessitates the integration
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiffu, a lightweight and comprehensive framework for empathetic response generation. This framework incorporates emotion contagion to augment emotional expressiveness and employs an emotion-reasoning mask to pinpoint critical emotional elements. Additionally, it integrates intent mimicry within reinforcement learning for refinement during diffusion. By harnessing an intent twice-reflect mechanism of Exploring-Sampling-Correcting, ReflectDiffu adeptly translates emotional decision-making into precise intent actions, thereby addressing empathetic response misalignments stemming from emotional misrecognition. Through reflection, the framework maps emotional states to intents, markedly enhancing both response empathy and flexibility. Comprehensive experiments reveal that ReflectDiffu outperforms existing models regarding relevance, controllability, and informativeness, achieving state-of-the-art results in both automatic and human evaluations.

[LG-17] Enhancing Image Classification in Small and Unbalanced Datasets through Synthetic Data Augmentation

链接: https://arxiv.org/abs/2409.10286
作者: Neil De La Fuente,Mireia Majó,Irina Luzko,Henry Córdova,Gloria Fernández-Esparrach,Jorge Bernal
关键词-EN: Accurate and robust, present high imbalance, medical image classification, robust medical image, class-specific Variational Autoencoders
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and robust medical image classification is a challenging task, especially in application domains where available annotated datasets are small and present high imbalance between target classes. Considering that data acquisition is not always feasible, especially for underrepresented classes, our approach introduces a novel synthetic augmentation strategy using class-specific Variational Autoencoders (VAEs) and latent space interpolation to improve discrimination capabilities. By generating realistic, varied synthetic data that fills feature space gaps, we address issues of data scarcity and class imbalance. The method presented in this paper relies on the interpolation of latent representations within each class, thus enriching the training set and improving the model’s generalizability and diagnostic accuracy. The proposed strategy was tested in a small dataset of 321 images created to train and validate an automatic method for assessing the quality of cleanliness of esophagogastroduodenoscopy images. By combining real and synthetic data, an increase of over 18% in the accuracy of the most challenging underrepresented class was observed. The proposed strategy not only benefited the underrepresented class but also led to a general improvement in other metrics, including a 6% increase in global accuracy and precision.
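The augmentation step relies on interpolating latent codes within a class and decoding them back into images. The sketch below shows that mechanism with a toy VAE; the `encode`/`decode` interface and the tiny linear model are assumptions made only so the example runs, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Toy stand-in for a class-specific VAE: encode -> (mu, logvar), decode -> x_hat."""
    def __init__(self, in_dim=64, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)
        self.dec = nn.Linear(latent_dim, in_dim)

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu, logvar

    def decode(self, z):
        return torch.sigmoid(self.dec(z))

def interpolate_synthetic(vae, x_a, x_b, n_steps=5):
    """Synthetic samples from linear interpolation of two same-class latent codes."""
    vae.eval()
    with torch.no_grad():
        z_a, _ = vae.encode(x_a.unsqueeze(0))
        z_b, _ = vae.encode(x_b.unsqueeze(0))
        alphas = torch.linspace(0.0, 1.0, n_steps).unsqueeze(1)
        z = (1 - alphas) * z_a + alphas * z_b         # (n_steps, latent_dim)
        return vae.decode(z)                          # images to add to the training set

vae = TinyVAE()
x_a, x_b = torch.rand(64), torch.rand(64)
print(interpolate_synthetic(vae, x_a, x_b).shape)     # torch.Size([5, 64])
```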

[LG-18] BAFNet: Bilateral Attention Fusion Network for Lightweight Semantic Segmentation of Urban Remote Sensing Images

链接: https://arxiv.org/abs/2409.10269
作者: Wentao Wang,Xili Wang
关键词-EN: Large-scale semantic segmentation, limited sample sizes, Large-scale semantic, achieve high performance, achieve high
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale semantic segmentation networks often achieve high performance, while their application can be challenging when faced with limited sample sizes and computational resources. In scenarios with restricted network size and computational complexity, models encounter significant challenges in capturing long-range dependencies and recovering detailed information in images. We propose a lightweight bilateral semantic segmentation network called bilateral attention fusion network (BAFNet) to efficiently segment high-resolution urban remote sensing images. The model consists of two paths, namely dependency path and remote-local path. The dependency path utilizes large kernel attention to acquire long-range dependencies in the image. Besides, multi-scale local attention and efficient remote attention are designed to construct remote-local path. Finally, a feature aggregation module is designed to effectively utilize the different features of the two paths. Our proposed method was tested on public high-resolution urban remote sensing datasets Vaihingen and Potsdam, with mIoU reaching 83.20% and 86.53%, respectively. As a lightweight semantic segmentation model, BAFNet not only outperforms advanced lightweight models in accuracy but also demonstrates comparable performance to non-lightweight state-of-the-art methods on two datasets, despite a tenfold variance in floating-point operations and a fifteenfold difference in network parameters.

[LG-19] Enhancing Personalized Recipe Recommendation Through Multi-Class Classification

链接: https://arxiv.org/abs/2409.10267
作者: Harish Neelam,Koushik Sai Veerella
关键词-EN: diverse culinary preferences, intends to address, address the challenge, realm of diverse, association analysis
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper intends to address the challenge of personalized recipe recommendation in the realm of diverse culinary preferences. The problem domain involves recipe recommendations, utilizing techniques such as association analysis and classification. Association analysis explores the relationships and connections between different ingredients to enhance the user experience. Meanwhile, the classification aspect involves categorizing recipes based on user-defined ingredients and preferences. A unique aspect of the paper is the consideration of recipes and ingredients belonging to multiple classes, recognizing the complexity of culinary combinations. This necessitates a sophisticated approach to classification and recommendation, ensuring the system accommodates the nature of recipe categorization. The paper seeks not only to recommend recipes but also to explore the process involved in achieving accurate and personalized recommendations.

[LG-20] Hierarchical Graph Pooling Based on Minimum Description Length

链接: https://arxiv.org/abs/2409.10263
作者: Jan von Pichowski,Christopher Blöcker,Ingo Scholtes
关键词-EN: graph representation learning, deep graph representation, representation learning, essential part, part of deep
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph pooling is an essential part of deep graph representation learning. We introduce MapEqPool, a principled pooling operator that takes the inherent hierarchical structure of real-world graphs into account. MapEqPool builds on the map equation, an information-theoretic objective function for community detection based on the minimum description length principle which naturally implements Occam’s razor and balances between model complexity and fit. We demonstrate MapEqPool’s competitive performance with an empirical comparison against various baselines across standard graph classification datasets.

[LG-21] Hedging Is Not All You Need: A Simple Baseline for Online Learning Under Haphazard Inputs

链接: https://arxiv.org/abs/2409.10242
作者: Himanshu Buckchash,Momojit Biswas,Rohit Agarwal,Dilip K. Prasad
关键词-EN: Handling haphazard streaming, haphazard streaming data, Handling haphazard, edge devices, haphazard streaming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Handling haphazard streaming data, such as data from edge devices, presents a challenging problem. Over time, the incoming data becomes inconsistent, with missing, faulty, or new inputs reappearing. Therefore, it requires models that are reliable. Recent methods to solve this problem depend on a hedging-based solution and require specialized elements like auxiliary dropouts, forked architectures, and intricate network design. We observed that hedging can be reduced to a special case of weighted residual connection; this motivated us to approximate it with plain self-attention. In this work, we propose HapNet, a simple baseline that is scalable, does not require online backpropagation, and is adaptable to varying input types. All present methods are restricted to scaling with a fixed window; however, we introduce a more complex problem of scaling with a variable window where the data becomes positionally uncorrelated, and cannot be addressed by present methods. We demonstrate that a variant of the proposed approach can work even for this complex scenario. We extensively evaluated the proposed approach on five benchmarks and found competitive performance.

[LG-22] Safety-Oriented Pruning and Interpretation of Reinforcement Learning Policies

链接: https://arxiv.org/abs/2409.10218
作者: Dennis Gross,Helge Spieker
关键词-EN: safe reinforcement learning, risks removing vital, removing vital parameters, Pruning neural networks, reinforcement learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pruning neural networks (NNs) can streamline them but risks removing vital parameters from safe reinforcement learning (RL) policies. We introduce an interpretable RL method called VERINTER, which combines NN pruning with model checking to ensure interpretable RL safety. VERINTER exactly quantifies the effects of pruning and the impact of neural connections on complex safety properties by analyzing changes in safety measurements. This method maintains safety in pruned RL policies and enhances understanding of their safety dynamics, which has proven effective in multiple RL settings.

[LG-23] Embedded Image-to-Image Translation for Efficient Sim-to-Real Transfer in Learning-based Robot-Assisted Soft Manipulation MICRO

链接: https://arxiv.org/abs/2409.10204
作者: Jacinto Colan,Keisuke Sugita,Ana Davila,Yutaro Yamada,Yasuhisa Hasegawa
关键词-EN: accelerating learning complex, Recent advances, shown impressive results, learning complex manipulation, complex manipulation skills
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted at 2024 IEEE International Symposium on Micro-NanoMechatronics and Human Science

点击查看摘要

Abstract:Recent advances in robotic learning in simulation have shown impressive results in accelerating learning complex manipulation skills. However, the sim-to-real gap, caused by discrepancies between simulation and reality, poses significant challenges for the effective deployment of autonomous surgical systems. We propose a novel approach utilizing image translation models to mitigate domain mismatches and facilitate efficient robot skill learning in a simulated environment. Our method involves the use of contrastive unpaired Image-to-image translation, allowing for the acquisition of embedded representations from these transformed images. Subsequently, these embeddings are used to improve the efficiency of training surgical manipulation models. We conducted experiments to evaluate the performance of our approach, demonstrating that it significantly enhances task success rates and reduces the steps required for task completion compared to traditional methods. The results indicate that our proposed system effectively bridges the sim-to-real gap, providing a robust framework for advancing the autonomy of surgical robots in minimally invasive procedures.

[LG-24] Efficient Milling Quality Prediction with Explainable Machine Learning

链接: https://arxiv.org/abs/2409.10203
作者: Dennis Gross,Helge Spieker,Arnaud Gotlieb,Ricardo Knoblauch,Mohamed Elmansori
关键词-EN: predicting surface roughness, approach for predicting, paper presents, predicting surface, explainable machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents an explainable machine learning (ML) approach for predicting surface roughness in milling. Utilizing a dataset from milling aluminum alloy 2017A, the study employs random forest regression models and feature importance techniques. The key contributions include developing ML models that accurately predict various roughness values and identifying redundant sensors, particularly those for measuring normal cutting force. Our experiments show that removing certain sensors can reduce costs without sacrificing predictive accuracy, highlighting the potential of explainable machine learning to improve cost-effectiveness in machining.
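The modeling recipe described here, a random forest regressor plus impurity-based feature importance to flag dispensable sensors, can be reproduced in a few lines of scikit-learn. The feature names and synthetic data below are invented for illustration; only the workflow mirrors the paper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical sensor features; the real study uses milling-process signals.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["feed_rate", "spindle_speed", "normal_force", "vibration"])
y = 0.5 * X["feed_rate"] - 0.3 * X["vibration"] + 0.05 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", rf.score(X_te, y_te))
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")   # low-importance sensors are candidates for removal
```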

[LG-25] Enhancing RL Safety with Counterfactual LLM Reasoning

链接: https://arxiv.org/abs/2409.10188
作者: Dennis Gross,Helge Spieker
关键词-EN: Reinforcement learning, exhibit unsafe behavior, policies may exhibit, exhibit unsafe, unsafe behavior
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves and helps to explain the RL policy safety.

[LG-26] TCDformer-based Momentum Transfer Model for Long-term Sports Prediction

链接: https://arxiv.org/abs/2409.10176
作者: Hui Liu,Jiacheng Gu,Xiyuan Huang,Junjie Shi,Tongtong Feng,Ning He
关键词-EN: scientific competition tactics, developing effective training, effective training strategies, Accurate sports prediction, Accurate sports
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Under reviewing

点击查看摘要

Abstract:Accurate sports prediction is a crucial skill for professional coaches, which can assist in developing effective training strategies and scientific competition tactics. Traditional methods often use complex mathematical statistical techniques to boost predictability, but this often is limited by dataset scale and has difficulty handling long-term predictions with variable distributions, notably underperforming when predicting point-set-game multi-level matches. To deal with this challenge, this paper proposes TM2, a TCDformer-based Momentum Transfer Model for long-term sports prediction, which encompasses a momentum encoding module and a prediction module based on momentum transfer. TM2 initially encodes momentum in large-scale unstructured time series using the local linear scaling approximation (LLSA) module. Then it decomposes the reconstructed time series with momentum transfer into trend and seasonal components. The final prediction results are derived from the additive combination of a multilayer perceptron (MLP) for predicting trend components and wavelet attention mechanisms for seasonal components. Comprehensive experimental results show that on the 2023 Wimbledon men’s tournament datasets, TM2 significantly surpasses existing sports prediction models in terms of performance, reducing MSE by 61.64% and MAE by 63.64%.
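A minimal sketch of the decomposition step is shown below: a moving-average trend is separated from the seasonal residual, after which separate predictors could be fitted to each component. This stand-in uses a plain moving average and does not reproduce the paper's LLSA encoding, momentum transfer, or wavelet attention.

```python
import numpy as np

def decompose(series, window=24):
    """Split a series into a moving-average trend and a seasonal residual.

    A plain moving average stands in for the paper's momentum-aware
    decomposition; it only illustrates the trend/seasonal split.
    """
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")
    seasonal = series - trend
    return trend, seasonal

t = np.arange(500)
series = 0.01 * t + np.sin(2 * np.pi * t / 50) \
         + 0.1 * np.random.default_rng(0).normal(size=500)
trend, seasonal = decompose(series)
print(trend[:3], seasonal[:3])
```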

[LG-27] Safe and Stable Closed-Loop Learning for Neural-Network-Supported Model Predictive Control

链接: https://arxiv.org/abs/2409.10171
作者: Sebastian Hirt,Maik Pfefferkorn,Rolf Findeisen
关键词-EN: policies remains challenging, control policies remains, remains challenging, control policies, optimal control
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, accepted for CDC 2024

点击查看摘要

Abstract:Safe learning of control policies remains challenging, both in optimal control and reinforcement learning. In this article, we consider safe learning of parametrized predictive controllers that operate with incomplete information about the underlying process. To this end, we employ Bayesian optimization for learning the best parameters from closed-loop data. Our method focuses on the system’s overall long-term performance in closed-loop while keeping it safe and stable. Specifically, we parametrize the stage cost function of an MPC using a feedforward neural network. This allows for a high degree of flexibility, enabling the system to achieve a better closed-loop performance with respect to a superordinate measure. However, this flexibility also necessitates safety measures, especially with respect to closed-loop stability. To this end, we explicitly incorporated stability information in the Bayesian-optimization-based learning procedure, thereby achieving rigorous probabilistic safety guarantees. The proposed approach is illustrated using a numeric example.
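Stripped of the MPC and stability machinery, the learning loop is Bayesian optimization of a closed-loop cost over controller parameters. The sketch below shows such a loop with a Gaussian-process surrogate and a lower-confidence-bound acquisition on a one-dimensional toy parameter; the cost function is a stand-in for running the controller in closed loop, and none of the paper's stability constraints are modeled.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def closed_loop_cost(theta):
    """Stand-in for running the parametrized MPC in closed loop and measuring cost."""
    return (theta - 1.3) ** 2

rng = np.random.default_rng(0)
X = list(rng.uniform(0, 3, size=3))              # initial random controller parameters
y = [closed_loop_cost(x) for x in X]
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(15):                              # Bayesian optimization loop
    gp.fit(np.array(X).reshape(-1, 1), y)
    cand = np.linspace(0, 3, 200).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    nxt = float(cand[np.argmin(mu - 2.0 * sigma)])   # lower-confidence-bound acquisition
    X.append(nxt)
    y.append(closed_loop_cost(nxt))

print("best parameter:", X[int(np.argmin(y))], "cost:", min(y))
```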

[LG-28] Quantile Regression for Distributional Reward Models in RLHF

链接: https://arxiv.org/abs/2409.10164
作者: Nicolai Dorka
关键词-EN: aligning large language, large language models, RLHF, aligning large, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at this https URL.
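The distributional reward head is trained with a quantile (pinball) loss, so that each output estimates one quantile of the reward distribution. Below is a minimal PyTorch sketch of that loss under the assumption of a fixed grid of quantile levels; the reward-model architecture and the RLHF pipeline are not shown.

```python
import torch

def pinball_loss(pred_quantiles, target, quantile_levels):
    """Pinball (quantile) loss averaged over quantile heads.

    pred_quantiles:  (batch, n_q) predicted reward quantiles
    target:          (batch,)     scalar reward labels
    quantile_levels: (n_q,)       e.g. torch.linspace(0.05, 0.95, 19)
    """
    diff = target.unsqueeze(1) - pred_quantiles          # (batch, n_q)
    q = quantile_levels.unsqueeze(0)
    return torch.mean(torch.maximum(q * diff, (q - 1) * diff))

levels = torch.linspace(0.05, 0.95, 19)
pred = torch.zeros(8, 19, requires_grad=True)
target = torch.randn(8)
loss = pinball_loss(pred, target, levels)
loss.backward()
print(loss.item())
```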

[LG-29] SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting

链接: https://arxiv.org/abs/2409.10161
作者: Mohammad Nomaan Qureshi,Sparsh Garg,Francisco Yandun,David Held,George Kantor,Abhishesh Silwal
关键词-EN: significant domain shift, RGB images, relying on RGB, manipulation policies relying, remains a critical
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sim2Real transfer, particularly for manipulation policies relying on RGB images, remains a critical challenge in robotics due to the significant domain shift between synthetic and real-world visual data. In this paper, we propose SplatSim, a novel framework that leverages Gaussian Splatting as the primary rendering primitive to reduce the Sim2Real gap for RGB-based manipulation policies. By replacing traditional mesh representations with Gaussian Splats in simulators, SplatSim produces highly photorealistic synthetic data while maintaining the scalability and cost-efficiency of simulation. We demonstrate the effectiveness of our framework by training manipulation policies within SplatSim and deploying them in the real world in a zero-shot manner, achieving an average success rate of 86.25%, compared to 97.5% for policies trained on real-world data.

[LG-30] Efficient Network Embedding by Approximate Equitable Partitions ICDM2024

链接: https://arxiv.org/abs/2409.10160
作者: Giuseppe Squillace,Mirco Tribastone,Max Tschaikowski,Andrea Vandin
关键词-EN: Structural network embedding, enabling effective downstream, effective downstream tasks, similarities among nodes, Structural network
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at ICDM 2024

点击查看摘要

Abstract:Structural network embedding is a crucial step in enabling effective downstream tasks for complex systems that aims to project a network into a lower-dimensional space while preserving similarities among nodes. We introduce a simple and efficient embedding technique based on approximate variants of equitable partitions. The approximation consists in introducing a user-tunable tolerance parameter relaxing the otherwise strict condition for exact equitable partitions that can be hardly found in real-world networks. We exploit a relationship between equitable partitions and equivalence relations for Markov chains and ordinary differential equations to develop a partition refinement algorithm for computing an approximate equitable partition in polynomial time. We compare our method against state-of-the-art embedding techniques on benchmark networks. We report comparable – when not superior – performance for visualization, classification, and regression tasks at a cost between one and three orders of magnitude smaller using a prototype implementation, enabling the embedding of large-scale networks which could not be efficiently handled by most of the competing techniques.

[LG-31] Contrastive Learning for Character Detection in Ancient Greek Papyri

链接: https://arxiv.org/abs/2409.10156
作者: Vedasri Nakka,Andreas Fischer,Rolf Ingold,Lars Vogtlin
关键词-EN: ICDAR dataset, Greek letter recognition, dataset, SimCLR, Greek letter
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This thesis investigates the effectiveness of SimCLR, a contrastive learning technique, in Greek letter recognition, focusing on the impact of various augmentation techniques. We pretrain the SimCLR backbone using the Alpub dataset (pretraining dataset) and fine-tune it on a smaller ICDAR dataset (finetuning dataset) to compare SimCLR’s performance against traditional baseline models, which use cross-entropy and triplet loss functions. Additionally, we explore the role of different data augmentation strategies, essential for the SimCLR training process. Methodologically, we examine three primary approaches: (1) a baseline model using cross-entropy loss, (2) a triplet embedding model with a classification layer, and (3) a SimCLR pretrained model with a classification layer. Initially, we train the baseline, triplet, and SimCLR models using 93 augmentations on ResNet-18 and ResNet-50 networks with the ICDAR dataset. From these, the top four augmentations are selected using a statistical t-test. Pretraining of SimCLR is conducted on the Alpub dataset, followed by fine-tuning on the ICDAR dataset. The triplet loss model undergoes a similar process, being pretrained on the top four augmentations before fine-tuning on ICDAR. Our experiments show that SimCLR does not outperform the baselines in letter recognition tasks. The baseline model with cross-entropy loss demonstrates better performance than both SimCLR and the triplet loss model. This study provides a detailed evaluation of contrastive learning for letter recognition, highlighting SimCLR’s limitations while emphasizing the strengths of traditional supervised learning models in this task. We believe SimCLR’s cropping strategies may cause a semantic shift in the input image, reducing training effectiveness despite the large pretraining dataset. Our code is available at this https URL.
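
For reference, a compact NT-Xent (contrastive) loss in the standard SimCLR formulation, written from the common recipe rather than taken from the thesis' repository; batch size and embedding dimension are arbitrary.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (batch, dim) projections of two augmented views of the same images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, dim), unit-norm rows
    sim = z @ z.t() / temperature                             # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                         # exclude self-similarity
    n = z.shape[0]
    targets = torch.arange(n, device=z.device).roll(n // 2)   # positive pair = the other view
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```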

[LG-32] AALF: Almost Always Linear Forecasting

链接: https://arxiv.org/abs/2409.10142
作者: Matthias Jakobs,Thomas Liebig
关键词-EN: high predictive power, Deep Learning, Deep Learning models, Deep Learning method, Deep Learning approaches
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works for time-series forecasting increasingly leverage the high predictive power of Deep Learning models. With this increase in model complexity, however, comes a lack of understanding of the underlying model decision process, which is problematic for high-stakes decision making. At the same time, simple, interpretable forecasting methods such as Linear Models can still perform very well, sometimes on par with Deep Learning approaches. We argue that simple models are good enough most of the time, and forecasting performance can be improved by choosing a Deep Learning method only for certain predictions, increasing the overall interpretability of the forecasting process. In this context, we propose a novel online model selection framework which uses meta-learning to identify these predictions and only rarely uses a non-interpretable, large model. An extensive empirical study on various real-world datasets shows that our selection methodology outperforms state-of-the-art online model selection methods in most cases. We find that almost always choosing a simple Linear Model for forecasting results in competitive performance, suggesting that the need for opaque black-box models in time-series forecasting is smaller than recent works would suggest.
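
The selection idea can be illustrated with a toy rule: default to an interpretable linear forecaster and fall back to a complex model only when a window looks hard. The residual-based threshold below is an assumption for illustration; the paper learns the selector with meta-learning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forecast_next(window: np.ndarray, complex_model, hardness_threshold: float = 0.5):
    X = np.arange(len(window)).reshape(-1, 1)
    linear = LinearRegression().fit(X, window)
    residual_std = np.std(window - linear.predict(X))
    if residual_std < hardness_threshold:                     # "easy" window: keep the linear model
        return linear.predict([[len(window)]])[0], "linear"
    return complex_model(window), "complex"                   # rare fallback to the black box

pred, used = forecast_next(np.sin(np.arange(24) / 3.0), complex_model=lambda w: w[-1])
print(used, pred)
```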

[LG-33] Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection

链接: https://arxiv.org/abs/2409.10111
作者: Kodjo Mawuena Amekoe,Mustapha Lebbah,Gregoire Jaffre,Hanene Azzag,Zaineb Chelly Dagdia
关键词-EN: typically involve evolving, evolving data streams, involve evolving data, data arrives continuously, production scenarios typically
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
*备注: 20 pages

点击查看摘要

Abstract:Real-world tabular learning production scenarios typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such a setting, most studies in the literature regarding supervised learning favor the use of instance incremental algorithms due to their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is to avoid storing observations in memory, as is commonly done in batch incremental settings. However, the design of instance incremental algorithms often assumes immediate availability of labels, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: “In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?” Unfortunately, this question has not been studied in depth, probably due to the scarcity of real datasets containing delayed information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option when comparing state-of-the-art instance incremental models such as Adaptive Random Forest (ARF) on one side with batch learning models such as XGBoost on the other. Additionally, when considering the interpretability of the learning systems, batch incremental solutions tend to be favored. Code: this https URL

[LG-34] A Comparative Study of Open Source Computer Vision Models for Application on Small Data: The Case of CFRP Tape Laying

链接: https://arxiv.org/abs/2409.10104
作者: Thomas Fraunholz,Dennis Rall,Tim Köhler,Alfons Schuster,Monika Mayer,Lars Larsen
关键词-EN: Artificial Intelligence, automating existing processes, increasing role, materials and techniques, realm of industrial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the realm of industrial manufacturing, Artificial Intelligence (AI) is playing an increasing role, from automating existing processes to aiding in the development of new materials and techniques. However, a significant challenge arises in smaller, experimental processes characterized by limited training data availability, questioning the possibility to train AI models in such small data contexts. In this work, we explore the potential of Transfer Learning to address this challenge, specifically investigating the minimum amount of data required to develop a functional AI model. For this purpose, we consider the use case of quality control of Carbon Fiber Reinforced Polymer (CFRP) tape laying in aerospace manufacturing using optical sensors. We investigate the behavior of different open-source computer vision models with a continuous reduction of the training data. Our results show that the amount of data required to successfully train an AI model can be drastically reduced, and the use of smaller models does not necessarily lead to a loss of performance.
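
A minimal transfer-learning sketch in the spirit of the study: freeze a pretrained backbone and train only a small classification head on a handful of images. The two-class setup, dummy batch, and the choice of ResNet-18 are placeholders (torchvision >= 0.13 assumed for the weights API).

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                        # keep the pretrained features fixed
model.fc = nn.Linear(model.fc.in_features, 2)          # e.g., "tape OK" vs. "defect"

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch standing in for a small inspection dataset.
images, labels = torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 0, 1])
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```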

[LG-35] Robust Reinforcement Learning with Dynamic Distortion Risk Measures

链接: https://arxiv.org/abs/2409.10096
作者: Anthony Coache,Sebastian Jaimungal
关键词-EN: optimal strategy heavily, strategy heavily depends, agent optimal strategy, reinforcement learning, optimal strategy
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM); Risk Management (q-fin.RM); Machine Learning (stat.ML)
*备注: 29 pages, 3 figures

点击查看摘要

Abstract:In a reinforcement learning (RL) setting, the agent’s optimal strategy heavily depends on her risk preferences and the underlying model dynamics of the training environment. These two aspects influence the agent’s ability to make well-informed and time-consistent decisions when facing testing environments. In this work, we devise a framework to solve robust risk-aware RL problems where we simultaneously account for environmental uncertainty and risk with a class of dynamic robust distortion risk measures. Robustness is introduced by considering all models within a Wasserstein ball around a reference model. We estimate such dynamic robust risk measures using neural networks by making use of strictly consistent scoring functions, derive policy gradient formulae using the quantile representation of distortion risk measures, and construct an actor-critic algorithm to solve this class of robust risk-aware RL problems. We demonstrate the performance of our algorithm on a portfolio allocation example.
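
To make the quantile representation concrete, the sketch below evaluates a (static, non-robust) distortion risk measure as the integral of the empirical quantile function against a distortion function g, using the CVaR distortion as one common example; it illustrates only the representation, not the paper's dynamic robust estimator.

```python
import numpy as np

def distortion_risk(samples: np.ndarray, g, n_quantiles: int = 200) -> float:
    """Approximate integral of the empirical quantile function against the distortion g."""
    u = (np.arange(n_quantiles) + 0.5) / n_quantiles
    quantiles = np.quantile(samples, u)
    step = 0.5 / n_quantiles
    weights = g(np.minimum(u + step, 1.0)) - g(np.maximum(u - step, 0.0))
    return float(np.sum(weights * quantiles))

def cvar_distortion(u, alpha: float = 0.9):
    return np.maximum(u - alpha, 0.0) / (1.0 - alpha)   # puts all weight on the upper tail

losses = np.random.default_rng(1).normal(size=10_000)
print("CVaR_0.9 estimate:", distortion_risk(losses, cvar_distortion))   # ~1.75 for N(0, 1)
```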

[LG-36] DDoS: Diffusion Distribution Similarity for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.10094
作者: Kun Fang,Qinghua Tao,Zuopeng Yang,Xiaolin Huang,Jie Yang
关键词-EN: distribution disparities, distribution, perceptual metrics, OoD, training distribution
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-Distribution (OoD) detection determines whether the given samples are from the training distribution of the classifier-under-protection, i.e., the In-Distribution (InD), or from a different OoD. Latest researches introduce diffusion models pre-trained on InD data to advocate OoD detection by transferring an OoD image into a generated one that is close to InD, so that one could capture the distribution disparities between original and generated images to detect OoD data. Existing diffusion-based detectors adopt perceptual metrics on the two images to measure such disparities, but ignore a fundamental fact: Perceptual metrics are devised essentially for human-perceived similarities of low-level image patterns, e.g., textures and colors, and are not advisable in evaluating distribution disparities, since images with different low-level patterns could possibly come from the same distribution. To address this issue, we formulate a diffusion-based detection framework that considers the distribution similarity between a tested image and its generated counterpart via a novel proper similarity metric in the informative feature space and probability space learned by the classifier-under-protection. An anomaly-removal strategy is further presented to enlarge such distribution disparities by removing abnormal OoD information in the feature space to facilitate the detection. Extensive empirical results unveil the insufficiency of perceptual metrics and the effectiveness of our distribution similarity framework with new state-of-the-art detection performance.

[LG-37] A Riemannian Approach to Ground Metric Learning for Optimal Transport

链接: https://arxiv.org/abs/2409.10085
作者: Pratik Jawanpuria,Dai Shi,Bamdev Mishra,Junbin Gao
关键词-EN: signal processing applications, Optimal transport, target data points, theory has attracted, processing applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimal transport (OT) theory has attracted much attention in machine learning and signal processing applications. OT defines a notion of distance between probability distributions of source and target data points. A crucial factor that influences OT-based distances is the ground metric of the embedding space in which the source and target data points lie. In this work, we propose to learn a suitable latent ground metric parameterized by a symmetric positive definite matrix. We use the rich Riemannian geometry of symmetric positive definite matrices to jointly learn the OT distance along with the ground metric. Empirical results illustrate the efficacy of the learned metric in OT-based domain adaptation.
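
A small example of how a symmetric positive definite matrix A changes the OT ground cost, c(x, y) = (x - y)^T A (x - y); here A is fixed by hand purely for illustration, whereas the paper learns it jointly with the transport problem on the SPD manifold. The snippet assumes the POT library (`pip install pot`).

```python
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 2)), rng.normal(loc=1.0, size=(40, 2))
a, b = ot.unif(len(X)), ot.unif(len(Y))                  # uniform source/target weights

A = np.array([[2.0, 0.5], [0.5, 1.0]])                   # hand-picked SPD ground metric
diff = X[:, None, :] - Y[None, :, :]                     # (30, 40, 2) pairwise differences
M = np.einsum("ijk,kl,ijl->ij", diff, A, diff)           # Mahalanobis-style cost matrix

print("OT cost, squared Euclidean metric:", ot.emd2(a, b, ot.dist(X, Y)))
print("OT cost, SPD-parameterized metric:", ot.emd2(a, b, M))
```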

[LG-38] Steinmetz Neural Networks for Complex-Valued Data

链接: https://arxiv.org/abs/2409.10075
作者: Shyam Venkatasubramanian,Ali Pezeshki,Vahid Tarokh
关键词-EN: Steinmetz Neural Networks, processing complex-valued data, parallel real-valued subnetworks, Analytic Neural Network, Steinmetz Neural
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In this work, we introduce a new approach to processing complex-valued data using DNNs consisting of parallel real-valued subnetworks with coupled outputs. Our proposed class of architectures, referred to as Steinmetz Neural Networks, leverages multi-view learning to construct more interpretable representations within the latent space. Subsequently, we present the Analytic Neural Network, which implements a consistency penalty that encourages analytic signal representations in the Steinmetz neural network’s latent space. This penalty enforces a deterministic and orthogonal relationship between the real and imaginary components. Utilizing an information-theoretic construction, we demonstrate that the upper bound on the generalization error posited by the analytic neural network is lower than that of the general class of Steinmetz neural networks. Our numerical experiments demonstrate the improved performance and robustness to additive noise, afforded by our proposed networks on benchmark datasets and synthetic examples.
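
A bare-bones "two parallel real-valued subnetworks with coupled outputs" module for complex-valued inputs, in the spirit of the description above; the layer sizes and the simple concatenation-based coupling are assumptions, and the analytic-signal consistency penalty is omitted.

```python
import torch
import torch.nn as nn

class ParallelRealSubnets(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.real_net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
        self.imag_net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
        self.head = nn.Linear(2 * out_dim, out_dim)   # couples the two real-valued branches

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        """z: complex-valued tensor of shape (batch, in_dim)."""
        r, i = self.real_net(z.real), self.imag_net(z.imag)
        return self.head(torch.cat([r, i], dim=-1))

x = torch.randn(5, 8, dtype=torch.cfloat)
print(ParallelRealSubnets(8, 32, 4)(x).shape)   # torch.Size([5, 4])
```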

[LG-39] Enhancing Anomaly Detection via Generating Diversified and Hard-to-distinguish Synthetic Anomalies CIKM2024

链接: https://arxiv.org/abs/2409.10069
作者: Hyuntae Kim,Changhee Lee
关键词-EN: identify unseen anomalies, Unsupervised anomaly detection, Unsupervised anomaly, daunting task, normal samples
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at CIKM 2024

点击查看摘要

Abstract:Unsupervised anomaly detection is a daunting task, as it relies solely on normality patterns from the training data to identify unseen anomalies during testing. Recent approaches have focused on leveraging domain-specific transformations or perturbations to generate synthetic anomalies from normal samples. The objective here is to acquire insights into normality patterns by learning to differentiate between normal samples and these crafted anomalies. However, these approaches often encounter limitations when domain-specific transformations are not well-specified such as in tabular data, or when it becomes trivial to distinguish between them. To address these issues, we introduce a novel domain-agnostic method that employs a set of conditional perturbators and a discriminator. The perturbators are trained to generate input-dependent perturbations, which are subsequently utilized to construct synthetic anomalies, and the discriminator is trained to distinguish normal samples from them. We ensure that the generated anomalies are both diverse and hard to distinguish through two key strategies: i) directing perturbations to be orthogonal to each other and ii) constraining perturbations to remain in proximity to normal samples. Throughout experiments on real-world datasets, we demonstrate the superiority of our method over state-of-the-art benchmarks, which is evident not only in image data but also in tabular data, where domain-specific transformation is not readily accessible. Additionally, we empirically confirm the adaptability of our method to semi-supervised settings, demonstrating its capacity to incorporate supervised signals to enhance anomaly detection performance even further.

[LG-40] Spatiotemporal Covariance Neural Networks KDD ECML

链接: https://arxiv.org/abs/2409.10068
作者: Andrea Cavallo,Mohammad Sabbaqi,Elvin Isufi
关键词-EN: Modeling spatiotemporal interactions, Modeling spatiotemporal, unknown structure, multivariate time series, coVariance Neural Network
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD) 2024

点击查看摘要

Abstract:Modeling spatiotemporal interactions in multivariate time series is key to their effective processing, but challenging because of their irregular and often unknown structure. Statistical properties of the data provide useful biases to model interdependencies and are leveraged by correlation and covariance-based networks as well as by processing pipelines relying on principal component analysis (PCA). However, PCA and its temporal extensions suffer instabilities in the covariance eigenvectors when the corresponding eigenvalues are close to each other, making their application to dynamic and streaming data settings challenging. To address these issues, we exploit the analogy between PCA and graph convolutional filters to introduce the SpatioTemporal coVariance Neural Network (STVNN), a relational learning model that operates on the sample covariance matrix of the time series and leverages joint spatiotemporal convolutions to model the data. To account for the streaming and non-stationary setting, we consider an online update of the parameters and sample covariance matrix. We prove the STVNN is stable to the uncertainties introduced by these online estimations, thus improving over temporal PCA-based methods. Experimental results corroborate our theoretical findings and show that STVNN is competitive for multivariate time series processing, it adapts to changes in the data distribution, and it is orders of magnitude more stable than online temporal PCA.
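
The core operation of a covariance neural network layer is a polynomial graph filter in which the sample covariance plays the role of the graph shift operator, roughly z = sum_k h_k C^k x. A minimal NumPy sketch follows; the filter order and coefficients are arbitrary, and the online covariance updates of STVNN are not included.

```python
import numpy as np

def covariance_filter(X: np.ndarray, x: np.ndarray, coeffs) -> np.ndarray:
    """X: (samples, features) data for estimating C; x: (features,) signal to filter."""
    C = np.cov(X, rowvar=False)
    out, Ck_x = np.zeros_like(x), x.copy()
    for h_k in coeffs:
        out += h_k * Ck_x
        Ck_x = C @ Ck_x            # apply the next power of C to the signal
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
print(covariance_filter(X, X[0], coeffs=[1.0, 0.5, 0.25]))
```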

[LG-41] Global Lightning-Ignited Wildfires Prediction and Climate Change Projections based on Explainable Machine Learning Models

链接: https://arxiv.org/abs/2409.10046
作者: Assaf Shmuel,Teddy Lazebnik,Oren Glickman,Eyal Heifetz,Colin Price
关键词-EN: lightning-ignited wildfires, significant natural disaster, natural disaster risk, Wildfires, natural disaster
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Wildfires pose a significant natural disaster risk to populations and contribute to accelerated climate change. As wildfires are also affected by climate change, extreme wildfires are becoming increasingly frequent. Although they occur less frequently globally than those sparked by human activities, lightning-ignited wildfires play a substantial role in carbon emissions and account for the majority of burned areas in certain regions. While existing computational models, especially those based on machine learning, aim to predict lightning-ignited wildfires, they are typically tailored to specific regions with unique characteristics, limiting their global applicability. In this study, we present machine learning models designed to characterize and predict lightning-ignited wildfires on a global scale. Our approach involves classifying lightning-ignited versus anthropogenic wildfires, and estimating with high accuracy the probability of lightning to ignite a fire based on a wide spectrum of factors such as meteorological conditions and vegetation. Utilizing these models, we analyze seasonal and spatial trends in lightning-ignited wildfires shedding light on the impact of climate change on this phenomenon. We analyze the influence of various features on the models using eXplainable Artificial Intelligence (XAI) frameworks. Our findings highlight significant global differences between anthropogenic and lightning-ignited wildfires. Moreover, we demonstrate that, even over a short time span of less than a decade, climate changes have steadily increased the global risk of lightning-ignited wildfires. This distinction underscores the imperative need for dedicated predictive models and fire weather indices tailored specifically to each type of wildfire.

[LG-42] Learning Latent Wireless Dynamics from Channel State Information

链接: https://arxiv.org/abs/2409.10045
作者: Charbel Bou Chaaya,Abanoub M. Girgis,Mehdi Bennis
关键词-EN: data-driven machine learning, wireless propagation environment, machine learning, data-driven machine, propagation environment
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this work, we propose a novel data-driven machine learning (ML) technique to model and predict the dynamics of the wireless propagation environment in latent space. Leveraging the idea of channel charting, which learns compressed representations of high-dimensional channel state information (CSI), we incorporate a predictive component to capture the dynamics of the wireless system. Hence, we jointly learn a channel encoder that maps the estimated CSI to an appropriate latent space, and a predictor that models the relationships between such representations. Accordingly, our problem boils down to training a joint-embedding predictive architecture (JEPA) that simulates the latent dynamics of a wireless network from CSI. We present numerical evaluations on measured data and show that the proposed JEPA displays a two-fold increase in accuracy over benchmarks, for longer look-ahead prediction tasks.

[LG-43] Benchmarking Large Language Model Uncertainty for Prompt Optimization

链接: https://arxiv.org/abs/2409.10044
作者: Pei-Fu Guo,Yun-Da Tsai,Shou-De Lin
关键词-EN: Large Language Models, Large Language, effective uncertainty estimation, lack effective uncertainty, algorithms for Large
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models like GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, rather than Correctness Uncertainty, highlighting the need for improved metrics that are optimization-objective-aware to better guide prompt optimization. Our code and dataset are available at this https URL.

[LG-44] On the Diagram of Thought

链接: https://arxiv.org/abs/2409.10038
作者: Yifan Zhang,Yang Yuan,Andrew Chi-Chih Yao
关键词-EN: directed acyclic graph, acyclic graph, cohesive DAG structure, Diagram of Thought, directed acyclic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Diagram of Thought (DoT), a framework that models iterative reasoning in large language models (LLMs) as the construction of a directed acyclic graph (DAG) within a single model. Unlike traditional approaches that represent reasoning as linear chains or trees, DoT organizes propositions, critiques, refinements, and verifications into a cohesive DAG structure, allowing the model to explore complex reasoning pathways while maintaining logical consistency. Each node in the diagram corresponds to a proposition that has been proposed, critiqued, refined, or verified, enabling the LLM to iteratively improve its reasoning through natural language feedback. By leveraging auto-regressive next-token prediction with role-specific tokens, DoT facilitates seamless transitions between proposing ideas and critically evaluating them, providing richer feedback than binary signals. Furthermore, we formalize the DoT framework using Topos Theory, providing a mathematical foundation that ensures logical consistency and soundness in the reasoning process. This approach enhances both the training and inference processes within a single LLM, eliminating the need for multiple models or external control mechanisms. DoT offers a conceptual framework for designing next-generation reasoning-specialized models, emphasizing training efficiency, robust reasoning capabilities, and theoretical grounding. The code is available at this https URL.

[LG-45] FreeMark: A Non-Invasive White-Box Watermarking for Deep Neural Networks

链接: https://arxiv.org/abs/2409.09996
作者: Yuzhang Chen,Jiangnan Zhu,Yujie Gu,Minoru Kuribayashi,Kouichi Sakurai
关键词-EN: Deep neural networks, achieved significant success, Deep neural, neural networks, real-world applications
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have achieved significant success in real-world applications. However, safeguarding their intellectual property (IP) remains extremely challenging. Existing DNN watermarking methods for IP protection often require modifying DNN models, which reduces model performance and limits their practicality. This paper introduces FreeMark, a novel DNN watermarking framework that leverages cryptographic principles without altering the original host DNN model, thereby avoiding any reduction in model performance. Unlike traditional DNN watermarking methods, FreeMark innovatively generates secret keys from a pre-generated watermark vector and the host model using gradient descent. These secret keys, used to extract the watermark from the model’s activation values, are securely stored with a trusted third party, enabling reliable watermark extraction from suspect models. Extensive experiments demonstrate that FreeMark effectively resists various watermark removal attacks while maintaining high watermark capacity.

[LG-46] SHIRE: Enhancing Sample Efficiency using Human Intuition in REinforcement Learning

链接: https://arxiv.org/abs/2409.09990
作者: Amogh Joshi,Adarsh Kumar Kosta,Kaushik Roy
关键词-EN: optical flow estimation, perform robotic perception, Deep Reinforcement Learning, automatic control, flow estimation
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The ability of neural networks to perform robotic perception and control tasks such as depth and optical flow estimation, simultaneous localization and mapping (SLAM), and automatic control has led to their widespread adoption in recent years. Deep Reinforcement Learning has been used extensively in these settings, as it does not have the unsustainable training costs associated with supervised learning. However, DeepRL suffers from poor sample efficiency, i.e., it requires a large number of environmental interactions to converge to an acceptable solution. Modern RL algorithms such as Deep Q Learning and Soft Actor-Critic attempt to remedy this shortcoming but can not provide the explainability required in applications such as autonomous robotics. Humans intuitively understand the long-time-horizon sequential tasks common in robotics. Properly using such intuition can make RL policies more explainable while enhancing their sample efficiency. In this work, we propose SHIRE, a novel framework for encoding human intuition using Probabilistic Graphical Models (PGMs) and using it in the Deep RL training pipeline to enhance sample efficiency. Our framework achieves 25-78% sample efficiency gains across the environments we evaluate at negligible overhead cost. Additionally, by teaching RL agents the encoded elementary behavior, SHIRE enhances policy explainability. A real-world demonstration further highlights the efficacy of policies trained using our framework.

[LG-47] Convergence of Sharpness-Aware Minimization Algorithms using Increasing Batch Size and Decaying Learning Rate

链接: https://arxiv.org/abs/2409.09984
作者: Hinata Harada,Hideaki Iiduka
关键词-EN: including gap guided, gap guided SAM, deep neural network, neural network models, finding flat local
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The sharpness-aware minimization (SAM) algorithm and its variants, including gap guided SAM (GSAM), have been successful at improving the generalization capability of deep neural network models by finding flat local minima of the empirical loss in training. Meanwhile, it has been shown theoretically and practically that increasing the batch size or decaying the learning rate avoids sharp local minima of the empirical loss. In this paper, we consider the GSAM algorithm with increasing batch sizes or decaying learning rates, such as cosine annealing or linear learning rate, and theoretically show its convergence. Moreover, we numerically compare SAM (GSAM) with and without an increasing batch size and conclude that using an increasing batch size or decaying learning rate finds flatter local minima than using a constant batch size and learning rate.
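
For orientation, a generic sharpness-aware update works in two stages: an ascent step to the worst-case nearby weights, then a descent step from there. The sketch below is the plain SAM recipe with a made-up rho, not the GSAM variant analyzed in the paper; the schedule-related findings would be applied by pairing it with an increasing batch size or a cosine/linear learning-rate decay.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho: float = 0.05):
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)                 # climb toward the sharpest nearby point
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()                # gradient evaluated at the perturbed weights
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)                 # undo the perturbation
    base_opt.step()

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sam_step(model, torch.nn.functional.mse_loss, torch.randn(16, 10), torch.randn(16, 1), opt)
```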

[LG-48] From Bytes to Bites: Using Country Specific Machine Learning Models to Predict Famine

链接: https://arxiv.org/abs/2409.09980
作者: Salloni Kapoor,Simeon Sayer
关键词-EN: issues affecting millions, critical global issues, global issues affecting, affecting millions, Hunger crises
类目: Machine Learning (cs.LG)
*备注: 17 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Hunger crises are critical global issues affecting millions, particularly in low-income and developing countries. This research investigates how machine learning can be utilized to predict and inform decisions regarding famine and hunger crises. By leveraging a diverse set of variables (natural, economic, and conflict-related), three machine learning models (Linear Regression, XGBoost, and RandomForestRegressor) were employed to predict food consumption scores, a key indicator of household nutrition. The RandomForestRegressor emerged as the most accurate model, with an average prediction error of 10.6%, though accuracy varied significantly across countries, ranging from 2% to over 30%. Notably, economic indicators were consistently the most significant predictors of average household nutrition, while no single feature dominated across all regions, underscoring the necessity for comprehensive data collection and tailored, country-specific models. These findings highlight the potential of machine learning, particularly Random Forests, to enhance famine prediction, suggesting that continued research and improved data gathering are essential for more effective global hunger forecasting.

[LG-49] Context-Conditioned Spatio-Temporal Predictive Learning for Reliable V2V Channel Prediction

链接: https://arxiv.org/abs/2409.09978
作者: Lei Chu,Daoud Burghal,Michael Neuman,Andreas F. Molisch
关键词-EN: channel state information, optimizing downstream tasks, Achieving reliable multidimensional, reliable multidimensional, channel state
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Achieving reliable multidimensional Vehicle-to-Vehicle (V2V) channel state information (CSI) prediction is both challenging and crucial for optimizing downstream tasks that depend on instantaneous CSI. This work extends traditional prediction approaches by focusing on four-dimensional (4D) CSI, which includes predictions over time, bandwidth, and antenna (TX and RX) space. Such a comprehensive framework is essential for addressing the dynamic nature of mobility environments within intelligent transportation systems, necessitating the capture of both temporal and spatial dependencies across diverse domains. To address this complexity, we propose a novel context-conditioned spatiotemporal predictive learning method. This method leverages causal convolutional long short-term memory (CA-ConvLSTM) to effectively capture dependencies within 4D CSI data, and incorporates context-conditioned attention mechanisms to enhance the efficiency of spatiotemporal memory updates. Additionally, we introduce an adaptive meta-learning scheme tailored for recurrent networks to mitigate the issue of accumulative prediction errors. We validate the proposed method through empirical studies conducted across three different geometric configurations and mobility scenarios. Our results demonstrate that the proposed approach outperforms existing state-of-the-art predictive models, achieving superior performance across various geometries. Moreover, we show that the meta-learning framework significantly enhances the performance of recurrent-based predictive models in highly challenging cross-geometry settings, thus highlighting its robustness and adaptability.

[LG-50] An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning

链接: https://arxiv.org/abs/2409.09958
作者: Qian Lin,Zongkai Liu,Danying Mo,Chao Yu
关键词-EN: multi-objective reinforcement learning, balance multiple objectives, recent years, significant progress, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, significant progress has been made in multi-objective reinforcement learning (RL) research, which aims to balance multiple objectives by incorporating preferences for each objective. In most existing studies, specific preferences must be provided during deployment to indicate the desired policies explicitly. However, designing these preferences depends heavily on human prior knowledge, which is typically obtained through extensive observation of high-performing demonstrations with expected behaviors. In this work, we propose a simple yet effective offline adaptation framework for multi-objective RL problems without assuming handcrafted target preferences, but only given several demonstrations to implicitly indicate the preferences of expected policies. Additionally, we demonstrate that our framework can naturally be extended to meet constraints on safety-critical objectives by utilizing safe demonstrations, even when the safety thresholds are unknown. Empirical results on offline multi-objective and safe tasks demonstrate the capability of our framework to infer policies that align with real preferences while meeting the constraints implied by the provided demonstrations.

[LG-51] Deep Graph Anomaly Detection: A Survey and New Perspectives

链接: https://arxiv.org/abs/2409.09957
作者: Hezhe Qiao,Hanghang Tong,Bo An,Irwin King,Charu Aggarwal,Guansong Pang
关键词-EN: attracted increasing attention, recent years due, unusual graph instances, identify unusual graph, GAD
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages, 6 figures, and 7 tables

点击查看摘要

Abstract:Graph anomaly detection (GAD), which aims to identify unusual graph instances (nodes, edges, subgraphs, or graphs), has attracted increasing attention in recent years due to its significance in a wide range of applications. Deep learning approaches, graph neural networks (GNNs) in particular, have been emerging as a promising paradigm for GAD, owing to its strong capability in capturing complex structure and/or node attributes in graph data. Considering the large number of methods proposed for GNN-based GAD, it is of paramount importance to summarize the methodologies and findings in the existing GAD studies, so that we can pinpoint effective model designs for tackling open GAD problems. To this end, in this work we aim to present a comprehensive review of deep learning approaches for GAD. Existing GAD surveys are focused on task-specific discussions, making it difficult to understand the technical insights of existing methods and their limitations in addressing some unique challenges in GAD. To fill this gap, we first discuss the problem complexities and their resulting challenges in GAD, and then provide a systematic review of current deep GAD methods from three novel perspectives of methodology, including GNN backbone design, proxy task design for GAD, and graph anomaly measures. To deepen the discussions, we further propose a taxonomy of 13 fine-grained method categories under these three perspectives to provide more in-depth insights into the model designs and their capabilities. To facilitate the experiments and validation, we also summarize a collection of widely-used GAD datasets and empirical comparison. We further discuss multiple open problems to inspire more future high-quality research. A continuously updated repository for datasets, links to the codes of algorithms, and empirical comparison is available at this https URL.

[LG-52] Optimal ablation for interpretability

链接: https://arxiv.org/abs/2409.09951
作者: Maximilian Li,Lucas Janson
关键词-EN: perform relevant computations, machine learning models, identify specific model, specific model components, studies often involve
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Interpretability studies often involve tracing the flow of information through machine learning models to identify specific model components that perform relevant computations for tasks of interest. Prior work quantifies the importance of a model component on a particular task by measuring the impact of performing ablation on that component, or simulating model inference with the component disabled. We propose a new method, optimal ablation (OA), and show that OA-based component importance has theoretical and empirical advantages over measuring importance via other ablation methods. We also show that OA-based component importance can benefit several downstream interpretability tasks, including circuit discovery, localization of factual recall, and latent prediction.

[LG-53] Tracking the spatial dynamics of the synthetic opioid crisis in the USA 2013-2020 using human mobility-based graph neural network

链接: https://arxiv.org/abs/2409.09945
作者: Zhiyue Xia,Kathleen Stewart
关键词-EN: drug-involved overdose mortalities, Center for Disease, Disease Control, Control and Prevention, synthetic opioid-involved deaths
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:Synthetic opioids are the most common drugs involved in drug-involved overdose mortalities in the U.S. The Center for Disease Control and Prevention reported that in 2018, about 70% of all drug overdose deaths involved opioids and 67% of all opioid-involved deaths were accounted for by synthetic opioids. In this study, we investigated the spread of synthetic opioids between 2013 and 2020 in the U.S., and analyzed the relationship between the spatiotemporal pattern of synthetic opioid-involved deaths and another key opioid, heroin, and compared patterns of deaths involving these two types of drugs during this time period. Spatial connections between counties were incorporated into a graph convolutional neural network model to represent and analyze the spread of synthetic opioid-involved deaths, and in the context of heroin-involved deaths.

[LG-54] Fault Analysis And Predictive Maintenance Of Induction Motor Using Machine Learning

链接: https://arxiv.org/abs/2409.09944
作者: Kavana Venkatesh,Neethi M
关键词-EN: crucial electrical equipment, range of applications, induction motor, wide range, induction motor faults
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Presented at ICEECCOT-2018, Published in IEEE Xplore, 6 pages, 3 figures

点击查看摘要

Abstract:Induction motors are one of the most crucial electrical equipment and are extensively used in industries in a wide range of applications. This paper presents a machine learning model for the fault detection and classification of induction motor faults by using three phase voltages and currents as inputs. The aim of this work is to protect vital electrical components and to prevent abnormal event progression through early detection and diagnosis. This work presents a fast forward artificial neural network model to detect some of the commonly occurring electrical faults like overvoltage, under voltage, single phasing, unbalanced voltage, overload, ground fault. A separate model free monitoring system wherein the motor itself acts like a sensor is presented and the only monitored signals are the input given to the motor. Limits for current and voltage values are set for the faulty and healthy conditions, which is done by a classifier. Real time data from a 0.33 HP induction motor is used to train and test the neural network. The model so developed analyses the voltage and current values given at a particular instant and classifies the data into no fault or the specific fault. The model is then interfaced with a real motor to accurately detect and classify the faults so that further necessary action can be taken.

[LG-55] Generalizability of Graph Neural Network Force Fields for Predicting Solid-State Properties

链接: https://arxiv.org/abs/2409.09931
作者: Shaswat Mohanty,Yifan Wang,Wei Cai
关键词-EN: Machine-learned force fields, computationally efficient alternative, Machine-learned force, force fields, promise to offer
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Numerical Analysis (math.NA)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Machine-learned force fields (MLFFs) promise to offer a computationally efficient alternative to ab initio simulations for complex molecular systems. However, ensuring their generalizability beyond training data is crucial for their wide application in studying solid materials. This work investigates the ability of a graph neural network (GNN)-based MLFF, trained on Lennard-Jones Argon, to describe solid-state phenomena not explicitly included during training. We assess the MLFF’s performance in predicting phonon density of states (PDOS) for a perfect face-centered cubic (FCC) crystal structure at both zero and finite temperatures. Additionally, we evaluate vacancy migration rates and energy barriers in an imperfect crystal using direct molecular dynamics (MD) simulations and the string method. Notably, vacancy configurations were absent from the training data. Our results demonstrate the MLFF’s capability to capture essential solid-state properties with good agreement to reference data, even for unseen configurations. We further discuss data engineering strategies to enhance the generalizability of MLFFs. The proposed set of benchmark tests and workflow for evaluating MLFF performance in describing perfect and imperfect crystals pave the way for reliable application of MLFFs in studying complex solid-state materials.

[LG-56] Mining of Switching Sparse Networks for Missing Value Imputation in Multivariate Time Series KDD2024

链接: https://arxiv.org/abs/2409.09930
作者: Kohei Obata,Koki Kawabata,Yasuko Matsubara,Yasushi Sakurai
关键词-EN: Multivariate time series, Multivariate time, hinders the application, time series, network
类目: Machine Learning (cs.LG)
*备注: Accepted by KDD 2024

点击查看摘要

Abstract:Multivariate time series data suffer from the problem of missing values, which hinders the application of many analytical methods. To achieve the accurate imputation of these missing values, exploiting inter-correlation by employing the relationships between sequences (i.e., a network) is as important as the use of temporal dependency, since a sequence normally correlates with other sequences. Moreover, exploiting an adequate network depending on time is also necessary since the network varies over time. However, in real-world scenarios, we normally know neither the network structure nor when the network changes beforehand. Here, we propose a missing value imputation method for multivariate time series, namely MissNet, that is designed to exploit temporal dependency with a state-space model and inter-correlation by switching sparse networks. The network encodes conditional independence between features, which helps us understand the important relationships for imputation visually. Our algorithm, which scales linearly with reference to the length of the data, alternatively infers networks and fills in missing values using the networks while discovering the switching of the networks. Extensive experiments demonstrate that MissNet outperforms the state-of-the-art algorithms for multivariate time series imputation and provides interpretable results.

[LG-57] Multi-Step Embed to Control: A Novel Deep Learning-based Approach for Surrogate Modelling in Reservoir Simulation

链接: https://arxiv.org/abs/2409.09920
作者: Jungang Chen,Eduardo Gildin,John Killough
关键词-EN: fully descriptive models, computational expensive, expensive as opposed, opposed to fully, fully descriptive
类目: Machine Learning (cs.LG)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:Reduced-order models, also known as proxy models or surrogate models, are approximate models that are less computationally expensive than fully descriptive models. With the integration of machine learning, these models have garnered increasing research interest recently. However, many existing reduced-order modeling methods, such as embed to control (E2C) and embed to control and observe (E2CO), fall short in long-term predictions due to the accumulation of prediction errors over time. This issue arises partly from the one-step prediction framework inherent in E2C and E2CO architectures. This paper introduces a deep learning-based surrogate model, referred to as the multi-step embed-to-control model, for the construction of proxy models with improved long-term prediction performance. Unlike E2C and E2CO, the proposed network considers multiple forward transitions in the latent space at a time using the Koopman operator, allowing the model to incorporate a sequence of state snapshots during training phases. Additionally, the loss function of this novel approach has been redesigned to accommodate these multiple transitions and to respect the underlying physical principles. To validate the efficacy of the proposed method, the developed framework was implemented within a two-phase (oil and water) reservoir model under a waterflooding scheme. Comparative analysis demonstrates that the proposed model significantly outperforms the conventional E2C model in long-term simulation scenarios. Notably, there was a substantial reduction in temporal errors in the prediction of saturation profiles and a decent improvement in pressure forecasting accuracy.
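
The "multiple forward transitions in latent space" idea can be sketched with an encoder, a decoder, and a single linear (Koopman-like) latent operator whose prediction loss is summed over several future steps instead of one. Dimensions, depths, and the plain MSE loss are placeholders; the physics-informed terms of the paper are omitted.

```python
import torch
import torch.nn as nn

class MultiStepLatentModel(nn.Module):
    def __init__(self, state_dim: int = 16, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, state_dim))
        self.koopman = nn.Linear(latent_dim, latent_dim, bias=False)  # linear latent dynamics

    def rollout_loss(self, states: torch.Tensor) -> torch.Tensor:
        """states: (batch, horizon, state_dim) sequence of simulator snapshots."""
        z = self.encoder(states[:, 0])
        loss = 0.0
        for t in range(1, states.shape[1]):
            z = self.koopman(z)                   # advance one step in latent space
            loss = loss + nn.functional.mse_loss(self.decoder(z), states[:, t])
        return loss / (states.shape[1] - 1)

model = MultiStepLatentModel()
print(model.rollout_loss(torch.randn(4, 6, 16)))
```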

[LG-58] GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion

链接: https://arxiv.org/abs/2409.09896
作者: Vitor Guizilini,Pavel Tokmakov,Achal Dave,Rares Ambrus
关键词-EN: computer vision, long-standing problem, problem in computer, Abstract, single image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:3D reconstruction from a single image is a long-standing problem in computer vision. Learning-based methods address its inherent scale ambiguity by leveraging increasingly large labeled and unlabeled datasets, to produce geometric priors capable of generating accurate predictions across domains. As a result, state of the art approaches show impressive performance in zero-shot relative and metric depth estimation. Recently, diffusion models have exhibited remarkable scalability and generalizable properties in their learned representations. However, because these models repurpose tools originally designed for image generation, they can only operate on dense ground-truth, which is not available for most depth labels, especially in real-world settings. In this paper we present GRIN, an efficient diffusion model designed to ingest sparse unstructured training data. We use image features with 3D geometric positional encodings to condition the diffusion process both globally and locally, generating depth predictions at a pixel-level. With comprehensive experiments across eight indoor and outdoor datasets, we show that GRIN establishes a new state of the art in zero-shot metric monocular depth estimation even when trained from scratch.

[LG-59] Estimating Wage Disparities Using Foundation Models

链接: https://arxiv.org/abs/2409.09894
作者: Keyon Vafa,Susan Athey,David M. Blei
关键词-EN: social science focuses, decomposing group differences, wage gap, gender wage gap, unexplained components
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:One thread of empirical work in social science focuses on decomposing group differences in outcomes into unexplained components and components explained by observable factors. In this paper, we study gender wage decompositions, which require estimating the portion of the gender wage gap explained by career histories of workers. Classical methods for decomposing the wage gap employ simple predictive models of wages which condition on a small set of simple summaries of labor history. The problem is that these predictive models cannot take advantage of the full complexity of a worker’s history, and the resulting decompositions thus suffer from omitted variable bias (OVB), where covariates that are correlated with both gender and wages are not included in the model. Here we explore an alternative methodology for wage gap decomposition that employs powerful foundation models, such as large language models, as the predictive engine. Foundation models excel at making accurate predictions from complex, high-dimensional inputs. We use a custom-built foundation model, designed to predict wages from full labor histories, to decompose the gender wage gap. We prove that the way such models are usually trained might still lead to OVB, but develop fine-tuning algorithms that empirically mitigate this issue. Our model captures a richer representation of career history than simple models and predicts wages more accurately. In detail, we first provide a novel set of conditions under which an estimator of the wage gap based on a fine-tuned foundation model is $\sqrt{n}$-consistent. Building on the theory, we then propose methods for fine-tuning foundation models that minimize OVB. Using data from the Panel Study of Income Dynamics, we find that history explains more of the gender wage gap than standard econometric models can measure, and we identify elements of history that are important for reducing OVB.
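
For contrast with the foundation-model approach, the classical "explained vs. unexplained" split can be computed with a simple Oaxaca-Blinder style decomposition; the synthetic data and single covariate below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5_000
group = rng.integers(0, 2, n)                        # 0/1 indicator standing in for the two groups
experience = rng.normal(10, 3, n) - 1.5 * group      # observable history differs across groups
wage = 2.0 + 0.3 * experience - 0.5 * group + rng.normal(0, 1, n)

X = experience.reshape(-1, 1)
ref_model = LinearRegression().fit(X[group == 0], wage[group == 0])   # reference-group wage model

raw_gap = wage[group == 0].mean() - wage[group == 1].mean()
explained = ref_model.predict(X[group == 0]).mean() - ref_model.predict(X[group == 1]).mean()
print(f"raw gap {raw_gap:.2f} = explained {explained:.2f} + unexplained {raw_gap - explained:.2f}")
```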

[LG-60] Dynamic Fraud Detection: Integrating Reinforcement Learning into Graph Neural Networks

链接: https://arxiv.org/abs/2409.09892
作者: Yuxin Dong,Jianhua Yao,Jiajing Wang,Yingbin Liang,Shuhan Liao,Minheng Xiao
关键词-EN: obtaining financial benefits, Financial fraud refers, financial fraud activities, act of obtaining, benefits through dishonest
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Financial fraud refers to the act of obtaining financial benefits through dishonest means. Such behavior not only disrupts the order of the financial market but also harms economic and social development and breeds other illegal and criminal activities. With the popularization of the internet and online payment methods, many fraudulent activities and money laundering behaviors in life have shifted from offline to online, posing a great challenge to regulatory authorities. How to efficiently detect these financial fraud activities has become an urgent issue that needs to be resolved. Graph neural networks are a type of deep learning model that can utilize the interactive relationships within graph structures, and they have been widely applied in the field of fraud detection. However, there are still some issues. First, fraudulent activities only account for a very small part of transaction transfers, leading to an inevitable problem of label imbalance in fraud detection. At the same time, fraudsters often disguise their behavior, which can have a negative impact on the final prediction results. In addition, existing research has overlooked the importance of balancing neighbor information and central node information. For example, when the central node has too many neighbors, the features of the central node itself are often neglected. Finally, fraud activities and patterns are constantly changing over time, so considering the dynamic evolution of graph edge relationships is also very important.

[LG-61] Flexible Diffusion Scopes with Parameterized Laplacian for Heterophilic Graph Learning

链接: https://arxiv.org/abs/2409.09888
作者: Qincheng Lu,Jiaqi Zhu,Sitao Luan,Xiao-Wen Chang
关键词-EN: Graph Neural Networks, conventional graph Laplacian, Neural Networks, Graph Neural, Graph Convolutional Networks
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The ability of Graph Neural Networks (GNNs) to capture long-range and global topology information is limited by the scope of conventional graph Laplacian, leading to unsatisfactory performance on some datasets, particularly on heterophilic graphs. To address this limitation, we propose a new class of parameterized Laplacian matrices, which provably offers more flexibility in controlling the diffusion distance between nodes than the conventional graph Laplacian, allowing long-range information to be adaptively captured through diffusion on graph. Specifically, we first prove that the diffusion distance and spectral distance on graph have an order-preserving relationship. With this result, we demonstrate that the parameterized Laplacian can accelerate the diffusion of long-range information, and the parameters in the Laplacian enable flexibility of the diffusion scopes. Based on the theoretical results, we propose topology-guided rewiring mechanism to capture helpful long-range neighborhood information for heterophilic graphs. With this mechanism and the new Laplacian, we propose two GNNs with flexible diffusion scopes: namely the Parameterized Diffusion based Graph Convolutional Networks (PD-GCN) and Graph Attention Networks (PD-GAT). Synthetic experiments reveal the high correlations between the parameters of the new Laplacian and the performance of parameterized GNNs under various graph homophily levels, which verifies that our new proposed GNNs indeed have the ability to adjust the parameters to adaptively capture the global information for different levels of heterophilic graphs. They also outperform the state-of-the-art (SOTA) models on 6 out of 7 real-world benchmark datasets, which further confirms their superiority.

[LG-62] Leiden-Fusion Partitioning Method for Effective Distributed Training of Graph Embeddings ECML-PKDD2024

链接: https://arxiv.org/abs/2409.09887
作者: Yuhe Bai,Camelia Constantin,Hubert Naacke
关键词-EN: handling large networks, effective training frameworks, critical for handling, handling large, effective training
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted at the 2024 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2024)

点击查看摘要

Abstract:In the area of large-scale training of graph embeddings, effective training frameworks and partitioning methods are critical for handling large networks. However, they face two major challenges: 1) existing synchronized distributed frameworks require continuous communication to access information from other machines, and 2) the inability of current partitioning methods to ensure that subgraphs remain connected components without isolated nodes, which is essential for effective training of GNNs since training relies on information aggregation from neighboring nodes. To address these issues, we introduce a novel partitioning method, named Leiden-Fusion, designed for large-scale training of graphs with minimal communication. Our method extends the Leiden community detection algorithm with a greedy algorithm that merges the smallest communities with highly connected neighboring communities. Our method guarantees that, for an initially connected graph, each partition is a densely connected subgraph with no isolated nodes. After obtaining the partitions, we train a GNN for each partition independently, and finally integrate all embeddings for node classification tasks, which significantly reduces the need for network communication and enhances the efficiency of distributed graph training. We demonstrate the effectiveness of our method through extensive evaluations on several benchmark datasets, achieving high efficiency while preserving the quality of the graph embeddings for node classification tasks.
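
The fusion step described above (merge the smallest communities into highly connected neighboring communities) can be sketched in a few lines. The sketch assumes a community assignment is already available, e.g. from the Leiden algorithm; the function name, size threshold, and merging rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the "fusion" step: given an initial community
# assignment, repeatedly merge the smallest community into the neighboring
# community it shares the most edges with, until every community reaches a
# minimum size.
from collections import Counter, defaultdict
import networkx as nx

def fuse_small_communities(G: nx.Graph, labels: dict, min_size: int = 50) -> dict:
    labels = dict(labels)  # node -> community id
    while True:
        sizes = Counter(labels.values())
        cid, size = min(sizes.items(), key=lambda kv: kv[1])
        if size >= min_size or len(sizes) == 1:
            return labels
        members = [n for n, c in labels.items() if c == cid]
        # Count edges from this community to each neighboring community.
        edge_counts = defaultdict(int)
        for n in members:
            for nb in G.neighbors(n):
                if labels[nb] != cid:
                    edge_counts[labels[nb]] += 1
        if not edge_counts:          # isolated community: nothing to merge into
            return labels
        target = max(edge_counts, key=edge_counts.get)
        for n in members:            # absorb the small community
            labels[n] = target
```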

[LG-63] Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank RECSYS2024

链接: https://arxiv.org/abs/2409.09881
作者: Shashank Gupta,Harrie Oosterhuis,Maarten de Rijke
关键词-EN: Counterfactual learning, CLTR, produce sub-optimal models, produce sub-optimal, PRPO
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted at the CONSEQUENCES 2024 workshop, co-located with ACM RecSys 2024

点击查看摘要

Abstract:Counterfactual learning to rank (CLTR) can be risky and, in various circumstances, can produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods, cannot handle trust bias, and relies on specific assumptions about user behavior. We propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit on how much learned models can degrade performance metrics, without relying on any specific user assumptions. Our experiments show that PRPO provides higher performance than the existing safe inverse propensity scoring approach. PRPO always maintains safety, even in maximally adversarial situations. By avoiding assumptions, PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.

[LG-64] Scaling Continuous Kernels with Sparse Fourier Domain Learning

链接: https://arxiv.org/abs/2409.09875
作者: Clayton Harper,Luke Wood,Peter Gerstoft,Eric C. Larson
关键词-EN: parameter efficiency, continuous kernel representations, address three key, key challenges, computational efficiency
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We address three key challenges in learning continuous kernel representations: computational efficiency, parameter efficiency, and spectral bias. Continuous kernels have shown significant potential, but their practical adoption is often limited by high computational and memory demands. Additionally, these methods are prone to spectral bias, which impedes their ability to capture high-frequency details. To overcome these limitations, we propose a novel approach that leverages sparse learning in the Fourier domain. Our method enables the efficient scaling of continuous kernels, drastically reduces computational and memory requirements, and mitigates spectral bias by exploiting the Gibbs phenomenon.

[LG-65] Constructing a Singing Style Caption Dataset

链接: https://arxiv.org/abs/2409.09866
作者: Hyunjong Ok,Jaeho Lee
关键词-EN: Singing voice synthesis, voice generation, synthesis and conversion, conversion have emerged, emerged as significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Preprint

点击查看摘要

Abstract:Singing voice synthesis and conversion have emerged as significant subdomains of voice generation, leading to strong demand for prompt-conditioned generation. Unlike common voice data, generating a singing voice requires an understanding of various associated vocal and musical characteristics, such as the vocal tone of the singer or emotional expressions. However, existing open-source audio-text datasets for voice generation tend to capture only a very limited range of attributes, often missing musical characteristics of the audio. To fill this gap, we introduce S2Cap, an audio-text pair dataset with a diverse set of attributes. S2Cap consists of pairs of textual prompts and music audio samples with a wide range of vocal and musical attributes, including pitch, volume, tempo, mood, singer's gender and age, and musical genre and emotional expression. Utilizing S2Cap, we suggest an effective novel baseline algorithm for singing style captioning. Singing style captioning, which we propose here for the first time, is a task related to voice generation that produces text descriptions of vocal characteristics. First, to mitigate the misalignment between the audio encoder and the text decoder, we present a novel mechanism called CRESCENDO, which utilizes positive-pair similarity learning to synchronize the embedding space of a pretrained audio encoder with that of a text encoder. We additionally supervise the model using the singer's voice, demixed from the accompaniment. This supervision allows the model to more accurately capture vocal characteristics, leading to improved singing style captions that better reflect the style of the singer. The dataset and the codes are available at this https URL.

[LG-66] A Survey of Out-of-distribution Generalization for Graph Machine Learning from a Causal View

链接: https://arxiv.org/abs/2409.09858
作者: Jing Ma
关键词-EN: range of tasks, successfully applied, wide range, GML, Graph machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 2 figures, 1 table

点击查看摘要

Abstract:Graph machine learning (GML) has been successfully applied across a wide range of tasks. Nonetheless, GML faces significant challenges in generalizing over out-of-distribution (OOD) data, which raises concerns about its wider applicability. Recent advancements have underscored the crucial role of causality-driven approaches in overcoming these generalization challenges. Distinct from traditional GML methods that primarily rely on statistical dependencies, causality-focused strategies delve into the underlying causal mechanisms of data generation and model prediction, thus significantly improving the generalization of GML across different environments. This paper offers a thorough review of recent progress in causality-involved GML generalization. We elucidate the fundamental concepts of employing causality to enhance graph model generalization and categorize the various approaches, providing detailed descriptions of their methodologies and the connections among them. Furthermore, we explore the incorporation of causality in other related important areas of trustworthy GML, such as explanation, fairness, and robustness. Concluding with a discussion on potential future research directions, this review seeks to articulate the continuing development and future potential of causality in enhancing the trustworthiness of graph machine learning.

[LG-67] A Benchmark Dataset with Larger Context for Non-Factoid Question Answering over Islamic Text

链接: https://arxiv.org/abs/2409.09844
作者: Faiza Qamar,Seemab Latif,Rabia Latif
关键词-EN: Prophet Muhammad, today digital era, digital era necessitates, era necessitates efficient, Accessing and comprehending
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accessing and comprehending religious texts, particularly the Quran (the sacred scripture of Islam) and Ahadith (the corpus of the sayings or traditions of the Prophet Muhammad), in today’s digital era necessitates efficient and accurate Question-Answering (QA) systems. Yet, the scarcity of QA systems tailored specifically to the detailed nature of inquiries about the Quranic Tafsir (explanation, interpretation, context of Quran for clarity) and Ahadith poses significant challenges. To address this gap, we introduce a comprehensive dataset meticulously crafted for QA purposes within the domain of Quranic Tafsir and Ahadith. This dataset comprises a robust collection of over 73,000 question-answer pairs, standing as the largest reported dataset in this specialized domain. Importantly, both questions and answers within the dataset are meticulously enriched with contextual information, serving as invaluable resources for training and evaluating tailored QA systems. However, while this paper highlights the dataset’s contributions and establishes a benchmark for evaluating QA performance in the Quran and Ahadith domains, our subsequent human evaluation uncovered critical insights regarding the limitations of existing automatic evaluation techniques. The discrepancy between automatic evaluation metrics, such as ROUGE scores, and human assessments became apparent. The human evaluation indicated significant disparities: the model’s verdict consistency with expert scholars ranged between 11% to 20%, while its contextual understanding spanned a broader spectrum of 50% to 90%. These findings underscore the necessity for evaluation techniques that capture the nuances and complexities inherent in understanding religious texts, surpassing the limitations of traditional automatic metrics.

[LG-68] Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling

链接: https://arxiv.org/abs/2409.09831
作者: Samuel Belkadi,Libo Ren,Nicolo Micheletti,Lifeng Han,Goran Nenadic
关键词-EN: Masked Language Modeling, Language Modeling, Masked Language, generates synthetic free-text, free-text medical records
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present a system that generates synthetic free-text medical records, such as discharge summaries, admission notes and doctor correspondences, using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk. The system incorporates a de-identification component that uses Philter to mask Protected Health Information (PHI), followed by a Medical Entity Recognition (NER) model to retain key medical information. We explore various masking ratios and mask-filling techniques to balance the trade-off between diversity and fidelity in the synthetic outputs without affecting overall readability. Our results demonstrate that the system can produce high-quality synthetic data with significant diversity while achieving a HIPAA-compliant PHI recall rate of 0.96 and a low re-identification risk of 0.035. Furthermore, downstream evaluations using a NER task reveal that the synthetic data can be effectively used to train models with performance comparable to those trained on real data. The flexibility of the system allows it to be adapted for specific use cases, making it a valuable tool for privacy-preserving data generation in medical research and healthcare applications.
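
The core mask-and-fill idea can be illustrated with a generic pretrained masked language model via the Hugging Face fill-mask pipeline. This is a minimal sketch under assumptions (a vanilla BERT model, a 15% masking ratio, one mask filled at a time); the paper's Philter-based de-identification and medical NER components are omitted.

```python
# Minimal sketch of MLM-based mask-and-fill text perturbation. Model choice and
# masking ratio are illustrative assumptions, not the paper's configuration.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT

def perturb(text: str, ratio: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = text.split()
    for i in range(len(words)):
        if rng.random() < ratio:
            masked = " ".join(words[:i] + [MASK] + words[i + 1:])
            # Take the highest-scoring replacement for the single masked slot.
            words[i] = fill_mask(masked)[0]["token_str"].strip()
    return " ".join(words)

print(perturb("Patient was admitted with chest pain and shortness of breath."))
```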

[LG-69] Latent Diffusion Models for Controllable RNA Sequence Generation

链接: https://arxiv.org/abs/2409.09828
作者: Kaixuan Huang,Yukang Yang,Kaidi Fu,Yanyi Chu,Le Cong,Mengdi Wang
关键词-EN: optimizing discrete RNA, RNA, paper presents RNAdiffusion, RNA sequences, discrete RNA sequences
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:This paper presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences. RNA is a particularly dynamic and versatile molecule in biological processes. RNA sequences exhibit high variability and diversity, characterized by their variable lengths, flexible three-dimensional structures, and diverse functions. We utilize pretrained BERT-type models to encode raw RNAs into token-level biologically meaningful representations. A Q-Former is employed to compress these representations into a fixed-length set of latent vectors, with an autoregressive decoder trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we train reward networks to estimate functional properties of RNA from the latent variables. We employ gradient-based guidance during the backward diffusion process, aiming to generate RNA sequences that are optimized for higher rewards. Empirical experiments confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological indicators. We fine-tuned the diffusion model on untranslated regions (UTRs) of mRNA and optimize sample sequences for protein translation efficiencies. Our guided diffusion model effectively generates diverse UTR sequences with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), surpassing baselines. These results hold promise for studies on RNA sequence-function relationships, protein synthesis, and enhancing therapeutic RNA design.

[LG-70] A Simpler Alternative to Variational Regularized Counterfactual Risk Minimization RECSYS’24

链接: https://arxiv.org/abs/2409.09819
作者: Hua Chang Bakker,Shashank Gupta,Harrie Oosterhuis
关键词-EN: Variance regularized counterfactual, counterfactual risk minimization, regularized counterfactual risk, Variance regularized, alternative off-policy learning
类目: Machine Learning (cs.LG)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Variance regularized counterfactual risk minimization (VRCRM) has been proposed as an alternative off-policy learning (OPL) method. The VRCRM method uses a lower bound on the f-divergence between the logging policy and the target policy as regularization during learning, and was shown to improve performance over existing OPL alternatives on multi-label classification tasks. In this work, we revisit the original experimental setting of VRCRM and propose to minimize the f-divergence directly, instead of optimizing for the lower bound using an f-GAN approach. Surprisingly, we were unable to reproduce the results reported in the original setting. In response, we propose a novel, simpler alternative to f-divergence optimization by minimizing a direct approximation of the f-divergence, instead of an f-GAN-based lower bound. Experiments showed that minimizing the divergence using f-GANs did not work as expected, whereas our proposed simpler alternative works better empirically.
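
For context, the quantity involved is the f-divergence and its variational (f-GAN) lower bound; a direct approximation replaces the inner maximization over a critic with a plug-in estimate of the density ratio. These are the standard definitions, not the paper's exact estimator.

```latex
% f-divergence between target policy distribution P and logging distribution Q,
% and the variational (f-GAN) lower bound, with f^* the convex conjugate of f:
D_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\!\left[ f\!\left(\tfrac{p(x)}{q(x)}\right) \right]
\;\ge\; \sup_{T}\; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^{*}(T(x))].
% A "direct" alternative plugs an estimate of the ratio p(x)/q(x) into the
% left-hand side instead of training a critic T for the right-hand side.
```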

[LG-71] PROSE-FD: A Multimodal PDE Foundation Model for Learning Multiple Operators for Forecasting Fluid Dynamics

链接: https://arxiv.org/abs/2409.09811
作者: Yuxuan Liu,Jingmin Sun,Xinjie He,Griffin Pinney,Zecheng Zhang,Hayden Schaeffer
关键词-EN: distinct fluid dynamics, fluid dynamics settings, zero-shot multimodal PDE, multimodal PDE foundational, heterogeneous two-dimensional physical
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We propose PROSE-FD, a zero-shot multimodal PDE foundational model for simultaneous prediction of heterogeneous two-dimensional physical systems related to distinct fluid dynamics settings. These systems include shallow water equations and the Navier-Stokes equations with incompressible and compressible flow, regular and complex geometries, and different buoyancy settings. This work presents a new transformer-based multi-operator learning approach that fuses symbolic information to perform operator-based data prediction, i.e. non-autoregressive. By incorporating multiple modalities in the inputs, the PDE foundation model builds in a pathway for including mathematical descriptions of the physical behavior. We pre-train our foundation model on 6 parametric families of equations collected from 13 datasets, including over 60K trajectories. Our model outperforms popular operator learning, computer vision, and multi-physics models, in benchmark forward prediction tasks. We test our architecture choices with ablation studies.

[LG-72] Federated Learning in Adversarial Environments: Testbed Design and Poisoning Resilience in Cybersecurity

链接: https://arxiv.org/abs/2409.09794
作者: Hao Jian Huang,Bekzod Iskandarov,Mizanur Rahman,Hakan T. Otal,M. Abdullah Canbaz
关键词-EN: Federated Learning, paper presents, presents the design, design and implementation, evaluating its resilience
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:This paper presents the design and implementation of a Federated Learning (FL) testbed, focusing on its application in cybersecurity and evaluating its resilience against poisoning attacks. Federated Learning allows multiple clients to collaboratively train a global model while keeping their data decentralized, addressing critical needs for data privacy and security, particularly in sensitive fields like cybersecurity. Our testbed, built using the Flower framework, facilitates experimentation with various FL frameworks, assessing their performance, scalability, and ease of integration. Through a case study on federated intrusion detection systems, we demonstrate the testbed’s capabilities in detecting anomalies and securing critical infrastructure without exposing sensitive network data. Comprehensive poisoning tests, targeting both model and data integrity, evaluate the system’s robustness under adversarial conditions. Our results show that while federated learning enhances data privacy and distributed learning, it remains vulnerable to poisoning attacks, which must be mitigated to ensure its reliability in real-world applications.

[LG-73] Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data

链接: https://arxiv.org/abs/2409.09792
作者: Xu Sun,Zixuan Qin,Shun Zhang,Yuexian Wang,Li Huang
关键词-EN: significant economic implications, credit default prediction, high-risk class instances, fraud detection, accurate identification
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the financial risk domain, particularly in credit default prediction and fraud detection, accurate identification of high-risk class instances is paramount, as their occurrence can have significant economic implications. Although machine learning models have gained widespread adoption for risk prediction, their performance is often hindered by the scarcity and diversity of high-quality data. This limitation stems from factors in datasets such as small risk sample sizes, high labeling costs, and severe class imbalance, which impede the models' ability to learn effectively and accurately forecast critical events. This study investigates data pre-processing techniques to enhance existing financial risk datasets by introducing TriEnhance, a straightforward technique that entails: (1) generating synthetic samples specifically tailored to the minority class, (2) filtering using binary feedback to refine samples, and (3) self-learning with pseudo-labels. Our experiments across six benchmark datasets reveal the efficacy of TriEnhance, with a notable focus on improving minority class calibration, a key factor for developing more robust financial risk prediction systems.
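
The three-step recipe above can be sketched with plain scikit-learn. This is a rough, hypothetical illustration of the general pattern (minority oversampling, feedback-based filtering, pseudo-label self-learning), not the authors' TriEnhance code; the interpolation rule, confidence threshold, and classifier are assumptions.

```python
# Hypothetical sketch: (1) naively oversample the minority class, (2) filter
# synthetic samples with a model trained on real data, (3) self-train with
# confident pseudo-labels on unlabeled data. Assumes binary labels {0, 1},
# with 1 the minority class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def trienhance_like(X, y, X_unlabeled, n_synth=500, conf=0.9, seed=0):
    rng = np.random.default_rng(seed)
    minority = X[y == 1]
    # (1) Synthetic minority samples: interpolate between random minority pairs.
    i, j = rng.integers(len(minority), size=(2, n_synth))
    lam = rng.random((n_synth, 1))
    X_synth = lam * minority[i] + (1 - lam) * minority[j]
    # (2) Binary-feedback filtering: keep synthetic points that a model trained
    #     on real data still classifies as minority with high confidence.
    clf = RandomForestClassifier(random_state=seed).fit(X, y)
    keep = clf.predict_proba(X_synth)[:, 1] > conf
    X_aug = np.vstack([X, X_synth[keep]])
    y_aug = np.concatenate([y, np.ones(keep.sum(), dtype=int)])
    # (3) Self-learning: add confidently pseudo-labeled unlabeled samples.
    clf = RandomForestClassifier(random_state=seed).fit(X_aug, y_aug)
    proba = clf.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) > conf
    X_final = np.vstack([X_aug, X_unlabeled[confident]])
    y_final = np.concatenate([y_aug, proba[confident].argmax(axis=1)])
    return RandomForestClassifier(random_state=seed).fit(X_final, y_final)
```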

[LG-74] BEnDEM: A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching

链接: https://arxiv.org/abs/2409.09787
作者: RuiKang OuYang,Bo Qiang,José Miguel Hernández-Lobato
关键词-EN: Developing an efficient, efficient sampler capable, Boltzmann distribution, molecular dynamics, identically distributed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 20 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g. molecular dynamics. In this work, we intend to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY MATCHING, which theoretically has lower variance and more complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to EnDEM to balance between bias and variance. We evaluate EnDEM and BEnDEM on a 2-dimensional 40 Gaussian Mixture Model (GMM) and a 4-particle double-welling potential (DW-4). The experimental results demonstrate that BEnDEM can achieve state-of-the-art performance while being more robust.

[LG-75] Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

链接: https://arxiv.org/abs/2409.09785
作者: Chao-Han Huck Yang,Taejin Park,Yuan Gong,Yuanchao Li,Zhehuai Chen,Yen-Ting Lin,Chen Chen,Yuchen Hu,Kunal Dhawan,Piotr Żelasko,Chao Zhang,Yun-Nung Chen,Yu Tsao,Jagadeesh Balam,Boris Ginsburg,Sabato Marco Siniscalchi,Eng Siong Chng,Peter Bell,Catherine Lai,Shinji Watanabe,Andreas Stolcke
关键词-EN: text decoding results, enhance acoustic modeling, automatic speech recognition, ASR, recent advances
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: IEEE SLT 2024. The initial draft version has been done in December 2023. Post-ASR Text Processing and Understanding Community: this https URL

点击查看摘要

Abstract:Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.

[LG-76] Learning Rate Optimization for Deep Neural Networks Using Lipschitz Bandits

链接: https://arxiv.org/abs/2409.09783
作者: Padma Priyanka,Sheetal Kalyani,Avhishek Chatterjee
关键词-EN: Learning rate, crucial parameter, neural networks, Learning, tuned learning rate
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning rate is a crucial parameter in training of neural networks. A properly tuned learning rate leads to faster training and higher test accuracy. In this paper, we propose a Lipschitz bandit-driven approach for tuning the learning rate of neural networks. The proposed approach is compared with the popular HyperOpt technique used extensively for hyperparameter optimization and the recently developed bandit-based algorithm BLiE. The results for multiple neural network architectures indicate that our method finds a better learning rate using a) fewer evaluations and b) lesser number of epochs per evaluation, when compared to both HyperOpt and BLiE. Thus, the proposed approach enables more efficient training of neural networks, leading to lower training time and lesser computational cost.
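
The general bandit-over-learning-rates idea can be illustrated with a plain discretized UCB loop: each candidate learning rate is an arm, and each pull is a short training run. This is a simplified sketch, not the Lipschitz-bandit algorithm of the paper; `train_and_eval` is a placeholder the reader supplies.

```python
# Simplified sketch: treat learning rates on a log-spaced grid as bandit arms
# and allocate short training runs with UCB1. `train_and_eval(lr)` is assumed
# to return a reward such as validation accuracy after one epoch at that lr.
import numpy as np

def tune_lr(train_and_eval, budget=30):
    lrs = np.logspace(-5, -1, num=9)             # candidate learning rates
    counts = np.zeros(len(lrs))
    means = np.zeros(len(lrs))
    for t in range(1, budget + 1):
        if t <= len(lrs):                        # pull each arm once first
            arm = t - 1
        else:                                    # then UCB1 selection
            ucb = means + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = train_and_eval(lrs[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return lrs[int(np.argmax(means))]
```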

[LG-77] Rewind-to-Delete: Certified Machine Unlearning for Nonconvex Functions

链接: https://arxiv.org/abs/2409.09778
作者: Siqiao Mu,Diego Klabjan
关键词-EN: enforce data privacy, efficiently remove data, respect a user, efficiently remove, remove corrupted
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning algorithms aim to efficiently remove data from a model without retraining it from scratch, in order to enforce data privacy, remove corrupted or outdated data, or respect a user's "right to be forgotten." Certified machine unlearning is a strong theoretical guarantee that quantifies the extent to which data is erased from the model weights. Most prior works in certified unlearning focus on models trained on convex or strongly convex loss functions, which benefit from convenient convergence guarantees and the existence of global minima. For nonconvex objectives, existing algorithms rely on limiting assumptions and expensive computations that hinder practical implementations. In this work, we propose a simple first-order algorithm for unlearning on general nonconvex loss functions which unlearns by "rewinding" to an earlier step during the learning process and then performs gradient descent on the loss function of the retained data points. Our algorithm is black-box, in that it can be directly applied to models pretrained with vanilla gradient descent with no prior consideration of unlearning. We prove (\epsilon, \delta) certified unlearning and performance guarantees that establish the privacy-utility-complexity tradeoff of our algorithm, with special consideration for nonconvex functions that satisfy the Polyak-Lojasiewicz inequality.
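
The "rewind, then fine-tune on retained data" idea translates to a very small PyTorch sketch. Checkpoint cadence and the number of repair epochs below are assumptions, and the certified-unlearning guarantees in the paper require additional analysis that this sketch does not reproduce.

```python
# Minimal PyTorch sketch of rewind-to-delete: keep checkpoints during training,
# then rewind to an earlier checkpoint and run gradient descent on the retained
# data only. Hyperparameters are illustrative assumptions.
import copy
import torch

def train_with_checkpoints(model, loader, epochs, lr=1e-2, every=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    checkpoints = {}
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        if epoch % every == 0:
            checkpoints[epoch] = copy.deepcopy(model.state_dict())
    return model, checkpoints

def unlearn(model, checkpoints, rewind_epoch, retained_loader, repair_epochs=2, lr=1e-2):
    model.load_state_dict(checkpoints[rewind_epoch])    # rewind
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(repair_epochs):                      # descend on retained data only
        for x, y in retained_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```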

[LG-78] Towards Multi-view Graph Anomaly Detection with Similarity-Guided Contrastive Clustering

链接: https://arxiv.org/abs/2409.09770
作者: Lecheng Zheng,John R. Birge,Yifang Zhang,Jingrui He
关键词-EN: Anomaly detection, real-world applications, plays an important, important role, contrastive learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection on graphs plays an important role in many real-world applications. Usually, these data are composed of multiple types (e.g., user information and transaction records for financial data), thus exhibiting view heterogeneity. Therefore, it can be challenging to leverage such multi-view information and learn the graph’s contextual information to identify rare anomalies. To tackle this problem, many deep learning-based methods utilize contrastive learning loss as a regularization term to learn good representations. However, many existing contrastive-based methods show that traditional contrastive learning losses fail to consider the semantic information (e.g., class membership information). In addition, we theoretically show that clustering-based contrastive learning also easily leads to a sub-optimal solution. To address these issues, in this paper, we proposed an autoencoder-based clustering framework regularized by a similarity-guided contrastive loss to detect anomalous nodes. Specifically, we build a similarity map to help the model learn robust representations without imposing a hard margin constraint between the positive and negative pairs. Theoretically, we show that the proposed similarity-guided loss is a variant of contrastive learning loss, and how it alleviates the issue of unreliable pseudo-labels with the connection to graph spectral clustering. Experimental results on several datasets demonstrate the effectiveness and efficiency of our proposed framework.

[LG-79] Analysis of Centrifugal Clutches in Two-Speed Automatic Transmissions with Deep Learning-Based Engagement Prediction

链接: https://arxiv.org/abs/2409.09755
作者: Bo-Yi Lin,Kai Chun Lin
关键词-EN: automotive torque transfer, two-speed automatic transmission, comprehensive numerical analysis, clutch systems integrated, paper presents
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive numerical analysis of centrifugal clutch systems integrated with a two-speed automatic transmission, a key component in automotive torque transfer. Centrifugal clutches enable torque transmission based on rotational speed without external controls. The study systematically examines the effects of various clutch configurations on transmission dynamics, focusing on torque transfer, upshifting, and downshifting behaviors under different conditions. A Deep Neural Network (DNN) model predicts clutch engagement using parameters such as spring preload and shoe mass, offering an efficient alternative to complex simulations. The integration of deep learning and numerical modeling provides critical insights for optimizing clutch designs, enhancing transmission performance and efficiency.

[LG-80] The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

链接: https://arxiv.org/abs/2409.09745
作者: Haihan Zhang,Yuanshi Liu,Qianwen Chen,Cong Fang
关键词-EN: Stochastic gradient descent, neural network training, Stochastic gradient, SGD, gradient descent
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 46 pages

点击查看摘要

Abstract:Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training. Recent studies on SGD for canonical quadratic optimization or linear regression show that it generalizes well under suitable high-dimensional settings. However, a fundamental question, namely for what kinds of high-dimensional learning problems SGD and its accelerated variants can achieve optimality, has yet to be well studied. This paper investigates SGD with two essential components in practice: an exponentially decaying step size schedule and momentum. We establish the convergence upper bound for momentum accelerated SGD (ASGD) and propose concrete classes of learning problems under which SGD or ASGD achieves min-max optimal convergence rates. The characterization of the target function is based on standard power-law decays in (functional) linear regression. Our results unveil new insights for understanding the learning bias of SGD: (i) SGD is efficient in learning "dense" features where the corresponding weights are subject to an infinity norm constraint; (ii) SGD is efficient for easy problems without suffering from the saturation effect; (iii) momentum can accelerate the convergence rate by order when the learning problem is relatively hard. To our knowledge, this is the first work to clearly identify the optimal boundary of SGD versus ASGD for the problem under mild settings.
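
For reference, the two practical ingredients analyzed above, exponentially decaying step sizes and heavy-ball momentum, can be written in update form. These are the standard formulas, not the paper's exact notation.

```latex
% SGD with heavy-ball momentum and an exponentially decaying step size schedule
% (standard form; the paper's own notation and assumptions may differ):
\eta_t = \eta_0 \, \gamma^{\lfloor t / T_0 \rfloor}, \qquad 0 < \gamma < 1,
\\
v_{t+1} = \beta v_t + \nabla_w \, \ell(w_t; x_t, y_t), \qquad
w_{t+1} = w_t - \eta_t \, v_{t+1}.
```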

[LG-81] OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data

链接: https://arxiv.org/abs/2409.09742
作者: Sebastian Wette,Florian Heinrichs
关键词-EN: Time series, variety of applications, manufacturing processes, financial data streams, ubiquitous and occur
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Time series are ubiquitous and occur naturally in a variety of applications – from data recorded by sensors in manufacturing processes, over financial data streams to climate data. Different tasks arise, such as regression, classification or segmentation of the time series. However, to reliably solve these challenges, it is important to filter out abnormal observations that deviate from the usual behavior of the time series. While many anomaly detection methods exist for independent data and stationary time series, these methods are not applicable to non-stationary time series. To allow for non-stationarity in the data, while simultaneously detecting anomalies, we propose OML-AD, a novel approach for anomaly detection (AD) based on online machine learning (OML). We provide an implementation of OML-AD within the Python library River and show that it outperforms state-of-the-art baseline methods in terms of accuracy and computational efficiency.
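
The online-learning pattern the abstract relies on (score each incoming point, then update the model state one observation at a time) can be shown with a toy detector. The example below is a plain exponentially weighted z-score, not OML-AD itself and not its River implementation.

```python
# Toy illustration of the online pattern: score-then-learn, one point at a time.
# An exponentially weighted mean/variance tracks the (possibly non-stationary)
# signal, and the anomaly score is a z-score against that running estimate.
import math

class EWZScore:
    def __init__(self, alpha=0.05):
        self.alpha, self.mean, self.var, self.n = alpha, 0.0, 1.0, 0

    def score_one(self, x: float) -> float:
        if self.n == 0:
            return 0.0
        return abs(x - self.mean) / math.sqrt(self.var + 1e-12)

    def learn_one(self, x: float) -> "EWZScore":
        self.n += 1
        diff = x - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return self

detector = EWZScore()
for t, x in enumerate([1.0, 1.1, 0.9, 1.05, 8.0, 1.0]):
    s = detector.score_one(x)
    detector.learn_one(x)
    print(f"t={t} x={x} anomaly_score={s:.2f}")
```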

[LG-82] From Challenges and Pitfalls to Recommendations and Opportunities: Implementing Federated Learning in Healthcare

链接: https://arxiv.org/abs/2409.09727
作者: Ming Li,Pengcheng Xu,Junjie Hu,Zeyu Tang,Guang Yang
关键词-EN: holds great potential, learning holds great, enabling large-scale healthcare, large-scale healthcare research, Federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning holds great potential for enabling large-scale healthcare research and collaboration across multiple centres while ensuring data privacy and security are not compromised. Although numerous recent studies suggest or utilize federated learning based methods in healthcare, it remains unclear which ones have potential clinical utility. This review paper considers and analyzes the most recent studies up to May 2024 that describe federated learning based methods in healthcare. After a thorough review, we find that the vast majority are not appropriate for clinical use due to their methodological flaws and/or underlying biases which include but are not limited to privacy concerns, generalization issues, and communication costs. As a result, the effectiveness of federated learning in healthcare is significantly compromised. To overcome these challenges, we provide recommendations and promising opportunities that might be implemented to resolve these problems and improve the quality of model development in federated learning with healthcare.

[LG-83] Measuring Recency Bias In Sequential Recommendation Systems RECSYS’24

链接: https://arxiv.org/abs/2409.09722
作者: Jeonglyul Oh,Sungzoon Cho
关键词-EN: overly high emphasis, sequential recommendation system, recommendation system refers, Recency bias, recent items
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Recency bias in a sequential recommendation system refers to the overly high emphasis placed on recent items within a user session. This bias can diminish the serendipity of recommendations and hinder the system’s ability to capture users’ long-term interests, leading to user disengagement. We propose a simple yet effective novel metric specifically designed to quantify recency bias. Our findings also demonstrate that high recency bias measured in our proposed metric adversely impacts recommendation performance too, and mitigating it results in improved recommendation performances across all models evaluated in our experiments, thus highlighting the importance of measuring recency bias.

[LG-84] Finetuning CLIP to Reason about Pairwise Differences

链接: https://arxiv.org/abs/2409.09721
作者: Dylan Sam,Devin Willmott,Joao D. Semedo,J. Zico Kolter
关键词-EN: embedding space, Vision-language models, resulting embedding space, CLIP, embedding
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of their purely text-based alternatives. For instance, while text embeddings have been long noted to satisfy analogies in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that the differences in image embedding space correspond to text descriptions of the image differences, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zeroshot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.
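
The training signal described above (image-embedding differences should match text descriptions of those differences) can be sketched as a CLIP-style symmetric InfoNCE loss over a batch. The encoders are placeholders and this is not the authors' exact objective.

```python
# Sketch: align the difference of two image embeddings with the embedding of a
# text description of that difference, via a symmetric contrastive loss.
import torch
import torch.nn.functional as F

def pairwise_difference_loss(img_a, img_b, diff_text, temperature=0.07):
    """img_a, img_b, diff_text: (B, D) embeddings from placeholder encoders."""
    diff = F.normalize(img_a - img_b, dim=-1)        # image-embedding differences
    txt = F.normalize(diff_text, dim=-1)             # embeddings of difference texts
    logits = diff @ txt.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(len(diff), device=diff.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random tensors standing in for CLIP outputs:
B, D = 8, 512
loss = pairwise_difference_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```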

[LG-85] ELSA: Exploiting Layer-wise N:M Sparsity for Vision Transformer Acceleration

链接: https://arxiv.org/abs/2409.09708
作者: Ning-Chi Huang,Chi-Chih Chang,Wei-Cheng Lin,Endri Taka,Diana Marculescu,Kai-Chiang Wu
关键词-EN: deep neural networks, sparse matrix multiplication, emerging model compression, neural networks, matrix multiplication
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:N:M sparsity is an emerging model compression method supported by more and more accelerators to speed up sparse matrix multiplication in deep neural networks. Most existing N:M sparsity methods compress neural networks with a uniform setting for all layers in a network or heuristically determine the layer-wise configuration by considering the number of parameters in each layer. However, very few methods have been designed for obtaining a layer-wise customized N:M sparse configuration for vision transformers (ViTs), which usually consist of transformer blocks involving the same number of parameters. In this work, to address the challenge of selecting suitable sparse configuration for ViTs on N:M sparsity-supporting accelerators, we propose ELSA, Exploiting Layer-wise N:M Sparsity for ViTs. Considering not only all N:M sparsity levels supported by a given accelerator but also the expected throughput improvement, our methodology can reap the benefits of accelerators supporting mixed sparsity by trading off negligible accuracy loss with both memory usage and inference time reduction for ViT models. For instance, our approach achieves a noteworthy 2.9× reduction in FLOPs for both Swin-B and DeiT-B with only a marginal degradation of accuracy on ImageNet. Our code will be released upon paper acceptance.
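
For readers unfamiliar with the N:M pattern the abstract assumes: in every group of M consecutive weights, only the N with largest magnitude are kept. The snippet below shows just this masking pattern (e.g. 2:4), not ELSA's layer-wise configuration search.

```python
# Illustration of N:M structured sparsity: keep the N largest-magnitude weights
# in each group of M consecutive weights and zero the rest.
import torch

def nm_prune(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    flat = weight.reshape(-1, m)                       # groups of M weights
    idx = flat.abs().topk(n, dim=1).indices            # N largest per group
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(weight.shape)

w = torch.randn(4, 8)
print(nm_prune(w, n=2, m=4))                           # 2:4 sparse version of w
```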

[LG-86] AlpaPICO: Extraction of PICO Frames from Clinical Trial Documents Using LLMs

链接: https://arxiv.org/abs/2409.09704
作者: Madhusudan Ghosh,Shrimon Mukherjee,Asmit Ganguly,Partha Basuchowdhuri,Sudip Kumar Naskar,Debasis Ganguly
关键词-EN: clinical trial reports, clinical trial, conduct systematic reviews, scrutinizing systematic reviews, systematic reviews
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at Methods

点击查看摘要

Abstract:In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting Population, Intervention, Comparator, and Outcome (PICO) from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches to PICO frame extraction are supervised and rely on manually annotated data points in the form of BIO label tagging. Recent approaches, such as In-Context Learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, require the use of labeled examples. In this work, we adopt an ICL strategy that employs the pretrained knowledge of Large Language Models (LLMs), gathered during the pretraining phase, to automatically extract PICO-related terminologies from clinical trial documents in an unsupervised setup, bypassing the need for a large number of annotated data instances. Additionally, to showcase the effectiveness of LLMs in an oracle scenario where a large number of annotated samples is available, we adopt an instruction-tuning strategy that employs Low Rank Adaptation (LORA) to train a gigantic model in a low-resource environment for the PICO frame extraction task. Our empirical results show that our proposed ICL-based framework produces comparable results on all versions of the EBM-NLP datasets, and the proposed instruction-tuned version of our framework produces state-of-the-art results on all the different EBM-NLP datasets. Our project is available at this https URL.

[LG-87] GFlowNet Pretraining with Inexpensive Rewards

链接: https://arxiv.org/abs/2409.09702
作者: Mohit Pandey,Gopeshh Subbaraj,Emmanuel Bengio
关键词-EN: Generative Flow Networks, Flow Networks, unnormalized reward distributions, Generative Flow, high-quality molecular structures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets), a class of generative models have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from unnormalized reward distributions. Previous works in this direction often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using offline drug-like molecule datasets, which conditions A-GFNs on inexpensive yet informative molecular descriptors such as drug-likeliness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further our method by implementing a goal-conditioned fine-tuning process, which adapts A-GFNs to optimize for specific target properties. In this work, we pretrain A-GFN on the ZINC15 offline dataset and employ robust evaluation metrics to show the effectiveness of our approach when compared to other relevant baseline methods in drug design.

[LG-88] Predicting building types and functions at transnational scale

链接: https://arxiv.org/abs/2409.09692
作者: Jonas Fill,Michael Eichelbeck,Michael Ebner
关键词-EN: numerous energy applications, Building-specific knowledge, energy applications, important for numerous, numerous energy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building-specific knowledge such as building type and function information is important for numerous energy applications. However, comprehensive datasets containing this information for individual households are missing in many regions of Europe. For the first time, we investigate whether it is feasible to predict building types and functional classes at a European scale based on only open GIS datasets available across countries. We train a graph neural network (GNN) classifier on a large-scale graph dataset consisting of OpenStreetMap (OSM) buildings across the EU, Norway, Switzerland, and the UK. To efficiently perform training using the large-scale graph, we utilize localized subgraphs. A graph transformer model achieves a high Cohen’s kappa coefficient of 0.754 when classifying buildings into 9 classes, and a very high Cohen’s kappa coefficient of 0.844 when classifying buildings into the residential and non-residential classes. The experimental results imply three core novel contributions to literature. Firstly, we show that building classification across multiple countries is possible using a multi-source dataset consisting of information about 2D building shape, land use, degree of urbanization, and countries as input, and OSM tags as ground truth. Secondly, our results indicate that GNN models that consider contextual information about building neighborhoods improve predictive performance compared to models that only consider individual buildings and ignore the neighborhood. Thirdly, we show that training with GNNs on localized subgraphs instead of standard GNNs improves performance for the task of building classification.

[LG-89] Training Safe Neural Networks with Global SDP Bounds

链接: https://arxiv.org/abs/2409.09687
作者: Roman Soletskyi,David “davidad” Dalrymple
关键词-EN: formal safety guarantees, semidefinite programming, SDP, paper presents, guarantees using semidefinite
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach to training neural networks with formal safety guarantees using semidefinite programming (SDP) for verification. Our method focuses on verifying safety over large, high-dimensional input regions, addressing limitations of existing techniques that focus on adversarial robustness bounds. We introduce an ADMM-based training scheme for an accurate neural network classifier on the Adversarial Spheres dataset, achieving provably perfect recall with input dimensions up to d=40 . This work advances the development of reliable neural network verification methods for high-dimensional systems, with potential applications in safe RL policies.

[LG-90] Mitigating Dimensionality in 2D Rectangle Packing Problem under Reinforcement Learning Schema

链接: https://arxiv.org/abs/2409.09677
作者: Waldemar Kołodziejczyk,Mariusz Kaleta
关键词-EN: Reinforcement Learning, two-dimensional rectangular packing, application of Reinforcement, Proximal Policy Optimization, paper explores
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 5th Polish Conference on Artificial Intelligence

点击查看摘要

Abstract:This paper explores the application of Reinforcement Learning (RL) to the two-dimensional rectangular packing problem. We propose a reduced representation of the state and action spaces that allow us for high granularity. Leveraging UNet architecture and Proximal Policy Optimization (PPO), we achieved a model that is comparable to the MaxRect heuristic. However, our approach has great potential to be generalized to nonrectangular packing problems and complex constraints.

[LG-91] Model Selection Through Model Sorting

链接: https://arxiv.org/abs/2409.09674
作者: Mohammad Ali Hajiani,Babak Seyfe
关键词-EN: model, risk, approach to select, model order, risk minimizer predictor
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 55 pages, 4 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, October 26, 2023

点击查看摘要

Abstract:We propose a novel approach to select the best model of the data. Based on the exclusive properties of the nested models, we find the most parsimonious model containing the risk minimizer predictor. We prove the existence of probably approximately correct (PAC) bounds on the difference of the minimum empirical risk of two successive nested models, called successive empirical excess risk (SEER). Based on these bounds, we propose a model order selection method called nested empirical risk (NER). By sorting the models intelligently with the sorted NER (S-NER) method, the minimum risk decreases. We construct a test that predicts whether expanding the model decreases the minimum risk or not. With a high probability, the NER and S-NER choose the true model order and the most parsimonious model containing the risk minimizer predictor, respectively. We use S-NER model selection in linear regression and show that the S-NER method, without any prior information, can outperform the accuracy of feature-sorting algorithms like orthogonal matching pursuit (OMP) aided with prior knowledge of the true model order. Also, on the UCR datasets, the NER method dramatically reduces the complexity of classification with a negligible loss of accuracy.

[LG-92] KAN v.s. MLP for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2409.09653
作者: Haihong Guo,Fengxin Li,Jiao Li,Hongyan Liu
关键词-EN: emerging neural network, neural network architecture, emerging neural, KAN, Kolmogorov-Arnold Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages,2 figures

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KAN) is an emerging neural network architecture in machine learning. It has generated great interest in the research community as to whether KAN can be a promising alternative to the commonly used Multi-Layer Perceptrons (MLP). Experiments in various fields have demonstrated that KAN-based machine learning can achieve performance comparable to, if not better than, MLP-based methods, with much smaller parameter scales and better explainability. In this paper, we explore the incorporation of KAN into the actor and critic networks for offline reinforcement learning (RL). We evaluated the performance, parameter scales, and training efficiency of various KAN and MLP based conservative Q-learning (CQL) on the classical D4RL benchmark for offline RL. Our study demonstrates that KAN can achieve performance close to the commonly used MLP with significantly fewer parameters. This provides an option to choose the base networks according to the requirements of the offline RL tasks.

[LG-93] COSCO: A Sharpness-Aware Training Framework for Few-shot Multivariate Time Series Classification CIKM’24

链接: https://arxiv.org/abs/2409.09645
作者: Jesus Barreda,Ashley Gomez,Ruben Puga,Kaixiong Zhou,Li Zhang
关键词-EN: time series classification, Multivariate time series, time series, series classification, domains of applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 5 pages, 5 figures, CIKM '24 Short Paper Track

点击查看摘要

Abstract:Multivariate time series classification is an important task with widespread domains of applications. Recently, deep neural networks (DNN) have achieved state-of-the-art performance in time series classification. However, they often require large expert-labeled training datasets which can be infeasible in practice. In few-shot settings, i.e. only a limited number of samples per class are available in training data, DNNs show a significant drop in testing accuracy and poor generalization ability. In this paper, we propose to address these problems from an optimization and a loss function perspective. Specifically, we propose a new learning framework named COSCO consisting of a sharpness-aware minimization (SAM) optimization and a Prototypical loss function to improve the generalization ability of DNN for multivariate time series classification problems under few-shot setting. Our experiments demonstrate our proposed method outperforms the existing baseline methods. Our source code is available at: this https URL.
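
COSCO builds on sharpness-aware minimization (SAM). A bare-bones single SAM step is sketched below: perturb the weights along the gradient direction, recompute the gradient at the perturbed point, then update the original weights with that gradient. The prototypical loss and few-shot sampling used by COSCO are not shown, and rho is an illustrative value.

```python
# Minimal SAM step sketch: climb to a nearby "worst-case" perturbation of the
# weights, take the gradient there, then step the original weights with it.
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    model.zero_grad()
    # First pass: gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = rho * g / (norm + 1e-12) if g is not None else None
            if e is not None:
                p.add_(e)                      # move to the perturbed point
            eps.append(e)
    # Second pass: gradient at the perturbed weights.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                      # return to the original weights
    base_opt.step()                            # update with the sharpness-aware gradient
    model.zero_grad()
    return loss.item()
```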

[LG-94] A Novel Framework For Text Detection From Natural Scene Images With Complex Background

链接: https://arxiv.org/abs/2409.09635
作者: Basavaraj Kaladagi,Jagadeesh Pujari
关键词-EN: Recognizing texts, hard problem, varied and complicated, Wavelet Transforms, Recognizing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recognizing text from camera images is a known hard problem because of the difficulty of detecting text against varied and complicated backgrounds. In this paper we propose a novel and efficient method to detect text regions in images with complex backgrounds using Wavelet Transforms. The framework applies a Wavelet Transform to the grayscale version of the original image, followed by sub-band filtering. A region clustering technique is then applied using the centroids of the regions, and a bounding box is fitted to each region, thus identifying the text regions. This method is more sophisticated and efficient than previous methods, as it is not tied to a particular font size of the text and is therefore more general. The sample set used for experimental purposes consists of 50 images with varying backgrounds. Images with edge prominence are considered. Furthermore, our method can be easily customized for applications with different scopes.
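
A rough outline of the pipeline above (2-D wavelet transform, thresholding of the detail sub-bands, grouping high-energy regions, fitting bounding boxes) can be put together with pywt and scipy. The wavelet choice, single decomposition level, and threshold below are assumptions, not the paper's settings.

```python
# Rough outline: wavelet decomposition, edge-energy thresholding of the detail
# sub-bands, connected-component grouping, and bounding boxes as text candidates.
import numpy as np
import pywt
from scipy import ndimage

def candidate_text_boxes(gray: np.ndarray, thresh_quantile: float = 0.95):
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(float), "haar")
    energy = cH ** 2 + cV ** 2 + cD ** 2            # edge energy per coefficient
    mask = energy > np.quantile(energy, thresh_quantile)
    labeled, _ = ndimage.label(mask)
    boxes = []
    for sl in ndimage.find_objects(labeled):
        if sl is None:
            continue
        r, c = sl
        # Coefficients are at half resolution, so scale coordinates back up.
        boxes.append((2 * r.start, 2 * c.start, 2 * r.stop, 2 * c.stop))
    return boxes
```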

[LG-95] Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics

链接: https://arxiv.org/abs/2409.09626
作者: Yi Ren,Danica J. Sutherland
关键词-EN: Obtaining compositional mappings, Obtaining compositional, Obtaining, generalize well compositionally, compositional mappings
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 4 pages

点击查看摘要

Abstract:Obtaining compositional mappings is important for the model to generalize well compositionally. To better understand when and how to encourage the model to learn such mappings, we study their uniqueness through different perspectives. Specifically, we first show that the compositional mappings are the simplest bijections through the lens of coding length (i.e., an upper bound of their Kolmogorov complexity). This property explains why models having such mappings can generalize well. We further show that the simplicity bias is usually an intrinsic property of neural network training via gradient descent. That partially explains why some models spontaneously generalize well when they are trained appropriately.

[LG-96] HJ-sampler: A Bayesian sampler for inverse problems of a stochastic process by leveraging Hamilton-Jacobi PDEs and score-based generative models

链接: https://arxiv.org/abs/2409.09614
作者: Tingwei Meng,Zongren Zou,Jérôme Darbon,George Em Karniadakis
关键词-EN: stochastic processes, extensively explored, linear operator, stochastic, stochastic optimal control
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:The interplay between stochastic processes and optimal control has been extensively explored in the literature. With the recent surge in the use of diffusion models, stochastic processes have increasingly been applied to sample generation. This paper builds on the log transform, known as the Cole-Hopf transform in Brownian motion contexts, and extends it within a more abstract framework that includes a linear operator. Within this framework, we found that the well-known relationship between the Cole-Hopf transform and optimal transport is a particular instance where the linear operator acts as the infinitesimal generator of a stochastic process. We also introduce a novel scenario where the linear operator is the adjoint of the generator, linking to Bayesian inference under specific initial and terminal conditions. Leveraging this theoretical foundation, we develop a new algorithm, named the HJ-sampler, for Bayesian inference for the inverse problem of a stochastic differential equation with given terminal observations. The HJ-sampler involves two stages: (1) solving the viscous Hamilton-Jacobi partial differential equations, and (2) sampling from the associated stochastic optimal control problem. Our proposed algorithm naturally allows for flexibility in selecting the numerical solver for viscous HJ PDEs. We introduce two variants of the solver: the Riccati-HJ-sampler, based on the Riccati method, and the SGM-HJ-sampler, which utilizes diffusion models. We demonstrate the effectiveness and flexibility of the proposed methods by applying them to solve Bayesian inverse problems involving various stochastic processes and prior distributions, including applications that address model misspecifications and quantifying model uncertainty.
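
For reference, the log (Cole-Hopf) transform the abstract builds on, in its classical form linking the heat equation to the viscous Hamilton-Jacobi equation; this is the standard statement, not the paper's generalized linear-operator framework.

```latex
% Classical Cole-Hopf (log) transform: a solution of the heat equation yields a
% solution of the viscous Hamilton-Jacobi equation.
\partial_t \varphi = \nu \, \Delta \varphi
\quad\Longrightarrow\quad
S := -2\nu \log \varphi \ \text{ solves } \
\partial_t S + \tfrac{1}{2}\, |\nabla S|^2 = \nu \, \Delta S .
```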

[LG-97] Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition

链接: https://arxiv.org/abs/2409.09611
作者: Cagri Gungor,Adriana Kovashka
关键词-EN: First-person activity recognition, rapidly growing due, First-person activity, background scenes, rapidly growing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.

[LG-98] Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison

链接: https://arxiv.org/abs/2409.09603
作者: Judy Hanwen Shen,Archit Sharma,Jun Qin
关键词-EN: aligning language models, goal of aligning, aligning language, preferences requires data, human preferences requires
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Working Paper

点击查看摘要

Abstract:The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money can be spent carefully collecting and tailoring bespoke preference data to each downstream application. However, in practice, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no existing efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.

[LG-99] Open-World Test-Time Training: Self-Training with Contrast Learning

链接: https://arxiv.org/abs/2409.09591
作者: Houcheng Su,Mengzhu Wang,Jiao Li,Bingli Wang,Daixian Liu,Zeheng Wang
关键词-EN: Traditional test-time training, consistent class set, real-world scenarios characterized, addressing domain shifts, Traditional test-time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10page

点击查看摘要

Abstract:Traditional test-time training (TTT) methods, while addressing domain shifts, often assume a consistent class set, limiting their applicability in real-world scenarios characterized by infinite variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (OOD) data. Existing TTT methods often struggle to maintain performance when confronted with strong OOD data. In OWTTT, the focus has predominantly been on distinguishing between overall strong and weak OOD data. However, during the early stages of TTT, initial feature extraction is hampered by interference from strong OOD and corruptions, resulting in diminished contrast and premature classification of certain classes as strong OOD. To address this, we introduce Open World Dynamic Contrastive Learning (OWDCL), an innovative approach that utilizes contrastive learning to augment positive sample pairs. This strategy not only bolsters contrast in the early stages but also significantly enhances model robustness in subsequent stages. In comparison datasets, our OWDCL model has produced the most advanced performance.

[LG-100] Bias Begets Bias: The Impact of Biased Embeddings on Diffusion Models

链接: https://arxiv.org/abs/2409.09569
作者: Sahil Kuchlous,Marvin Li,Jeffrey G. Wang
关键词-EN: increased scrutiny, growing adoption, diffusion models, models, diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:With the growing adoption of Text-to-Image (TTI) systems, the social biases of these models have come under increased scrutiny. Herein we conduct a systematic investigation of one such source of bias for diffusion models: embedding spaces. First, because traditional classifier-based fairness definitions require true labels not present in generative modeling, we propose statistical group fairness criteria based on a model’s internal representation of the world. Using these definitions, we demonstrate theoretically and empirically that an unbiased text embedding space for input prompts is a necessary condition for representationally balanced diffusion models, meaning the distribution of generated images satisfy diversity requirements with respect to protected attributes. Next, we investigate the impact of biased embeddings on evaluating the alignment between generated images and prompts, a process which is commonly used to assess diffusion models. We find that biased multimodal embeddings like CLIP can result in lower alignment scores for representationally balanced TTI models, thus rewarding unfair behavior. Finally, we develop a theoretical framework through which biases in alignment evaluation can be studied and propose bias mitigation methods. By specifically adapting the perspective of embedding spaces, we establish new fairness conditions for diffusion model development and evaluation.

[LG-101] Evaluating authenticity and quality of image captions via sentiment and semantic analyses

链接: https://arxiv.org/abs/2409.09560
作者: Aleksei Krotov,Alison Tebo,Dylan K. Picart,Aaron Dean Algave
关键词-EN: natural language processing, relies heavily, growth of deep, heavily on huge, huge amounts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growth of deep learning (DL) relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision. Specifically, in image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions. Additionally, learning may be affected by the variety and diversity of the provided captions. While labelling large datasets has largely relied on crowd-sourcing or data-worker pools, evaluating the quality of such training data is crucial. This study proposes an evaluation method focused on sentiment and semantic richness. That method was applied to the COCO-MS dataset, comprising approximately 150K images with segmented objects and corresponding crowd-sourced captions. We employed pre-trained models (Twitter-RoBERTa-base and BERT-base) to extract sentiment scores and variability of semantic embeddings from captions. The relation of the sentiment score and semantic variability with object categories was examined using multiple linear regression. Results indicate that while most captions were neutral, about 6% of the captions exhibited strong sentiment influenced by specific object categories. Semantic variability of within-image captions remained low and uncorrelated with object categories. Model-generated captions showed less than 1.5% of strong sentiment which was not influenced by object categories and did not correlate with the sentiment of the respective human-generated captions. This research demonstrates an approach to assess the quality of crowd- or worker-sourced captions informed by image content.
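
For readers who want to reproduce the general idea, a minimal sketch of caption sentiment scoring with a pre-trained Twitter-RoBERTa checkpoint via the Hugging Face `transformers` pipeline is shown below; the exact checkpoint name and the example captions are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch: scoring caption sentiment with a pre-trained Twitter-RoBERTa model.
# The checkpoint name below is an assumption for illustration; the paper's exact
# checkpoint, preprocessing, and score aggregation may differ.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

captions = [
    "A dog joyfully catches a frisbee in the park.",   # illustrative captions only
    "A person stands next to a parked car.",
]

for caption in captions:
    result = sentiment(caption)[0]   # e.g. {'label': 'positive', 'score': 0.97}
    print(f"{caption} -> {result['label']} ({result['score']:.3f})")
```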

[LG-102] A Statistical Viewpoint on Differential Privacy: Hypothesis Testing, Representation and Blackwell's Theorem

链接: https://arxiv.org/abs/2409.09558
作者: Weijie J. Su
关键词-EN: increasingly broad adoption, Differential privacy, privacy, Differential, rigorous guarantees
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: To appear in Annual Review of Statistics and Its Application

点击查看摘要

Abstract:Differential privacy is widely considered the formal privacy for privacy-preserving data analysis due to its robust and rigorous guarantees, with increasingly broad adoption in public services, academia, and industry. Despite originating in the cryptographic context, in this review paper we argue that, fundamentally, differential privacy can be considered a pure statistical concept. By leveraging a theorem due to David Blackwell, our focus is to demonstrate that the definition of differential privacy can be formally motivated from a hypothesis testing perspective, thereby showing that hypothesis testing is not merely convenient but also the right language for reasoning about differential privacy. This insight leads to the definition of f-differential privacy, which extends other differential privacy definitions through a representation theorem. We review techniques that render f-differential privacy a unified framework for analyzing privacy bounds in data analysis and machine learning. Applications of this differential privacy definition to private deep learning, private convex optimization, shuffled mechanisms, and U.S. Census data are discussed to highlight the benefits of analyzing privacy bounds under this framework compared to existing alternatives.
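
As background for the abstract above, the standard trade-off-function formulation of f-differential privacy (due to Dong, Roth and Su) can be written as follows; this is a context-setting sketch of the definition, not a restatement of the paper's results.

```latex
% Trade-off function between output distributions P and Q,
% where \phi ranges over rejection rules (tests):
\[
  T(P,Q)(\alpha) \;=\; \inf_{\phi}\left\{\, 1 - \mathbb{E}_{Q}[\phi] \;:\; \mathbb{E}_{P}[\phi] \le \alpha \,\right\}.
\]
% A randomized mechanism M is f-differentially private if
\[
  T\bigl(M(S),\, M(S')\bigr) \;\ge\; f
  \quad\text{for all neighboring datasets } S,\, S'.
\]
```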

[LG-103] Enhancing Printed Circuit Board Defect Detection through Ensemble Learning

链接: https://arxiv.org/abs/2409.09555
作者: Ka Nam Canaan Law,Mingshuo Yu,Lianglei Zhang,Yiyi Zhang,Peng Xu,Jerry Gao,Jun Liu
关键词-EN: printed circuit boards, electronic device technology, advancing electronic device, circuit boards, device technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The quality control of printed circuit boards (PCBs) is paramount in advancing electronic device technology. While numerous machine learning methodologies have been utilized to augment defect detection efficiency and accuracy, previous studies have predominantly focused on optimizing individual models for specific defect types, often overlooking the potential synergies between different approaches. This paper introduces a comprehensive inspection framework leveraging an ensemble learning strategy to address this gap. Initially, we utilize four distinct PCB defect detection models utilizing state-of-the-art methods: EfficientDet, MobileNet SSDv2, Faster RCNN, and YOLOv5. Each method is capable of identifying PCB defects independently. Subsequently, we integrate these models into an ensemble learning framework to enhance detection performance. A comparative analysis reveals that our ensemble learning framework significantly outperforms individual methods, achieving a 95% accuracy in detecting diverse PCB defects. These findings underscore the efficacy of our proposed ensemble learning framework in enhancing PCB quality control processes.

[LG-104] COMFORT: A Continual Fine-Tuning Framework for Foundation Models Targeted at Consumer Healthcare

链接: https://arxiv.org/abs/2409.09549
作者: Chia-Hao Li,Niraj K. Jha
关键词-EN: Wearable medical sensors, Wearable medical, revolutionizing smart healthcare, medical sensors, enabling continuous
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 25 pages, 10 figures. This work has been submitted to the ACM for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Wearable medical sensors (WMSs) are revolutionizing smart healthcare by enabling continuous, real-time monitoring of user physiological signals, especially in the field of consumer healthcare. The integration of WMSs and modern machine learning (ML) enables unprecedented solutions to efficient early-stage disease detection. Despite the success of Transformers in various fields, their application to sensitive domains, such as smart healthcare, remains underexplored due to limited data accessibility and privacy concerns. To bridge the gap between Transformer-based foundation models and WMS-based disease detection, we propose COMFORT, a continual fine-tuning framework for foundation models targeted at consumer healthcare. COMFORT introduces a novel approach for pre-training a Transformer-based foundation model on a large dataset of physiological signals exclusively collected from healthy individuals with commercially available WMSs. We adopt a masked data modeling (MDM) objective to pre-train this health foundation model. We then fine-tune the model using various parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA) and its variants, to adapt it to various downstream disease detection tasks that rely on WMS data. In addition, COMFORT continually stores the low-rank decomposition matrices obtained from the PEFT algorithms to construct a library for multi-disease detection. The COMFORT library enables scalable and memory-efficient disease detection on edge devices. Our experimental results demonstrate that COMFORT achieves highly competitive performance while reducing memory overhead by up to 52% relative to conventional methods. Thus, COMFORT paves the way for personalized and proactive solutions to efficient and effective early-stage disease detection for consumer healthcare.

[LG-105] Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment

链接: https://arxiv.org/abs/2409.09545
作者: Ohad Cohen,Gershon Hazan,Sharon Gannot
关键词-EN: Multi-modal Emotion Recognition, enhance emotion recognition, emotion recognition accuracy, Emotion Recognition, Multi-modal Emotion
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions. Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Network (CNN) model for video analysis. We evaluate our proposed method on a reverberated version of the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset using synthetic and real-world Room Impulse Responses (RIRs). Our results demonstrate that integrating audio and video modalities yields superior performance compared to uni-modal approaches, especially in challenging acoustic conditions. Moreover, we show that the multimodal (audiovisual) approach that utilizes multiple microphones outperforms its single-microphone counterpart.

[LG-106] Autonomous Goal Detection and Cessation in Reinforcement Learning: A Case Study on Source Term Estimation

链接: https://arxiv.org/abs/2409.09541
作者: Yiwei Shi,Muning Wen,Qi Zhang,Weinan Zhang,Cunjia Liu,Weiru Liu
关键词-EN: Reinforcement Learning, revolutionized decision-making processes, clear feedback signals, Source Term Estimation, Learning has revolutionized
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning has revolutionized decision-making processes in dynamic environments, yet it often struggles with autonomously detecting and achieving goals without clear feedback signals. For example, in a Source Term Estimation problem, the lack of precise environmental information makes it challenging to provide clear feedback signals and to define and evaluate how the source’s location is determined. To address this challenge, the Autonomous Goal Detection and Cessation (AGDC) module was developed, enhancing various RL algorithms by incorporating a self-feedback mechanism for autonomous goal detection and cessation upon task completion. Our method effectively identifies and ceases undefined goals by approximating the agent’s belief, significantly enhancing the capabilities of RL algorithms in environments with limited feedback. To validate effectiveness of our approach, we integrated AGDC with deep Q-Network, proximal policy optimization, and deep deterministic policy gradient algorithms, and evaluated its performance on the Source Term Estimation problem. The experimental results showed that AGDC-enhanced RL algorithms significantly outperformed traditional statistical methods such as infotaxis, entrotaxis, and dual control for exploitation and exploration, as well as a non-statistical random action selection method. These improvements were evident in terms of success rate, mean traveled distance, and search time, highlighting AGDC’s effectiveness and efficiency in complex, real-world scenarios.

[LG-107] Deep Fast Machine Learning Utils: A Python Library for Streamlined Machine Learning Prototyping

链接: https://arxiv.org/abs/2409.09537
作者: Fabi Prezja
关键词-EN: Machine Learning Utils, Fast Machine Learning, involve time-consuming steps, Deep Fast Machine, model architecture prototyping
类目: Machine Learning (cs.LG)
*备注: 9 pages, 1 figure

点击查看摘要

Abstract:Machine learning (ML) research and application often involve time-consuming steps such as model architecture prototyping, feature selection, and dataset preparation. To support these tasks, we introduce the Deep Fast Machine Learning Utils (DFMLU) library, which provides tools designed to automate and enhance aspects of these processes. Compatible with frameworks like TensorFlow, Keras, and Scikit-learn, DFMLU offers functionalities that support model development and data handling. The library includes methods for dense neural network search, advanced feature selection, and utilities for data management and visualization of training outcomes. This manuscript presents an overview of DFMLU’s functionalities, providing Python examples for each tool.

[LG-108] Using Synthetic Data to Mitigate Unfairness and Preserve Privacy through Single-Shot Federated Learning

链接: https://arxiv.org/abs/2409.09532
作者: Chia-Yuan Wu,Frank E. Curtis,Daniel P. Robinson
关键词-EN: contemporary approaches typically, frequent model parameter, model parameter updates, address unfairness issues, contemporary approaches
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:To address unfairness issues in federated learning (FL), contemporary approaches typically use frequent model parameter updates and transmissions between the clients and server. In such a process, client-specific information (e.g., local dataset size or data-related fairness metrics) must be sent to the server to compute, e.g., aggregation weights. All of this results in high transmission costs and the potential leakage of client information. As an alternative, we propose a strategy that promotes fair predictions across clients without the need to pass information between the clients and server iteratively and prevents client data leakage. For each client, we first use their local dataset to obtain a synthetic dataset by solving a bilevel optimization problem that addresses unfairness concerns during the learning process. We then pass each client’s synthetic dataset to the server, the collection of which is used to train the server model using conventional machine learning techniques (that do not take fairness metrics into account). Thus, we eliminate the need to handle fairness-specific aggregation weights while preserving client privacy. Our approach requires only a single communication between the clients and the server, thus making it computationally cost-effective, able to maintain privacy, and able to ensure fairness. We present empirical evidence to demonstrate the advantages of our approach. The results illustrate that our method effectively uses synthetic data as a means to mitigate unfairness and preserve client privacy.

[LG-109] Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens AAAI

链接: https://arxiv.org/abs/2409.09513
作者: Joseph Clinton,Robert Lieck
关键词-EN: Supervised learning approaches, Decision Transformer, offline reinforcement learning, Supervised learning, utilizing the Decision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 5 figures, Submitted to AAAI

点击查看摘要

Abstract:Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent’s future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model’s policy through the interpretable plan visualisations and attention map.

[LG-110] MALADY: Multiclass Active Learning with Auction Dynamics on Graphs

链接: https://arxiv.org/abs/2409.09475
作者: Gokul Bhusal,Kevin Miller,Ekaterina Merkurjev
关键词-EN: unlabeled data points, Active learning enhances, Multiclass Active Learning, auction dynamics algorithm, Active learning
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Active learning enhances the performance of machine learning methods, particularly in semi-supervised cases, by judiciously selecting a limited number of unlabeled data points for labeling, with the goal of improving the performance of an underlying classifier. In this work, we introduce the Multiclass Active Learning with Auction Dynamics on Graphs (MALADY) framework which leverages the auction dynamics algorithm on similarity graphs for efficient active learning. In particular, we generalize the auction dynamics algorithm on similarity graphs for semi-supervised learning in [24] to incorporate a more general optimization functional. Moreover, we introduce a novel active learning acquisition function that uses the dual variable of the auction algorithm to measure the uncertainty in the classifier to prioritize queries near the decision boundaries between different classes. Lastly, using experiments on classification tasks, we evaluate the performance of our proposed method and show that it exceeds that of comparison algorithms.
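
The abstract's workflow (a graph-based semi-supervised classifier plus an uncertainty-driven acquisition near decision boundaries) can be illustrated with the generic loop below. Note that sklearn's LabelSpreading and a simple margin-uncertainty score stand in for the paper's auction-dynamics classifier and its dual-variable acquisition function, which are not reproduced here.

```python
# Generic graph-based active-learning loop, shown only to illustrate the workflow.
# sklearn's LabelSpreading and a margin-uncertainty score stand in for the paper's
# auction-dynamics classifier and its dual-variable acquisition function.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)
y = np.full(len(X), -1)                                  # -1 marks unlabeled points
rng = np.random.RandomState(0)
seed_idx = rng.choice(len(X), size=6, replace=False)     # small initial labeled set
y[seed_idx] = y_true[seed_idx]

for _ in range(10):                                      # query budget of 10 points
    model = LabelSpreading(kernel="rbf", gamma=20).fit(X, y)
    proba = np.sort(model.label_distributions_, axis=1)
    margin = proba[:, -1] - proba[:, -2]                 # small margin = near a boundary
    margin[y != -1] = np.inf                             # never re-query labeled points
    query = int(np.argmin(margin))                       # most uncertain unlabeled point
    y[query] = y_true[query]                             # oracle provides the label

model = LabelSpreading(kernel="rbf", gamma=20).fit(X, y)
print("transductive accuracy:", (model.transduction_ == y_true).mean())
```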

[LG-111] Learning to enhance multi-legged robot on rugged landscapes ICRA2025

链接: https://arxiv.org/abs/2409.09473
作者: Juntao He,Baxi Chong,Zhaochen Xu,Sehoon Ha,Daniel I. Goldman
关键词-EN: Navigating rugged landscapes, rugged landscapes poses, landscapes poses significant, poses significant challenges, Navigating rugged
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:Navigating rugged landscapes poses significant challenges for legged locomotion. Multi-legged robots (those with 6 and greater) offer a promising solution for such terrains, largely due to their inherent high static stability, resulting from a low center of mass and wide base of support. Such systems require minimal effort to maintain balance. Recent studies have shown that a linear controller, which modulates the vertical body undulation of a multi-legged robot in response to shifts in terrain roughness, can ensure reliable mobility on challenging terrains. However, the potential of a learning-based control framework that adjusts multiple parameters to address terrain heterogeneity remains underexplored. We posit that the development of an experimentally validated physics-based simulator for this robot can rapidly advance capabilities by allowing wide parameter space exploration. Here we develop a MuJoCo-based simulator tailored to this robotic platform and use the simulation to develop a reinforcement learning-based control framework that dynamically adjusts horizontal and vertical body undulation, and limb stepping in real-time. Our approach improves robot performance in simulation, laboratory experiments, and outdoor tests. Notably, our real-world experiments reveal that the learning-based controller achieves a 30% to 50% increase in speed compared to a linear controller, which only modulates vertical body waves. We hypothesize that the superior performance of the learning-based controller arises from its ability to adjust multiple parameters simultaneously, including limb stepping, horizontal body wave, and vertical body wave.

[LG-112] TX-Gen: Multi-Objective Optimization for Sparse Counterfactual Explanations for Time-Series Classification

链接: https://arxiv.org/abs/2409.09461
作者: Qi Huang,Sofoklis Kitharidis,Thomas Bäck,Niki van Stein
关键词-EN: understanding model decisions, healthcare and finance, decisions is crucial, application in high-stakes, high-stakes domains
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: Preprint, under review

点击查看摘要

Abstract:In time-series classification, understanding model decisions is crucial for their application in high-stakes domains such as healthcare and finance. Counterfactual explanations, which provide insights by presenting alternative inputs that change model predictions, offer a promising solution. However, existing methods for generating counterfactual explanations for time-series data often struggle with balancing key objectives like proximity, sparsity, and validity. In this paper, we introduce TX-Gen, a novel algorithm for generating counterfactual explanations based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II). TX-Gen leverages evolutionary multi-objective optimization to find a diverse set of counterfactuals that are both sparse and valid, while maintaining minimal dissimilarity to the original time series. By incorporating a flexible reference-guided mechanism, our method improves the plausibility and interpretability of the counterfactuals without relying on predefined assumptions. Extensive experiments on benchmark datasets demonstrate that TX-Gen outperforms existing methods in generating high-quality counterfactuals, making time-series models more transparent and interpretable.

[LG-113] On the Generalizability of Foundation Models for Crop Type Mapping

链接: https://arxiv.org/abs/2409.09451
作者: Yi-Chia Chang,Adam J. Stewart,Favyen Bastani,Piper Wolters,Shreya Kannan,George R. Huber,Jingtong Wang,Arindam Banerjee
关键词-EN: including language understanding, Foundation models pre-trained, shown powerful transfer, Foundation models, text generation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models pre-trained using self-supervised and weakly-supervised learning have shown powerful transfer learning capabilities on various downstream tasks, including language understanding, text generation, and image recognition. Recently, the Earth observation (EO) field has produced several foundation models pre-trained directly on multispectral satellite imagery (e.g., Sentinel-2) for applications like precision agriculture, wildfire and drought monitoring, and natural disaster response. However, few studies have investigated the ability of these models to generalize to new geographic locations, and potential concerns of geospatial bias – models trained on data-rich developed countries not transferring well to data-scarce developing countries – remain. We investigate the ability of popular EO foundation models to transfer to new geographic regions in the agricultural domain, where differences in farming practices and class imbalance make transfer learning particularly challenging. We first select six crop classification datasets across five continents, normalizing for dataset size and harmonizing classes to focus on four major cereal grains: maize, soybean, rice, and wheat. We then compare three popular foundation models, pre-trained on SSL4EO-S12, SatlasPretrain, and ImageNet, using in-distribution (ID) and out-of-distribution (OOD) evaluation. Experiments show that pre-trained weights designed explicitly for Sentinel-2, such as SSL4EO-S12, outperform general pre-trained weights like ImageNet. Furthermore, the benefits of pre-training on OOD data are the most significant when only 10–100 ID training samples are used. Transfer learning and pre-training with OOD and limited ID data show promising applications, as many developing regions have scarce crop type labels. All harmonized datasets and experimental code are open-source and available for download.

[LG-114] PIP-Loco: A Proprioceptive Infinite Horizon Planning Framework for Quadrupedal Robot Locomotion

链接: https://arxiv.org/abs/2409.09441
作者: Aditya Shirwatkar,Naman Saxena,Kishore Chandra,Shishir Kolathaya
关键词-EN: Model Predictive Control, Predictive Control, Model Predictive, core strength, provide interpretability
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Preprint under review

点击查看摘要

Abstract:A core strength of Model Predictive Control (MPC) for quadrupedal locomotion has been its ability to enforce constraints and provide interpretability of the sequence of commands over the horizon. However, despite being able to plan, MPC struggles to scale with task complexity, often failing to achieve robust behavior on rapidly changing surfaces. On the other hand, model-free Reinforcement Learning (RL) methods have outperformed MPC on multiple terrains, showing emergent motions but inherently lack any ability to handle constraints or perform planning. To address these limitations, we propose a framework that integrates proprioceptive planning with RL, allowing for agile and safe locomotion behaviors through the horizon. Inspired by MPC, we incorporate an internal model that includes a velocity estimator and a Dreamer module. During training, the framework learns an expert policy and an internal model that are co-dependent, facilitating exploration for improved locomotion behaviors. During deployment, the Dreamer module solves an infinite-horizon MPC problem, adapting actions and velocity commands to respect the constraints. We validate the robustness of our training framework through ablation studies on internal model components and demonstrate improved robustness to training noise. Finally, we evaluate our approach across multi-terrain scenarios in both simulation and hardware.

[LG-115] Distributed Clustering based on Distributional Kernel

链接: https://arxiv.org/abs/2409.09418
作者: Hang Zhang,Yang Xu,Lei Gong,Ye Zhu,Kai Ming Ting
关键词-EN: final clusters based, Distributed Clustering based, clustering, produces the final, similarity with respect
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces a new framework for clustering in a distributed network called Distributed Clustering based on Distributional Kernel (K) or KDC that produces the final clusters based on the similarity with respect to the distributions of initial clusters, as measured by K. It is the only framework that satisfies all three of the following properties. First, KDC guarantees that the combined clustering outcome from all sites is equivalent to the clustering outcome of its centralized counterpart from the combined dataset from all sites. Second, the maximum runtime cost of any site in distributed mode is smaller than the runtime cost in centralized mode. Third, it is designed to discover clusters of arbitrary shapes, sizes and densities. To the best of our knowledge, this is the first distributed clustering framework that employs a distributional kernel. The distribution-based clustering leads directly to significantly better clustering outcomes than existing methods of distributed clustering. In addition, we introduce a new clustering algorithm called Kernel Bounded Cluster Cores, which is the best clustering algorithm applied to KDC among existing clustering algorithms. We also show that KDC is a generic framework that enables a quadratic time clustering algorithm to deal with large datasets that would otherwise be impossible.

[LG-116] Enhancing LLM Problem Solving with REAP: Reflection, Explicit Problem Deconstruction, and Advanced Prompting

链接: https://arxiv.org/abs/2409.09415
作者: Ryan Lingo,Martin Arroyo,Rajeev Chhajer
关键词-EN: Large Language Models, natural language processing, transformed natural language, Large Language, Explicit Problem Deconstruction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 524 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed natural language processing, yet improving their problem-solving capabilities, particularly for complex, reasoning-intensive tasks, remains a persistent challenge. This paper introduces the REAP (Reflection, Explicit Problem Deconstruction, and Advanced Prompting) method, an innovative approach within the dynamic context generation framework. REAP guides LLMs through reflection on the query, deconstructing it into manageable components, and generating relevant context to enhance the solution process. We evaluated REAP using a dataset designed to expose LLM limitations, comparing zero-shot prompting with REAP-enhanced prompts across six state-of-the-art models: OpenAI’s o1-preview, o1-mini, GPT-4o, GPT-4o-mini, Google’s Gemini 1.5 Pro, and Claude 3.5 Sonnet. The results demonstrate notable performance gains, with o1-mini improving by 40.97%, GPT-4o by 66.26%, and GPT-4o-mini by 112.93%. Despite the already strong baseline performance of OpenAI’s o1-preview, modest gains were observed. Beyond performance improvements, REAP offers a cost-effective solution; for example, GPT-4o-mini, which is approximately 100 times cheaper than o1-preview, delivered competitive results. REAP also improves the clarity of model outputs, making it easier for humans to understand the reasoning behind the results and simplifying the process of identifying and addressing any issues. These findings demonstrate REAP’s potential to greatly improve the capabilities of LLMs, providing both better performance and increased cost-efficiency across a wide range of applications.
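
The three stages named by REAP can be made concrete with a toy prompt builder like the one below; the wording is purely illustrative and does not reproduce the prompts used in the paper.

```python
# Toy sketch of assembling a REAP-style prompt around a user query.
# The stage descriptions are invented for illustration, not the paper's prompts.
def build_reap_prompt(query: str) -> str:
    return (
        "You will solve a problem in three stages.\n"
        "1) Reflection: restate the problem and note hidden assumptions.\n"
        "2) Explicit problem deconstruction: break it into ordered sub-problems.\n"
        "3) Advanced prompting: solve each sub-problem, generating any context you "
        "need, then combine the partial results into a final answer.\n\n"
        f"Problem:\n{query}\n"
    )

print(build_reap_prompt("A train leaves the station at 9:00 ..."))
```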

[LG-117] Weather Prediction Using CNN-LSTM for Time Series Analysis: A Case Study on Delhi Temperature Data

链接: https://arxiv.org/abs/2409.09414
作者: Bangyu Li,Yang Qian
关键词-EN: climate change intensifies, global climate change, accurate weather forecasting, energy management, change intensifies
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:As global climate change intensifies, accurate weather forecasting is increasingly crucial for sectors such as agriculture, energy management, and environmental protection. Traditional methods, which rely on physical and statistical models, often struggle with complex, nonlinear, and time-varying data, underscoring the need for more advanced techniques. This study explores a hybrid CNN-LSTM model to enhance temperature forecasting accuracy for the Delhi region, using historical meteorological data from 1996 to 2017. We employed both direct and indirect methods, including comprehensive data preprocessing and exploratory analysis, to construct and train our model. The CNN component effectively extracts spatial features, while the LSTM captures temporal dependencies, leading to improved prediction accuracy. Experimental results indicate that the CNN-LSTM model significantly outperforms traditional forecasting methods in terms of both accuracy and stability, with a mean square error (MSE) of 3.26217 and a root mean square error (RMSE) of 1.80615. The hybrid model demonstrates its potential as a robust tool for temperature prediction, offering valuable insights for meteorological forecasting and related fields. Future research should focus on optimizing model architecture, exploring additional feature extraction techniques, and addressing challenges such as overfitting and computational complexity. This approach not only advances temperature forecasting but also provides a foundation for applying deep learning to other time series forecasting tasks.
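
A minimal Keras sketch of the hybrid architecture described above is given below; the window length, feature count, and layer sizes are illustrative assumptions rather than the configuration used in the study.

```python
# Illustrative CNN-LSTM for temperature forecasting; sizes are placeholder choices.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window, n_features = 30, 4          # e.g. 30 past days, 4 meteorological variables

model = keras.Sequential([
    keras.Input(shape=(window, n_features)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),   # local temporal patterns
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                        # longer-range dependencies
    layers.Dense(1),                                        # next-step temperature
])
model.compile(optimizer="adam", loss="mse")

# Dummy arrays with the expected shapes, just to show the training call.
X = np.random.rand(256, window, n_features).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```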

[LG-118] Real-world Adversarial Defense against Patch Attacks based on Diffusion Model

链接: https://arxiv.org/abs/2409.09406
作者: Xingxing Wei,Caixin Kang,Yinpeng Dong,Zhengyi Wang,Shouwei Ruan,Yubo Chen,Hang Su
关键词-EN: deep learning models, diffusion model, Adversarial patches present, present significant challenges, Adversarial Anomaly Perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial patches present significant challenges to the robustness of deep learning models, making the development of effective defenses become critical for real-world applications. This paper introduces DIFFender, a novel DIFfusion-based DeFender framework that leverages the power of a text-guided diffusion model to counter adversarial patch attacks. At the core of our approach is the discovery of the Adversarial Anomaly Perception (AAP) phenomenon, which enables the diffusion model to accurately detect and locate adversarial patches by analyzing distributional anomalies. DIFFender seamlessly integrates the tasks of patch localization and restoration within a unified diffusion model framework, enhancing defense efficacy through their close interaction. Additionally, DIFFender employs an efficient few-shot prompt-tuning algorithm, facilitating the adaptation of the pre-trained diffusion model to defense tasks without the need for extensive retraining. Our comprehensive evaluation, covering image classification and face recognition tasks, as well as real-world scenarios, demonstrates DIFFender’s robust performance against adversarial attacks. The framework’s versatility and generalizability across various settings, classifiers, and attack methodologies mark a significant advancement in adversarial patch defense strategies. Except for the popular visible domain, we have identified another advantage of DIFFender: its capability to easily expand into the infrared domain. Consequently, we demonstrate the good flexibility of DIFFender, which can defend against both infrared and visible adversarial patch attacks alternatively using a universal defense framework.

[LG-119] LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach

链接: https://arxiv.org/abs/2409.09383
作者: Kunlong Chen,Junjun Wang,Zhaoqun Chen,Kunjin Chen,Yitian Chen
关键词-EN: KDD CUP, source tracing competition, paper source tracing, tracing competition, source tracing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We participated in the KDD CUP 2024 paper source tracing competition and achieved the 3rd place. This competition tasked participants with identifying the reference sources (i.e., ref-sources, as referred to by the organizers of the competition) of given academic papers. Unlike most teams that addressed this challenge by fine-tuning pre-trained neural language models such as BERT or ChatGLM, our primary approach utilized closed-source large language models (LLMs). With recent advancements in LLM technology, closed-source LLMs have demonstrated the capability to tackle complex reasoning tasks in zero-shot or few-shot scenarios. Consequently, in the absence of GPUs, we employed closed-source LLMs to directly generate predicted reference sources from the provided papers. We further refined these predictions through ensemble learning. Notably, our method was the only one among the award-winning approaches that did not require the use of GPUs for model training. Code available at this https URL.
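
At a high level, the GPU-free recipe above amounts to querying one or more closed-source LLMs and merging their candidate reference sources, for example by voting. The sketch below is a generic illustration under that assumption; `call_llm` is a hypothetical placeholder rather than a real client, and the prompt text is invented.

```python
# Generic "LLM + ensemble" sketch for paper source tracing; not the team's code.
from collections import Counter

def call_llm(prompt: str, model_name: str) -> list[str]:
    """Placeholder: send the paper text plus instructions to a hosted LLM and
    parse its answer into a list of candidate reference-source IDs."""
    raise NotImplementedError

def ensemble_ref_sources(paper_text: str, models: list[str], top_k: int = 5) -> list[str]:
    votes = Counter()
    for name in models:
        prompt = f"Identify the reference sources of the following paper:\n{paper_text}"
        for ref in call_llm(prompt, name):
            votes[ref] += 1                  # simple majority-style voting
    return [ref for ref, _ in votes.most_common(top_k)]
```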

[LG-120] BM^2: Coupled Schrödinger Bridge Matching

链接: https://arxiv.org/abs/2409.09376
作者: Stefano Peluchetti
关键词-EN: optimal transport problem, entropic optimal transport, dynamic transport map, Schrödinger bridge establishes, transport problem
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A Schrödinger bridge establishes a dynamic transport map between two target distributions via a reference process, simultaneously solving an associated entropic optimal transport problem. We consider the setting where samples from the target distributions are available, and the reference diffusion process admits tractable dynamics. We thus introduce Coupled Bridge Matching (BM^2), a simple non-iterative approach for learning Schrödinger bridges with neural networks. A preliminary theoretical analysis of the convergence properties of BM^2 is carried out, supported by numerical experiments that demonstrate the effectiveness of our proposal.

[LG-121] Beta-Sigma VAE: Separating beta and decoder variance in Gaussian variational autoencoder ICPR2024

链接: https://arxiv.org/abs/2409.09361
作者: Seunghwan Kim,Seungkyu Lee
关键词-EN: Variational autoencoder, established generative model, established generative, beta, VAE
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Accepted for ICPR 2024

点击查看摘要

Abstract:Variational autoencoder (VAE) is an established generative model but is notorious for its blurriness. In this work, we investigate the blurry output problem of VAE and resolve it, exploiting the variance of Gaussian decoder and β of beta-VAE. Specifically, we reveal that the indistinguishability of decoder variance and β hinders appropriate analysis of the model by random likelihood value, and limits performance improvement by omitting the gain from β. To address the problem, we propose Beta-Sigma VAE (BS-VAE) that explicitly separates β and decoder variance σ²_x in the model. Our method demonstrates not only superior performance in natural image synthesis but also controllable parameters and predictable analysis compared to conventional VAE. In our experimental evaluation, we employ the analysis of rate-distortion curve and proxy metrics on computer vision datasets. The code is available on this https URL
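
The key idea, keeping β and the Gaussian decoder variance σ²_x as two separate, explicit quantities in the objective, can be sketched in PyTorch as below; the exact parameterization used in BS-VAE may differ, so treat this as an illustrative approximation.

```python
# Sketch of a Gaussian-decoder VAE objective with beta and the decoder variance
# sigma_x**2 kept explicitly separate (illustrative; not the paper's exact form).
import math
import torch

def bs_vae_loss(x, x_hat, mu, logvar, sigma_x: float = 0.1, beta: float = 1.0):
    # Gaussian reconstruction NLL with an explicit decoder variance sigma_x**2.
    var = sigma_x ** 2
    recon_nll = 0.5 * (((x - x_hat) ** 2) / var + math.log(var) + math.log(2 * math.pi))
    recon_nll = recon_nll.flatten(1).sum(dim=1)

    # KL divergence between q(z|x) = N(mu, exp(logvar)) and the standard normal prior.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)

    # Keeping beta separate from sigma_x lets their individual effects be analyzed.
    return (recon_nll + beta * kl).mean()
```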

[LG-122] Symbolic Regression with a Learned Concept Library

链接: https://arxiv.org/abs/2409.09359
作者: Arya Grayeli,Atharva Sehgal,Omar Costilla-Reyes,Miles Cranmer,Swarat Chaudhuri
关键词-EN: compact programmatic hypotheses, symbolic regression, explain a dataset, searching for compact, compact programmatic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
*备注: preprint version; 10 pages

点击查看摘要

Abstract:We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a novel and powerful scaling law for LLMs.

[LG-123] Overcoming linguistic barriers in code assistants: creating a QLoRA adapter to improve support for Russian-language code writing instructions

链接: https://arxiv.org/abs/2409.09353
作者: C. B. Pronin,A. V. Volosova,A. V. Ostroukh,Yu. N. Strogov
关键词-EN: popular language model, Russian language, base model, model, Russian
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:In this paper, an approach to training and evaluating an adapter model for the popular language model “zephyr-7b-beta” is described. The adapter was developed to improve the performance of the base model in tasks related to programming and understanding the Russian language. Considering the high quality of the original model in tasks in the English language, the goal of the research was to expand its linguistic and technical spectrum. The proposed adapter was trained using a large and diverse dataset, including question-answer pairs related to programming, as well as code-related texts in the Russian language. The applied training methodology ensures an improvement in the model’s quality of answers in understanding and generating Python code based on Russian instructions. We evaluated the performance of the base model with the installed adapter using various metrics, comparing it to the base model as well as other state-of-the-art models in this field. The obtained results showed significant improvement, both in tasks related to writing Python code and in processing the Russian language, confirming the effectiveness of the proposed adapter.
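
For orientation, a minimal sketch of attaching a QLoRA-style adapter to zephyr-7b-beta with the `transformers` and `peft` libraries is shown below; the rank, target modules, and 4-bit settings are illustrative assumptions, not the authors' configuration.

```python
# Minimal QLoRA-style setup sketch; hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=bnb_config,
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # assumed; depends on the chosen setup
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
# Supervised fine-tuning on Russian instruction/code pairs would then update only
# the adapter weights, e.g. with a standard trainer loop.
```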

[LG-124] Schrödinger Bridge Flow for Unpaired Data Translation

链接: https://arxiv.org/abs/2409.09347
作者: Valentin De Bortoli,Iryna Korshunova,Andriy Mnih,Arnaud Doucet
关键词-EN: Mass transport problems, Generative Adversarial Networks, Schrödinger Bridge, Mass transport, transport problems arise
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Mass transport problems arise in many areas of machine learning whereby one wants to compute a map transporting one distribution to another. Generative modeling techniques like Generative Adversarial Networks (GANs) and Denoising Diffusion Models (DDMs) have been successfully adapted to solve such transport problems, resulting in CycleGAN and Bridge Matching respectively. However, these methods do not approximate Optimal Transport (OT) maps, which are known to have desirable properties. Existing techniques approximating OT maps for high-dimensional data-rich problems, such as DDM-based Rectified Flow and Schrödinger Bridge procedures, require fully training a DDM-type model at each iteration, or use mini-batch techniques which can introduce significant errors. We propose a novel algorithm to compute the Schrödinger Bridge, a dynamic entropy-regularised version of OT, that eliminates the need to train multiple DDM-like models. This algorithm corresponds to a discretisation of a flow of path measures, which we call the Schrödinger Bridge Flow, whose only stationary point is the Schrödinger Bridge. We demonstrate the performance of our algorithm on a variety of unpaired data translation tasks.

[LG-125] The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech

链接: https://arxiv.org/abs/2409.09305
作者: Kaito Baba,Wataru Nakata,Yuki Saito,Hiroshi Saruwatari
关键词-EN: VoiceMOS Challenge, VMC, Challenge, system, synthetic speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted by IEEE SLT 2024. Our MOS prediction system (UTMOSv2) is available in this https URL

点击查看摘要

Abstract:We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024. Our system was designed for the VMC 2024 Track 1, which focused on the accurate prediction of naturalness mean opinion score (MOS) for high-quality synthetic speech. In addition to a pretrained self-supervised learning (SSL)-based speech feature extractor, our system incorporates a pretrained image feature extractor to capture the difference of synthetic speech observed in speech spectrograms. We first separately train two MOS predictors that use either of an SSL-based or spectrogram-based feature. Then, we fine-tune the two predictors for better MOS prediction using the fusion of two extracted features. In the VMC 2024 Track 1, our T05 system achieved first place in 7 out of 16 evaluation metrics and second place in the remaining 9 metrics, with a significant difference compared to those ranked third and below. We also report the results of our ablation study to investigate essential factors of our system.

[LG-126] Consistent Spectral Clustering in Hyperbolic Spaces

链接: https://arxiv.org/abs/2409.09304
作者: Sagar Ghosh,Swagatam Das
关键词-EN: Euclidean Spaces, hyperbolic spaces, Spaces, Spectral Clustering, Clustering
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Currently under review in IEEE T-PAMI

点击查看摘要

Abstract:Clustering, as an unsupervised technique, plays a pivotal role in various data analysis applications. Among clustering algorithms, Spectral Clustering on Euclidean Spaces has been extensively studied. However, with the rapid evolution of data complexity, Euclidean Space is proving to be inefficient for representing and learning algorithms. Although Deep Neural Networks on hyperbolic spaces have gained recent traction, clustering algorithms or non-deep machine learning models on non-Euclidean Spaces remain underexplored. In this paper, we propose a spectral clustering algorithm on Hyperbolic Spaces to address this gap. Hyperbolic Spaces offer advantages in representing complex data structures like hierarchical and tree-like structures, which cannot be embedded efficiently in Euclidean Spaces. Our proposed algorithm replaces the Euclidean Similarity Matrix with an appropriate Hyperbolic Similarity Matrix, demonstrating improved efficiency compared to clustering in Euclidean Spaces. Our contributions include the development of the spectral clustering algorithm on Hyperbolic Spaces and the proof of its weak consistency. We show that our algorithm converges at least as fast as Spectral Clustering on Euclidean Spaces. To illustrate the efficacy of our approach, we present experimental results on the Wisconsin Breast Cancer Dataset, highlighting the superior performance of Hyperbolic Spectral Clustering over its Euclidean counterpart. This work opens up avenues for utilizing non-Euclidean Spaces in clustering algorithms, offering new perspectives for handling complex data structures and improving clustering efficiency.
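
The core substitution, replacing the Euclidean similarity matrix with one built from hyperbolic distances, can be sketched as follows; the Poincaré-ball distance, Gaussian kernel bandwidth, and the use of sklearn's SpectralClustering are illustrative choices rather than the paper's exact algorithm.

```python
# Illustrative sketch: spectral clustering on a hyperbolic similarity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

def poincare_distance(u, v, eps=1e-9):
    # Geodesic distance in the Poincare ball model of hyperbolic space.
    diff = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2)) + eps
    return np.arccosh(1 + 2 * diff / denom)

# Points assumed to already lie inside the unit ball (norm < 1).
X = np.random.RandomState(0).uniform(-0.4, 0.4, size=(100, 2))

n = len(X)
D = np.array([[poincare_distance(X[i], X[j]) for j in range(n)] for i in range(n)])
affinity = np.exp(-D ** 2 / (2 * 0.5 ** 2))          # Gaussian kernel on hyperbolic distance

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels[:10])
```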

[LG-127] Matrix Profile for Anomaly Detection on Multidimensional Time Series

链接: https://arxiv.org/abs/2409.09298
作者: Chin-Chia Michael Yeh,Audrey Der,Uday Singh Saini,Vivian Lai,Yan Zheng,Junpeng Wang,Xin Dai,Zhongfang Zhuang,Yujie Fan,Huiyuan Chen,Prince Osei Aboagye,Liang Wang,Wei Zhang,Eamonn Keogh
关键词-EN: time series, multidimensional time series, anomaly detection, series data mining, series anomaly detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:The Matrix Profile (MP), a versatile tool for time series data mining, has been shown effective in time series anomaly detection (TSAD). This paper delves into the problem of anomaly detection in multidimensional time series, a common occurrence in real-world applications. For instance, in a manufacturing factory, multiple sensors installed across the site collect time-varying data for analysis. The Matrix Profile, named for its role in profiling the matrix storing pairwise distance between subsequences of univariate time series, becomes complex in multidimensional scenarios. If the input univariate time series has n subsequences, the pairwise distance matrix is an n x n matrix. In a multidimensional time series with d dimensions, the pairwise distance information must be stored in an n x n x d tensor. In this paper, we first analyze different strategies for condensing this tensor into a profile vector. We then investigate the potential of extending the MP to efficiently find k-nearest neighbors for anomaly detection. Finally, we benchmark the multidimensional MP against 19 baseline methods on 119 multidimensional TSAD datasets. The experiments cover three learning setups: unsupervised, supervised, and semi-supervised. MP is the only method that consistently delivers high performance across all setups.
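
A minimal illustration of Matrix-Profile-based anomaly scoring with the `stumpy` library is given below, for both the univariate and multidimensional cases; the window length and the simple "largest profile value" discord rule are illustrative choices, not the benchmarking protocol of the paper.

```python
# Matrix-Profile anomaly scoring sketch using stumpy; parameters are illustrative.
import numpy as np
import stumpy

m = 50                                        # subsequence (window) length

# Univariate case: the highest matrix-profile values indicate discords (anomalies).
ts = np.random.randn(2000)
ts[1200:1250] += 6                            # inject an anomaly
profile = stumpy.stump(ts, m)[:, 0].astype(float)
print("univariate discord at index:", int(np.argmax(profile)))

# Multidimensional case: mstump expects a (d, n) array and returns one profile per dimension.
ts_multi = np.random.randn(3, 2000)
P, _ = stumpy.mstump(ts_multi, m)
print("per-dimension discord indices:", P.argmax(axis=1))
```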

[LG-128] Turbo your multi-modal classification with contrastive learning

链接: https://arxiv.org/abs/2409.09282
作者: Zhiyu Zhang,Da Liu,Shengqiang Liu,Anna Wang,Jie Gao,Yali Li
关键词-EN: Contrastive learning, impressive approaches, Contrastive, learning, cross-modal contrastive learning
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Contrastive learning has become one of the most impressive approaches for multi-modal representation learning. However, previous multi-modal works mainly focused on cross-modal understanding, ignoring in-modal contrastive learning, which limits the representation of each modality. In this paper, we propose a novel contrastive learning strategy, called Turbo, to promote multi-modal understanding by joint in-modal and cross-modal contrastive learning. Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality. With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training. Finally, we combine the self-supervised Turbo with the supervised multi-modal classification and demonstrate its effectiveness on two audio-text classification tasks, where the state-of-the-art performance is achieved on a speech emotion recognition benchmark dataset.
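
The two-pass idea, encoding each modality twice with different dropout masks and applying contrastive objectives both within and across modalities, can be sketched as below; the encoders, temperature, and loss weighting are placeholders rather than the paper's exact formulation.

```python
# Conceptual sketch of a joint in-modal + cross-modal contrastive objective.
# Encoders must be in training mode so that the two passes use different dropout masks.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE: matching rows of a and b are positives, all others negatives.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def turbo_style_loss(audio_encoder, text_encoder, audio, text):
    # Two forward passes per modality; dropout yields two different "views".
    a1, a2 = audio_encoder(audio), audio_encoder(audio)
    t1, t2 = text_encoder(text), text_encoder(text)
    in_modal = info_nce(a1, a2) + info_nce(t1, t2)
    cross_modal = info_nce(a1, t1) + info_nce(a2, t2)
    return in_modal + cross_modal
```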

[LG-129] Language Models “Grok” to Copy

链接: https://arxiv.org/abs/2409.09281
作者: Ang Lv,Ruobing Xie,Xingwu Sun,Zhanhui Kang,Rui Yan
关键词-EN: LLM applications, including in-context learning, Transformer-based language models, retrieval-augmented generation, copy text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 7 figures

点击查看摘要

Abstract:We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context–a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on test set long after the model fit to the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, similarly to how grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrated that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.

[LG-130] LabellessFace: Fair Metric Learning for Face Recognition without Attribute Labels

链接: https://arxiv.org/abs/2409.09274
作者: Tetsushi Ohki,Yuya Sato,Masakatsu Nishigaki,Koichi Ito
关键词-EN: major challenges, Demographic, Demographic bias, face recognition, recognition systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Demographic bias is one of the major challenges for face recognition systems. The majority of existing studies on demographic biases are heavily dependent on specific demographic groups or demographic classifiers, making it difficult to address performance for unrecognised groups. This paper introduces "LabellessFace", a novel framework that mitigates demographic bias in face recognition without requiring the demographic group labeling typically required for fairness considerations. We propose a novel fairness enhancement metric called the class favoritism level, which assesses the extent of favoritism towards specific classes across the dataset. Leveraging this metric, we introduce the fair class margin penalty, an extension of existing margin-based metric learning. This method dynamically adjusts learning parameters based on class favoritism levels, promoting fairness across all attributes. By treating each class as an individual in facial recognition systems, we facilitate learning that minimizes biases in authentication accuracy among individuals. Comprehensive experiments have demonstrated that our proposed method is effective for enhancing fairness while maintaining authentication accuracy.
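
As a rough illustration of a margin-based penalty whose per-class margin is modulated by a favoritism estimate, one could adapt an ArcFace-style head as below. The `favoritism` vector and how it scales the margin are assumptions for illustration; the paper's class favoritism level is not reproduced here.

```python
import torch
import torch.nn.functional as F

def fair_margin_logits(features, weight, labels, favoritism, s=64.0, m_base=0.5):
    """ArcFace-style logits with a per-class additive angular margin.

    features: (B, D) embeddings; weight: (C, D) class prototypes;
    favoritism: (C,) scores in [0, 1], higher = currently favored class
    (assumed here to receive a larger margin).
    """
    cos = F.normalize(features) @ F.normalize(weight).t()            # (B, C)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    m_per_class = m_base * (1.0 + favoritism)                        # assumed modulation
    target_margin = m_per_class[labels]                              # (B,)
    onehot = F.one_hot(labels, num_classes=weight.size(0)).bool()
    theta = torch.where(onehot, theta + target_margin.unsqueeze(1), theta)
    return s * torch.cos(theta)                                      # feed to cross-entropy
```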

[LG-131] Leveraging Foundation Models for Efficient Federated Learning in Resource-restricted Edge Networks

链接: https://arxiv.org/abs/2409.09273
作者: S. Kawa Atapour,S. Jamal SeyedMohammadi,S. Mohammad Sheikholeslami,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi
关键词-EN: Recently pre-trained Foundation, pre-trained Foundation Models, Recently pre-trained, Federated Learning, pre-trained Foundation
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, pre-trained Foundation Models (FMs) have been combined with Federated Learning (FL) to improve training of downstream tasks while preserving privacy. However, deploying FMs over edge networks with resource-constrained Internet of Things (IoT) devices is under-explored. This paper proposes a novel framework, namely, Federated Distilling knowledge to Prompt (FedD2P), for leveraging the robust representation abilities of a vision-language FM without deploying it locally on edge devices. This framework distills the aggregated knowledge of IoT devices to a prompt generator to efficiently adapt the frozen FM for downstream tasks. To eliminate the dependency on a public dataset, our framework leverages per-class local knowledge from IoT devices and linguistic descriptions of classes to train the prompt generator. Our experiments on diverse image classification datasets CIFAR, OxfordPets, SVHN, EuroSAT, and DTD show that FedD2P outperforms the baselines in terms of model performance.

[LG-132] Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks Domains and Knowledge Types

链接: https://arxiv.org/abs/2409.09269
作者: Neelabh Sinha,Vinija Jain,Aman Chadha
关键词-EN: aid user experience, Visual Question-Answering, achieving good results, user experience, zero-shot inference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 8 pages + references + 6 pages of Appendix

点击查看摘要

Abstract:Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieved good results in zero-shot inference. But evaluating different VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts, while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.

[LG-133] Operational Wind Speed Forecasts for Chile's Electric Power Sector Using a Hybrid ML Model

链接: https://arxiv.org/abs/2409.09263
作者: Dhruv Suri,Praneet Dutta,Flora Xue,Ines Azevedo,Ravi Jain
关键词-EN: managing grid operations, electric power sector, power sector advances, Chile electric power, renewable energy sources
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:As Chile’s electric power sector advances toward a future powered by renewable energy, accurate forecasting of renewable generation is essential for managing grid operations. The integration of renewable energy sources is particularly challenging due to the operational difficulties of managing their power generation, which is highly variable compared to fossil fuel sources, delaying the availability of clean energy. To mitigate this, we quantify the impact of increasing intermittent generation from wind and solar on thermal power plants in Chile and introduce a hybrid wind speed forecasting methodology which combines two custom ML models for Chile. The first model is based on TiDE, an MLP-based ML model for short-term forecasts, and the second is based on a graph neural network, GraphCast, for medium-term forecasts up to 10 days. Our hybrid approach outperforms the most accurate operational deterministic systems by 4-21% for short-term forecasts and 5-23% for medium-term forecasts and can directly lower the impact of wind generation on thermal ramping, curtailment, and system-level emissions in Chile.

[LG-134] Informative Subgraphs Aware Masked Auto-Encoder in Dynamic Graphs

链接: https://arxiv.org/abs/2409.09262
作者: Pengfe Jiao,Xinxun Zhang,Mengzhou Gao,Tianpeng Li,Zhidong Zhao
关键词-EN: graph machine learning, Generative self-supervised learning, garnered substantial research, substantial research interest, dynamic graphs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative self-supervised learning (SSL), especially masked autoencoders (MAE), has greatly succeeded and garnered substantial research interest in graph machine learning. However, the research of MAE in dynamic graphs is still scant. This gap is primarily due to the dynamic graph not only possessing topological structure information but also encapsulating temporal evolution dependency. Applying a random masking strategy which most MAE methods adopt to dynamic graphs will remove the crucial subgraph that guides the evolution of dynamic graphs, resulting in the loss of crucial spatio-temporal information in node representations. To bridge this gap, in this paper, we propose a novel Informative Subgraphs Aware Masked Auto-Encoder in Dynamic Graph, namely DyGIS. Specifically, we introduce a constrained probabilistic generative model to generate informative subgraphs that guide the evolution of dynamic graphs, successfully alleviating the issue of missing dynamic evolution subgraphs. The informative subgraph identified by DyGIS will serve as the input of dynamic graph masked autoencoder (DGMAE), effectively ensuring the integrity of the evolutionary spatio-temporal information within dynamic graphs. Extensive experiments on eleven datasets demonstrate that DyGIS achieves state-of-the-art performance across multiple tasks.

[LG-135] What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing

链接: https://arxiv.org/abs/2409.09261
作者: Chenyang Yang,Yining Hong,Grace A. Lewis,Tongshuang Wu,Christian Kästner
关键词-EN: Machine learning models, models make mistakes, Machine learning, learning models make, make mistakes
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify relevant examples. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
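
The core idea, asking an LLM whether each example satisfies a user-defined slicing criterion and collecting the matches into a slice, can be sketched as follows. `call_llm` is a placeholder for whatever LLM API is available, and the prompt wording and answer parsing are illustrative rather than the SemSlicer implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (a hosted API or a local model)."""
    raise NotImplementedError

def semantic_slice(examples, criterion):
    """Return the subset of examples an LLM judges to match a slicing criterion."""
    selected = []
    for text in examples:
        prompt = (
            f"Slicing criterion: {criterion}\n"
            f"Example: {text}\n"
            "Does the example satisfy the criterion? Answer yes or no."
        )
        answer = call_llm(prompt).strip().lower()
        if answer.startswith("yes"):
            selected.append(text)
    return selected

# Usage: semantic_slice(dataset_texts, "questions that require multi-step arithmetic")
```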

[LG-136] Active Learning to Guide Labeling Efforts for Question Difficulty Estimation ECML-PKDD2024

链接: https://arxiv.org/abs/2409.09258
作者: Arthur Thuy,Ekaterina Loginova,Dries F. Benoit
关键词-EN: Question Difficulty Estimation, Difficulty Estimation, Question Difficulty, language processing techniques, natural language processing
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: Published as a workshop paper at ECML-PKDD 2024

点击查看摘要

Abstract:In recent years, there has been a surge in research on Question Difficulty Estimation (QDE) using natural language processing techniques. Transformer-based neural networks achieve state-of-the-art performance, primarily through supervised methods but with an isolated study in unsupervised learning. While supervised methods focus on predictive performance, they require abundant labeled data. On the other hand, unsupervised methods do not require labeled data but rely on a different evaluation metric that is also computationally expensive in practice. This work bridges the research gap by exploring active learning for QDE, a supervised human-in-the-loop approach striving to minimize the labeling efforts while matching the performance of state-of-the-art models. The active learning process iteratively trains on a labeled subset, acquiring labels from human experts only for the most informative unlabeled data points. Furthermore, we propose a novel acquisition function PowerVariance to add the most informative samples to the labeled set, a regression extension to the PowerBALD function popular in classification. We employ DistilBERT for QDE and identify informative samples by applying Monte Carlo dropout to capture epistemic uncertainty in unlabeled samples. The experiments demonstrate that active learning with PowerVariance acquisition achieves a performance close to fully supervised models after labeling only 10% of the training data. The proposed methodology promotes the responsible use of educational resources, makes QDE tools more accessible to course instructors, and is promising for other applications such as personalized support systems and question-answering tools.
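
The uncertainty-estimation step described above, Monte Carlo dropout over the unlabeled pool followed by selection of the most uncertain samples, can be sketched generically as below. This shows plain variance-based acquisition, not the exact PowerVariance function; the number of stochastic passes and the top-k selection size are assumptions.

```python
import torch

@torch.no_grad()
def mc_dropout_acquire(model, unlabeled_x, k=100, passes=20):
    """Select the k unlabeled samples with the highest MC-dropout predictive variance.

    Assumes a regression model (e.g., difficulty estimation) that maps a batch of
    inputs to a (N,) tensor of predictions and contains dropout layers.
    """
    model.train()                                                     # keep dropout active
    preds = torch.stack([model(unlabeled_x).flatten() for _ in range(passes)])  # (T, N)
    variance = preds.var(dim=0)                                       # epistemic-uncertainty proxy
    return torch.topk(variance, k).indices                            # candidates for human labeling
```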

[LG-137] Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator

链接: https://arxiv.org/abs/2409.09253
作者: Jun Yin,Zhengxin Zeng,Mingzheng Li,Hao Yan,Chaozhuo Li,Weihao Han,Jianjin Zhang,Ruochen Liu,Allen Sun,Denvy Deng,Feng Sun,Qi Zhang,Shirui Pan,Senzhang Wang
关键词-EN: large language models, pre-trained large language, shown fantastic potential, next-generation recommender systems, semantic index
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Owing to the unprecedented capability in semantic understanding and logical reasoning, the pre-trained large language models (LLMs) have shown fantastic potential in developing the next-generation recommender systems (RSs). However, the static index paradigm adopted by current methods greatly restricts the utilization of LLMs capacity for recommendation, leading to not only the insufficient alignment between semantic and collaborative knowledge, but also the neglect of high-order user-item interaction patterns. In this paper, we propose Twin-Tower Dynamic Semantic Recommender (TTDS), the first generative RS which adopts dynamic semantic index paradigm, targeting at resolving the above problems simultaneously. To be more specific, we for the first time contrive a dynamic knowledge fusion framework which integrates a twin-tower semantic token generator into the LLM-based recommender, hierarchically allocating meaningful semantic index for items and users, and accordingly predicting the semantic index of target item. Furthermore, a dual-modality variational auto-encoder is proposed to facilitate multi-grained alignment between semantic and collaborative knowledge. Eventually, a series of novel tuning tasks specially customized for capturing high-order user-item interaction patterns are proposed to take advantages of user historical behavior. Extensive experiments across three public datasets demonstrate the superiority of the proposed methodology in developing LLM-based generative RSs. The proposed TTDS recommender achieves an average improvement of 19.41% in Hit-Rate and 20.84% in NDCG metric, compared with the leading baseline methods.

[LG-138] ETAGE: Enhanced Test Time Adaptation with Integrated Entropy and Gradient Norms for Robust Model Performance

链接: https://arxiv.org/abs/2409.09251
作者: Afshar Shamsi,Rejisa Becirovic,Ahmadreza Argha,Ehsan Abbasnejad,Hamid Alinejad-Rokny,Arash Mohammadi
关键词-EN: equips deep learning, unseen test data, handle unseen test, Label Probability Difference, deep learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Test time adaptation (TTA) equips deep learning models to handle unseen test data that deviates from the training distribution, even when source data is inaccessible. While traditional TTA methods often rely on entropy as a confidence metric, its effectiveness can be limited, particularly in biased scenarios. Extending existing approaches like the Pseudo Label Probability Difference (PLPD), we introduce ETAGE, a refined TTA method that integrates entropy minimization with gradient norms and PLPD, to enhance sample selection and adaptation. Our method prioritizes samples that are less likely to cause instability by combining high entropy with high gradient norms out of adaptation, thus avoiding the overfitting to noise often observed in previous methods. Extensive experiments on CIFAR-10-C and CIFAR-100-C datasets demonstrate that our approach outperforms existing TTA techniques, particularly in challenging and biased scenarios, leading to more robust and consistent model performance across diverse test scenarios. The codebase for ETAGE is available on this https URL.

[LG-139] Robust Training of Neural Networks at Arbitrary Precision and Sparsity

链接: https://arxiv.org/abs/2409.09245
作者: Chengxi Ye,Grace Chu,Yanfeng Liu,Yichi Zhang,Lukasz Lew,Andrew Howard
关键词-EN: sparsification introduce obstacles, discontinuous operations inherent, obstacles to backpropagation, discontinuous operations, introduce obstacles
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.

[LG-140] A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning

链接: https://arxiv.org/abs/2409.09242
作者: Yuesheng Xu,Arielle Carr
关键词-EN: processing vast amounts, efficient training essential, deep learning models, large-scale distributed systems, Elastic Averaging SGD
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The increasing complexity of deep learning models and the demand for processing vast amounts of data make the utilization of large-scale distributed systems for efficient training essential. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates various optimization techniques in distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure, enhancing the performance and efficiency of the overall training process. We conduct experiments with different numbers of workers and communication periods to demonstrate improved convergence rates and test performance using our strategy.
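
A toy version of a dynamic weighting rule for parameter averaging, down-weighting workers that have fallen behind or failed, might look like the following sketch; the staleness-based weights are an illustrative choice rather than the paper's exact formula.

```python
import numpy as np

def weighted_average(worker_params, staleness, alpha=0.5):
    """Average worker parameter vectors, down-weighting stale (straggling) workers.

    worker_params: list of 1-D numpy arrays (one flattened model per worker)
    staleness: list of ints, how many update periods each worker is behind
    """
    weights = np.array([1.0 / (1.0 + alpha * s) for s in staleness])
    weights /= weights.sum()
    stacked = np.stack(worker_params)                 # (W, P)
    return (weights[:, None] * stacked).sum(axis=0)   # (P,)

# Usage: center = weighted_average([w1, w2, w3], staleness=[0, 0, 5])
```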

[LG-141] Cross-Entropy Optimization for Hyperparameter Optimization in Stochastic Gradient-based Approaches to Train Deep Neural Networks

链接: https://arxiv.org/abs/2409.09240
作者: Kevin Li,Fulu Li
关键词-EN: stochastic gradient-based approaches, deep neural networks, train deep neural, neural networks, gradient-based approaches
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:In this paper, we present a cross-entropy optimization method for hyperparameter optimization in stochastic gradient-based approaches to train deep neural networks. The value of a hyperparameter of a learning algorithm often has great impact on the performance of a model such as the convergence speed, the generalization performance metrics, etc. While in some cases the hyperparameters of a learning algorithm can be part of learning parameters, in other scenarios the hyperparameters of a stochastic optimization algorithm such as Adam [5] and its variants are either fixed as a constant or are kept changing in a monotonic way over time. We give an in-depth analysis of the presented method in the framework of expectation maximization (EM). The presented algorithm of cross-entropy optimization for hyperparameter optimization of a learning algorithm (CEHPO) can be equally applicable to other areas of optimization problems in deep learning. We hope that the presented methods can provide different perspectives and offer some insights for optimization problems in different areas of machine learning and beyond.
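
The cross-entropy method itself is a simple sample–evaluate–refit loop. The sketch below tunes a single continuous hyperparameter (a learning rate searched in log-space); the population size, elite fraction, and the `evaluate` callback are assumptions, not the paper's CEHPO specifics.

```python
import numpy as np

def cem_search(evaluate, mu=-3.0, sigma=1.0, pop=20, elite_frac=0.2, iters=10, seed=0):
    """Cross-entropy method over log10(learning rate): sample, keep elites, refit."""
    rng = np.random.default_rng(seed)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=pop)                   # log10(lr) candidates
        scores = np.array([evaluate(10.0 ** s) for s in samples])   # higher score = better
        elites = samples[np.argsort(scores)[-n_elite:]]             # keep best candidates
        mu, sigma = elites.mean(), elites.std() + 1e-8              # refit sampling distribution
    return 10.0 ** mu

# Usage (hypothetical callback): best_lr = cem_search(lambda lr: short_run_validation_score(lr))
```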

[LG-142] Rational-WENO: A lightweight physically-consistent three-point weighted essentially non-oscillatory scheme

链接: https://arxiv.org/abs/2409.09217
作者: Shantanu Shahane,Sheide Chammas,Deniz A. Bezgin,Aaron B. Buhendwa,Steffen J. Schmidt,Nikolaus A. Adams,Spencer H. Bryngelson,Yi-Fan Chen,Qing Wang,Fei Sha,Leonardo Zepeda-Núñez
关键词-EN: introducing significant errors, introducing significant, highly dissipative, dissipative at lower, significant errors
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional WENO3 methods are known to be highly dissipative at lower resolutions, introducing significant errors in the pre-asymptotic regime. In this paper, we employ a rational neural network to accurately estimate the local smoothness of the solution, dynamically adapting the stencil weights based on local solution features. As rational neural networks can represent fast transitions between smooth and sharp regimes, this approach achieves a granular reconstruction with significantly reduced dissipation, improving the accuracy of the simulation. The network is trained offline on a carefully chosen dataset of analytical functions, bypassing the need for differentiable solvers. We also propose a robust model selection criterion based on estimates of the interpolation’s convergence order on a set of test functions, which correlates better with the model performance in downstream tasks. We demonstrate the effectiveness of our approach on several one-, two-, and three-dimensional fluid flow problems: our scheme generalizes across grid resolutions while handling smooth and discontinuous solutions. In most cases, our rational network-based scheme achieves higher accuracy than conventional WENO3 with the same stencil size, and in a few of them, it achieves accuracy comparable to WENO5, which uses a larger stencil.

[LG-143] Extending predictive process monitoring for collaborative processes

链接: https://arxiv.org/abs/2409.09212
作者: Daniel Calegari,Andrea Delgado
关键词-EN: orchestration-type processes performed, business process execution, mining on business, focused primarily, primarily on orchestration-type
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Process mining on business process execution data has focused primarily on orchestration-type processes performed in a single organization (intra-organizational). Collaborative (inter-organizational) processes, unlike those of orchestration type, expand several organizations (for example, in e-Government), adding complexity and various challenges both for their implementation and for their discovery, prediction, and analysis of their execution. Predictive process monitoring is based on exploiting execution data from past instances to predict the execution of current cases. It is possible to make predictions on the next activity and remaining time, among others, to anticipate possible deviations, violations, and delays in the processes to take preventive measures (e.g., re-allocation of resources). In this work, we propose an extension for collaborative processes of traditional process prediction, considering particularities of this type of process, which add information of interest in this context, for example, the next activity of which participant or the following message to be exchanged between two participants.

[LG-144] FB-HyDON: Parameter-Efficient Physics-Informed Operator Learning of Complex PDEs via Hypernetwork and Finite Basis Domain Decomposition

链接: https://arxiv.org/abs/2409.09207
作者: Milad Ramezankhani,Rishi Yash Parekh,Anirudh Deodhar,Dagnachew Birru
关键词-EN:
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

[LG-145] Batched Online Contextual Sparse Bandits with Sequential Inclusion of Features RECSYS24

链接: https://arxiv.org/abs/2409.09199
作者: Rowan Swiers,Subash Prabanantham,Andrew Maher
关键词-EN: personalized user experiences, optimize decision making, Multi-armed Bandits, Contextual Bandit problem, user experiences
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 4 pages, 4 figures, Accepted at the CONSEQUENCES 24 workshop, co-located with ACM RecSys 24

点击查看摘要

Abstract:Multi-armed Bandits (MABs) are increasingly employed in online platforms and e-commerce to optimize decision making for personalized user experiences. In this work, we focus on the Contextual Bandit problem with linear rewards, under conditions of sparsity and batched data. We address the challenge of fairness by excluding irrelevant features from decision-making processes using a novel algorithm, Online Batched Sequential Inclusion (OBSI), which sequentially includes features as confidence in their impact on the reward increases. Our experiments on synthetic data show the superior performance of OBSI compared to other algorithms in terms of regret, relevance of features used, and compute.

[LG-146] Are Sparse Neural Networks Better Hard Sample Learners? BMVC2024

链接: https://arxiv.org/abs/2409.09196
作者: Qiao Xiao,Boqian Wu,Lu Yin,Christopher Neil Gadzinski,Tianjin Huang,Mykola Pechenizkiy,Decebal Constantin Mocanu
关键词-EN: demonstrated impressive progress, Sparse Neural Networks, deep neural networks, impressive progress, noisy and intricate
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at British Machine Vision Conference (BMVC 2024)

点击查看摘要

Abstract:While deep learning has demonstrated impressive progress, it remains a daunting challenge to learn from hard samples as these samples are usually noisy and intricate. These hard samples play a crucial role in the optimal performance of deep neural networks. Most research on Sparse Neural Networks (SNNs) has focused on standard training data, leaving gaps in understanding their effectiveness on complex and challenging data. This paper's extensive investigation across scenarios reveals that most SNNs trained on challenging samples can often match or surpass dense models in accuracy at certain sparsity levels, especially with limited data. We observe that layer-wise density ratios tend to play an important role in SNN performance, particularly for methods that train from scratch without pre-trained initialization. These insights enhance our understanding of SNNs' behavior and potential for efficient learning approaches in data-centric AI. Our code is publicly available at this https URL.

[LG-147] Hierarchical Hypercomplex Network for Multimodal Emotion Recognition

链接: https://arxiv.org/abs/2409.09194
作者: Eleonora Lopez,Aurelio Uncini,Danilo Comminiello
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper has been accepted at MLSP 2024

点击查看摘要

[LG-148] ProcessTBench: An LLM Plan Generation Dataset for Process Mining MICRO

链接: https://arxiv.org/abs/2409.09191
作者: Andrei Cosmin Redis,Mohammadreza Fani Sani,Bahram Zarrin,Andrea Burattin
关键词-EN: Large Language Models, shown significant promise, Large Language, Language Models, plan generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 6 pages, 4 figures, dataset available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant promise in plan generation. Yet, existing datasets often lack the complexity needed for advanced tool use scenarios - such as handling paraphrased query statements, supporting multiple languages, and managing actions that can be done in parallel. These scenarios are crucial for evaluating the evolving capabilities of LLMs in real-world applications. Moreover, current datasets don’t enable the study of LLMs from a process perspective, particularly in scenarios where understanding typical behaviors and challenges in executing the same process under different conditions or formulations is crucial. To address these gaps, we present the ProcessTBench dataset, an extension of the TaskBench dataset specifically designed to evaluate LLMs within a process mining framework.

[LG-149] Quantum-inspired Reinforcement Learning for Synthesizable Drug Design

链接: https://arxiv.org/abs/2409.09183
作者: Dannong Wang,Jintai Chen,Zhiding Liang,Tianfan Fu,Xiao-Yang Liu
关键词-EN: Synthesizable molecular design, drug-relevant oracle functions, ensuring synthetic feasibility, synthesizable molecular optimization, Synthesizable molecular
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Synthesizable molecular design (also known as synthesizable molecular optimization) is a fundamental problem in drug discovery, and involves designing novel molecular structures to improve their properties according to drug-relevant oracle functions (i.e., objective) while ensuring synthetic feasibility. However, existing methods are mostly based on random search. To address this issue, in this paper, we introduce a novel approach using the reinforcement learning method with quantum-inspired simulated annealing policy neural network to navigate the vast discrete space of chemical structures intelligently. Specifically, we employ a deterministic REINFORCE algorithm using policy neural networks to output transitional probability to guide state transitions and local search using genetic algorithm to refine solutions to a local optimum within each iteration. Our methods are evaluated with the Practical Molecular Optimization (PMO) benchmark framework with a 10K query budget. We further showcase the competitive performance of our method by comparing it against the state-of-the-art genetic algorithms-based method.

[LG-150] Transformer with Controlled Attention for Synchronous Motion Captioning

链接: https://arxiv.org/abs/2409.09177
作者: Karim Radouane,Sylvie Ranwez,Julien Lagarde,Andon Tchechmedjiev
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-151] Curricula for Learning Robust Policies over Factored State Representations in Changing Environments

链接: https://arxiv.org/abs/2409.09169
作者: Panayiotis Panayiotou,Özgür Şimşek
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17th European Workshop on Reinforcement Learning (EWRL 2024)

点击查看摘要

[LG-152] Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation

链接: https://arxiv.org/abs/2409.09135
作者: Cheng Charles Ma,Kevin Hyekang Joo,Alexandria K. Vail,Sunreeta Bhattacharya,Álvaro Fernández García,Kailana Baker-Matsuoka,Sheryl Mathew,Lori L. Holt,Fernando De la Torre
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 22 pages, first three authors equal contribution

点击查看摘要

[LG-153] FAST: Boosting Uncertainty-based Test Prioritization Methods for Neural Networks via Feature Selection

链接: https://arxiv.org/abs/2409.09130
作者: Jialuo Chen,Jingyi Wang,Xiyue Zhang,Youcheng Sun,Marta Kwiatkowska,Jiming Chen,Peng Cheng
关键词-EN:
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-154] Neural Message Passing Induced by Energy-Constrained Diffusion ICLR2023

链接: https://arxiv.org/abs/2409.09111
作者: Qitian Wu,David Wipf,Junchi Yan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Extended version from DIFFormer paper in ICLR2023

点击查看摘要

[LG-155] Trimming the Risk: Towards Reliable Continuous Training for Deep Learning Inspection Systems

链接: https://arxiv.org/abs/2409.09108
作者: Altaf Allah Abbassi,Houssem Ben Braiek,Foutse Khomh,Thomas Reid
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
*备注:

点击查看摘要

[LG-156] Recent Trends in Modelling the Continuous Time Series using Deep Learning: A Survey

链接: https://arxiv.org/abs/2409.09106
作者: Mansura Habiba,Barak A. Pearlmutter,Mehrdad Maleki
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[LG-157] S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

链接: https://arxiv.org/abs/2409.09099
作者: Yuezhou Hu,Jun Zhu,Jianfei Chen
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-158] meds_reader: A fast and efficient EHR processing library

链接: https://arxiv.org/abs/2409.09095
作者: Ethan Steinberg,Michael Wornow,Suhana Bedi,Jason Alan Fries,Matthew B. A. McDermott,Nigam H. Shah
关键词-EN:
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

[LG-159] Y-Drop: A Conductance based Dropout for fully connected layers

链接: https://arxiv.org/abs/2409.09088
作者: Efthymios Georgiou,Georgios Paraskevopoulos,Alexandros Potamianos
关键词-EN:
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Draft paper version

点击查看摘要

[LG-160] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

链接: https://arxiv.org/abs/2409.09086
作者: Zhenyu Ning,Jieru Zhao,Qihao Jin,Wenchao Ding,Minyi Guo
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注:

点击查看摘要

[LG-161] HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning

链接: https://arxiv.org/abs/2409.09085
作者: Tianyi Chen,Xiaoyi Qu,David Aponte,Colby Banbury,Jongwoo Ko,Tianyu Ding,Yong Ma,Vladimir Lyapunov,Ilya Zharkov,Luming Liang
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: preprint

点击查看摘要

[LG-162] Distributed Convolutional Neural Network Training on Mobile and Edge Clusters

链接: https://arxiv.org/abs/2409.09083
作者: Pranav Rama,Madison Threadgill,Andreas Gerstlauer
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-163] D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks VLDB’24

链接: https://arxiv.org/abs/2409.09079
作者: Rustam Guliyev,Aparajita Haldar,Hakan Ferhatosmanoglu
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 7 figures, published at VLDB’24

点击查看摘要

[LG-164] Fair Reinforcement Learning Algorithm for PV Active Control in LV Distribution Networks

链接: https://arxiv.org/abs/2409.09074
作者: Maurizio Vassallo,Amina Benzerga,Alireza Bahmanyar,Damien Ernst
关键词-EN:
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-165] Joint Model Assignment and Resource Allocation for Cost-Effective Mobile Generative Services

链接: https://arxiv.org/abs/2409.09072
作者: Shuangwei Gao,Peng Yang,Yuxin Kong,Feng Lyu,Ning Zhang
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-166] Redefining Data-Centric Design: A New Approach with a Domain Model and Core Data Ontology for Computational Systems

链接: https://arxiv.org/abs/2409.09058
作者: William Johnson,James Davis,Tara Kelly
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-167] Identifying Factors to Help Improve Existing Decomposition-Based PMI Estimation Methods

链接: https://arxiv.org/abs/2409.09056
作者: Anna-Maria Nau,Phillip Ditto,Dawnie Wolfe Steadman,Audris Mockus
关键词-EN:
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 17 pages

点击查看摘要

[LG-168] AI Meets the Classroom: When Does ChatGPT Harm Learning?

链接: https://arxiv.org/abs/2409.09047
作者: Matthias Lehmann,Philipp B. Cornelius,Fabian J. Sting
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-169] HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications EMNLP2024

链接: https://arxiv.org/abs/2409.09046
作者: Rishi Kalra,Zekun Wu,Ayesha Gulley,Airlie Hilliard,Xin Guan,Adriano Koshiyama,Philip Treleaven
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review for the EMNLP 2024 Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual

点击查看摘要

[LG-170] ElasticAI: Creating and Deploying Energy-Efficient Deep Learning Accelerator for Pervasive Computing

链接: https://arxiv.org/abs/2409.09044
作者: Chao Qian,Tianheng Ling,Gregor Schiele
关键词-EN:
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The paper is accepted by 2023 IEEE International Conference on Pervasive Computing and Communications (Best Demo Award)

点击查看摘要

[LG-171] AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding

链接: https://arxiv.org/abs/2409.09039
作者: Zihan Huang,Tao Wu,Wang Lin,Shengyu Zhang,Jingyuan Chen,Fei Wu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[LG-172] Evaluating Modern Approaches in 3D Scene Reconstruction: NeRF vs Gaussian-Based Methods

链接: https://arxiv.org/abs/2408.04268
作者: Yiming Zhou,Zixuan Zeng,Andi Chen,Xiaofan Zhou,Haowei Ni,Shiyao Zhang,Panfeng Li,Liangxi Liu,Mengyao Zheng,Xupeng Chen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

[LG-173] Regional Style and Color Transfer

链接: https://arxiv.org/abs/2404.13880
作者: Zhicheng Ding,Panfeng Li,Qikai Yang,Siyang Li,Qingtian Gong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Computer Vision, Image and Deep Learning

点击查看摘要

[LG-174] A Comparative Study on Enhancing Prediction in Social Network Advertisement through Data Augmentation

链接: https://arxiv.org/abs/2404.13812
作者: Qikai Yang,Panfeng Li,Xinhe Xu,Zhicheng Ding,Wenjing Zhou,Yi Nian
关键词-EN:
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by 2024 4th International Conference on Machine Learning and Intelligent Systems Engineering (MLISE)

点击查看摘要

[LG-175] Exploring Diverse Methods in Visual Question Answering

链接: https://arxiv.org/abs/2404.13565
作者: Panfeng Li,Qikai Yang,Xieming Geng,Wenjing Zhou,Zhicheng Ding,Yi Nian
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[LG-176] Confidence Trigger Detection: Accelerating Real-time Tracking-by-detection Systems

链接: https://arxiv.org/abs/1902.00615
作者: Zhicheng Ding,Zhixin Lai,Siyang Li,Panfeng Li,Qikai Yang,Edward Wong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[LG-177] Contextual Hourglass Network for Semantic Segmentation of High Resolution Aerial Imagery

链接: https://arxiv.org/abs/1810.12813
作者: Panfeng Li,Youzuo Lin,Emily Schultz-Fellenz
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Electronic Communication and Artificial Intelligence

点击查看摘要

[LG-178] Online Nonconvex Bilevel Optimization with Bregman Divergences

链接: https://arxiv.org/abs/2409.10470
作者: Jason Bohne,David Rosenberg,Gary Kazantsev,Pawel Polak
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-179] VAE-QWGAN: Improving Quantum GANs for High Resolution Image Generation

链接: https://arxiv.org/abs/2409.10339
作者: Aaron Mark Thomas,Sharu Theresa Jose
关键词-EN:
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 8 figures

点击查看摘要

[LG-180] Research and Design of a Financial Intelligent Risk Control Platform Based on Big Data Analysis and Deep Machine Learning

链接: https://arxiv.org/abs/2409.10331
作者: Shuochen Bi,Yufan Lian,Ziyue Wang
关键词-EN:
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

[LG-181] On the Hardness of Meaningful Local Guarantees in Nonsmooth Nonconvex Optimization

链接: https://arxiv.org/abs/2409.10323
作者: Guy Kornowski,Swati Padmanabhan,Ohad Shamir
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 27 pages; comments welcome!

点击查看摘要

[LG-182] Self-Updating Vehicle Monitoring Framework Employing Distributed Acoustic Sensing towards Real-World Settings

链接: https://arxiv.org/abs/2409.10259
作者: Xi Wang,Xin Liu,Songming Zhu,Zhanwen Li,Lina Gao
关键词-EN:
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

[LG-183] Reinforcement learning-based statistical search strategy for an axion model from flavor

链接: https://arxiv.org/abs/2409.10023
作者: Satsuki Nishimura,Coh Miyao,Hajime Otsuka
关键词-EN:
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 39 pages, 4 figures

点击查看摘要

[LG-184] Variance-reduced first-order methods for deterministically constrained stochastic nonconvex optimization with strong convergence guarantees

链接: https://arxiv.org/abs/2409.09906
作者: Zhaosong Lu,Sanyou Mei,Yifeng Xiao
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 29 pages

点击查看摘要

[LG-185] Learning large softmax mixtures with warm start EM

链接: https://arxiv.org/abs/2409.09903
作者: Xin Bing,Florentina Bunea,Jonathan Niles-Weed,Marten Wegkamp
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-186] RandALO: Out-of-sample risk estimation in no time flat

链接: https://arxiv.org/abs/2409.09781
作者: Parth T. Nobel,Daniel LeJeune,Emmanuel J. Candès
关键词-EN:
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 25 pages, 9 figures

点击查看摘要

[LG-187] Extrapolative ML Models for Copolymers

链接: https://arxiv.org/abs/2409.09691
作者: Israrul H. Hashmi,Himanshu,Rahul Karmakar,Tarak K Patra
关键词-EN:
类目: Soft Condensed Matter (cond-mat.soft); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-188] Conditional sampling within generative diffusion models

链接: https://arxiv.org/abs/2409.09650
作者: Zheng Zhao,Ziwei Luo,Jens Sjölund,Thomas B. Schön
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-189] Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

链接: https://arxiv.org/abs/2409.09642
作者: Yudong Yang,Zhan Liu,Wenyi Yu,Guangzhi Sun,Qiuqiang Kong,Chao Zhang
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

[LG-190] Machine learning assisted screening of metal binary alloys for anode materials

链接: https://arxiv.org/abs/2409.09583
作者: Xingyue Shi,Linming Zhou,Yuhui Huang,Yongjun Wu,Zijian Hong
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 41 pages include SI, 5 figures in main

点击查看摘要

[LG-191] Astrometric Binary Classification Via Artificial Neural Networks

链接: https://arxiv.org/abs/2409.09563
作者: Joe Smith
关键词-EN:
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Accepted for publication in Astrophysical Journal (ApJ)

点击查看摘要

[LG-192] MANGO: Disentangled Image Transformation Manifolds with Grouped Operators ICASSP2025

链接: https://arxiv.org/abs/2409.09542
作者: Brighton Ancelin,Yenho Chen,Peimeng Guan,Chiraag Kaushik,Belen Martin-Urcelay,Alex Saad-Falcon,Nakul Singh
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Submitted to IEEE ICASSP 2025. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

[LG-193] Evaluating probabilistic and data-driven inference models for fiber-coupled NV-diamond temperature sensors

链接: https://arxiv.org/abs/2409.09487
作者: Shraddha Rajpal,Zeeshan Ahmed,Tyrus Berry
关键词-EN:
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 15 pages, 8 figures, 3 tables

点击查看摘要

[LG-194] Self-Prompting Polyp Segmentation in Colonoscopy using Hybrid Yolo-SAM 2 Model

链接: https://arxiv.org/abs/2409.09484
作者: Mobina Mansoori,Sajjad Shahabodini,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-195] Neumann Series-based Neural Operator for Solving Inverse Medium Problem

链接: https://arxiv.org/abs/2409.09480
作者: Ziyang Liu,Fukai Chen,Junqing Chen,Lingyun Qiu,Zuoqiang Shi
关键词-EN:
类目: Mathematical Physics (math-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-196] Hyperedge Representations with Hypergraph Wavelets: Applications to Spatial Transcriptomics

链接: https://arxiv.org/abs/2409.09469
作者: Xingzhi Sun,Charles Xu,João F. Rocha,Chen Liu,Benjamin Hollander-Bodie,Laney Goldman,Marcello DiStasio,Michael Perlmutter,Smita Krishnaswamy
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

[LG-197] Topological Tensor Eigenvalue Theorems in Data Fusion

链接: https://arxiv.org/abs/2409.09392
作者: Ronald Katende
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

[LG-198] WeatherReal: A Benchmark Based on In-Situ Observations for Evaluating Weather Models

链接: https://arxiv.org/abs/2409.09371
作者: Weixin Jin,Jonathan Weyn,Pengcheng Zhao,Siqi Xiang,Jiang Bian,Zuliang Fang,Haiyu Dong,Hongyu Sun,Kit Thambiratnam,Qi Zhang
关键词-EN:
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-199] Persistent pseudopod splitting is an effective chemotaxis strategy in shallow gradients

链接: https://arxiv.org/abs/2409.09342
作者: Albert Alonso,Julius B. Kirkegaard,Robert G. Endres
关键词-EN:
类目: Cell Behavior (q-bio.CB); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注: 11 pages, 5 figures

点击查看摘要

[LG-200] Automated design of nonreciprocal thermal emitters via Bayesian optimization

链接: https://arxiv.org/abs/2409.09192
作者: Bach Do,Sina Jafari Ghalekohneh,Taiwo Adebiyi,Bo Zhao,Ruda Zhang
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

[LG-201] Fast Structured Orthogonal Dictionary Learning using Householder Reflections ICASSP

链接: https://arxiv.org/abs/2409.09138
作者: Anirudh Dash,Aditya Siripuram
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, Submitted to IEEE ICASSP, 2025

点击查看摘要

[LG-202] Exploring Biological Neuronal Correlations with Quantum Generative Models

链接: https://arxiv.org/abs/2409.09125
作者: Vinicius Hernandes,Eliska Greplova
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注: 33 pages, 14 figures, code: this https URL

点击查看摘要

[LG-203] KKT-Informed Neural Network

链接: https://arxiv.org/abs/2409.09087
作者: Carmine Delle Femine
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-204] Bounds on the Generalization Error in Active Learning

链接: https://arxiv.org/abs/2409.09078
作者: Vincent Menden,Yahya Saleh,Armin Iske
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-205] SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting

链接: https://arxiv.org/abs/2409.09067
作者: Kumari Nishu,Minsik Cho,Devang Naik
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
*备注:

点击查看摘要

[LG-206] Towards safe and tractable Gaussian process-based MPC: Efficient sampling within a sequential quadratic programming framework

链接: https://arxiv.org/abs/2409.08616
作者: Manish Prajapat,Amon Lahr,Johannes Köhler,Andreas Krause,Melanie N. Zeilinger
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: to be published in 63rd IEEE Conference on Decision and Control (CDC 2024)

点击查看摘要

[LG-207] Harnessing Earnings Reports for Stock Predictions: A QLoRA-Enhanced LLM Approach

链接: https://arxiv.org/abs/2408.06634
作者: Haowei Ni,Shuchen Meng,Xupeng Chen,Ziqing Zhao,Andi Chen,Panfeng Li,Shiyao Zhang,Qifu Yin,Yuanqing Wang,Yuxi Chan
关键词-EN:
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

[LG-208] Zero-Order Optimization for Gaussian Process-based Model Predictive Control

链接: https://arxiv.org/abs/2211.15522
作者: Amon Lahr,Andrea Zanelli,Andrea Carron,Melanie N. Zeilinger
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: accepted for European Journal of Control (EJC), ECC 2023 Special Issue

点击查看摘要

信息检索

[IR-0] Incorporating Classifier-Free Guidance in Diffusion Model-Based Recommendation

链接: https://arxiv.org/abs/2409.10494
作者: Noah Buchanan,Susan Gauch,Quan Mai
关键词-EN: Generative Adversarial Networks, recommender system, diffusion-based recommender system, recommender, classifier-free guidance
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注: 8 pages

点击查看摘要

Abstract:This paper presents a diffusion-based recommender system that incorporates classifier-free guidance. Most current recommender systems provide recommendations using conventional methods such as collaborative or content-based filtering. Diffusion is a new approach to generative AI that improves on previous generative AI approaches such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). We incorporate diffusion in a recommender system that mirrors the sequence users take when browsing and rating items. Although a few current recommender systems incorporate diffusion, they do not incorporate classifier-free guidance, a new innovation in diffusion models as a whole. In this paper, we present a diffusion recommender system that augments the underlying recommender system model for improved performance and also incorporates classifier-free guidance. Our findings show improvements over state-of-the-art recommender systems for most metrics for several recommendation tasks on a variety of datasets. In particular, our approach demonstrates the potential to provide better recommendations when data is sparse.
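
Classifier-free guidance itself amounts to blending conditional and unconditional denoiser outputs at each reverse diffusion step. The sketch below shows the standard formulation; the guidance weight and the denoiser interface are generic assumptions, not this paper's recommender architecture.

```python
def cfg_denoise(denoiser, x_t, t, cond, w=2.0):
    """Classifier-free guidance: blend conditional and unconditional predictions.

    denoiser(x_t, t, cond) -> predicted noise; cond=None means unconditional
    (e.g., the conditioning signal replaced by a learned null token).
    """
    eps_cond = denoiser(x_t, t, cond)
    eps_uncond = denoiser(x_t, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)   # w > 1 strengthens the condition
```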

[IR-1] Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation

链接: https://arxiv.org/abs/2409.10343
作者: Tianrui Song,Wenshuo Chao,Hao Liu
关键词-EN: unavoidably confronts noise, Implicit feedback, build recommender systems, unavoidably confronts, position bias
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Implicit feedback, often used to build recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias. Previous studies have attempted to alleviate this by identifying noisy samples based on their diverged patterns, such as higher loss values, and mitigating the noise through sample dropping or reweighting. Despite the progress, we observe existing approaches struggle to distinguish hard samples and noise samples, as they often exhibit similar patterns, thereby limiting their effectiveness in denoising recommendations. To address this challenge, we propose a Large Language Model Enhanced Hard Sample Denoising (LLMHD) framework. Specifically, we construct an LLM-based scorer to evaluate the semantic consistency of items with the user preference, which is quantified based on summarized historical user interactions. The resulting scores are used to assess the hardness of samples for the pointwise or pairwise training objectives. To ensure efficiency, we introduce a variance-based sample pruning strategy to filter potential hard samples before scoring. Besides, we propose an iterative preference update module designed to continuously refine summarized user preference, which may be biased due to false-positive user-item interactions. Extensive experiments on three real-world datasets and four backbone recommenders demonstrate the effectiveness of our approach.
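
One plausible reading of the variance-based sample pruning step is to track each sample's training loss across epochs and keep only the highest-variance samples as candidates for the LLM-based scorer. The sketch below is an illustrative interpretation, not the paper's exact criterion.

```python
import numpy as np

def prune_candidates(loss_history, keep_ratio=0.2):
    """Keep samples with the highest per-epoch loss variance as potential hard samples.

    loss_history: (E, N) array of per-sample training losses over E epochs.
    Returns indices of candidates to pass to an LLM-based hardness scorer.
    """
    variance = np.var(loss_history, axis=0)            # (N,)
    n_keep = max(1, int(len(variance) * keep_ratio))
    return np.argsort(variance)[-n_keep:]
```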

[IR-2] beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems RECSYS2024

链接: https://arxiv.org/abs/2409.10309
作者: Vojtěch Vančura,Pavel Kordík,Milan Straka
关键词-EN: zero-shot recommendation scenarios, Recommender systems, improve their predictions, recommendation scenarios, cold-start or zero-shot
类目: Information Retrieval (cs.IR)
*备注: Accepted to RecSys 2024

点击查看摘要

Abstract:Recommender systems often use text-side information to improve their predictions, especially in cold-start or zero-shot recommendation scenarios, where traditional collaborative filtering approaches cannot be used. Many approaches to text-mining side information for recommender systems have been proposed over recent years, with sentence Transformers being the most prominent one. However, these models are trained to predict semantic similarity without utilizing interaction data with hidden patterns specific to recommender systems. In this paper, we propose beeFormer, a framework for training sentence Transformer models with interaction data. We demonstrate that our models trained with beeFormer can transfer knowledge between datasets while outperforming not only semantic similarity sentence Transformers but also traditional collaborative filtering methods. We also show that training on multiple datasets from different domains accumulates knowledge in a single model, unlocking the possibility of training universal, domain-agnostic sentence Transformer models to mine text representations for recommender systems. We release the source code, trained models, and additional details allowing replication of our experiments at this https URL.

[IR-3] Causal Discovery in Recommender Systems: Example and Discussion RECSYS’24

链接: https://arxiv.org/abs/2409.10271
作者: Emanuele Cavenaghi,Fabio Stella,Markus Zanker
关键词-EN: receiving increasing attention, Causality is receiving, machine learning communities, receiving increasing, increasing attention
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Causality is receiving increasing attention by the artificial intelligence and machine learning communities. This paper gives an example of modelling a recommender system problem using causal graphs. Specifically, we approached the causal discovery task to learn a causal graph by combining observational data from an open-source dataset with prior knowledge. The resulting causal graph shows that only a few variables effectively influence the analysed feedback signals. This contrasts with the recent trend in the machine learning community to include more and more variables in massive models, such as neural networks.

[IR-4] Enhancing Personalized Recipe Recommendation Through Multi-Class Classification

链接: https://arxiv.org/abs/2409.10267
作者: Harish Neelam,Koushik Sai Veerella
关键词-EN: diverse culinary preferences, intends to address, address the challenge, realm of diverse, association analysis
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper intends to address the challenge of personalized recipe recommendation in the realm of diverse culinary preferences. The problem domain involves recipe recommendations, utilizing techniques such as association analysis and classification. Association analysis explores the relationships and connections between different ingredients to enhance the user experience. Meanwhile, the classification aspect involves categorizing recipes based on user-defined ingredients and preferences. A unique aspect of the paper is the consideration of recipes and ingredients belonging to multiple classes, recognizing the complexity of culinary combinations. This necessitates a sophisticated approach to classification and recommendation, ensuring the system accommodates the nature of recipe categorization. The paper seeks not only to recommend recipes but also to explore the process involved in achieving accurate and personalized recommendations.

[IR-5] jina-embeddings-v3: Multilingual Embeddings With Task LoRA

链接: https://arxiv.org/abs/2409.10173
作者: Saba Sturua,Isabelle Mohr,Mohammad Kalim Akram,Michael Günther,Bo Wang,Markus Krimmel,Feng Wang,Georgios Mastrapas,Andreas Koukounas,Nan Wang,Han Xiao
关键词-EN: supporting context lengths, million parameters, supporting context, Matryoshka Representation Learning, long-context retrieval tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 20 pages, pp11-13 references, pp14-20 appendix and experiment tables

点击查看摘要

Abstract:We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
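Matryoshka Representation Learning 的含义是:向量的前若干维已携带大部分信息,因此编码后可直接截断降维。下面给出该截断步骤的极简示意;其中的基座模型名只是占位符,加载 jina-embeddings-v3 所需的具体参数(如 task adapter 的指定方式)此处不展开。

```python
# Sketch of Matryoshka-style truncation: keep the first k dimensions and
# re-normalise before cosine similarity. The model name is a placeholder.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a Matryoshka-trained model
docs = ["long-context retrieval", "multilingual embeddings", "cooking recipes"]

full = model.encode(docs)                         # shape (3, d)

def truncate(emb: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions of each row and L2-normalise."""
    cut = emb[:, :k]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

small = truncate(full, k=64)
sims = small @ small.T                            # cosine similarities on truncated vectors
print(sims.round(3))
```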

[IR-6] Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

链接: https://arxiv.org/abs/2409.10102
作者: Yujia Zhou,Yan Liu,Xiaoxi Li,Jiajie Jin,Hongjin Qian,Zheng Liu,Chaozhuo Li,Zhicheng Dou,Tsung-Yi Ho,Philip S. Yu
关键词-EN: Large Language Models, Large Language, RAG systems, Retrieval-Augmented Generation, development of Large
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). While much of the current research in this field focuses on performance optimization, particularly in terms of accuracy and efficiency, the trustworthiness of RAG systems remains an area still under exploration. From a positive perspective, RAG systems promise to enhance LLMs by providing them with useful and up-to-date knowledge from vast external databases, thereby mitigating the long-standing problem of hallucination. From a negative perspective, however, RAG systems risk generating undesirable content if the retrieved information is inappropriate or poorly utilized. To address these concerns, we propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we thoroughly review the existing literature on each dimension. Additionally, we create the evaluation benchmark regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Finally, we identify the potential challenges for future research based on our investigation results. Through this work, we aim to lay a structured foundation for future investigations and provide practical insights for enhancing the trustworthiness of RAG systems in real-world applications.

[IR-7] Global Lightning-Ignited Wildfires Prediction and Climate Change Projections based on Explainable Machine Learning Models

链接: https://arxiv.org/abs/2409.10046
作者: Assaf Shmuel,Teddy Lazebnik,Oren Glickman,Eyal Heifetz,Colin Price
关键词-EN: lightning-ignited wildfires, significant natural disaster, natural disaster risk, Wildfires, natural disaster
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Wildfires pose a significant natural disaster risk to populations and contribute to accelerated climate change. As wildfires are also affected by climate change, extreme wildfires are becoming increasingly frequent. Although they occur less frequently globally than those sparked by human activities, lightning-ignited wildfires play a substantial role in carbon emissions and account for the majority of burned areas in certain regions. While existing computational models, especially those based on machine learning, aim to predict lightning-ignited wildfires, they are typically tailored to specific regions with unique characteristics, limiting their global applicability. In this study, we present machine learning models designed to characterize and predict lightning-ignited wildfires on a global scale. Our approach involves classifying lightning-ignited versus anthropogenic wildfires, and estimating with high accuracy the probability of lightning igniting a fire based on a wide spectrum of factors such as meteorological conditions and vegetation. Utilizing these models, we analyze seasonal and spatial trends in lightning-ignited wildfires, shedding light on the impact of climate change on this phenomenon. We analyze the influence of various features on the models using eXplainable Artificial Intelligence (XAI) frameworks. Our findings highlight significant global differences between anthropogenic and lightning-ignited wildfires. Moreover, we demonstrate that, even over a short time span of less than a decade, climate change has steadily increased the global risk of lightning-ignited wildfires. This distinction underscores the imperative need for dedicated predictive models and fire weather indices tailored specifically to each type of wildfire.
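下面是对摘要所述流程的一个假设性示意:用树模型对气象/植被特征做“雷击起火 vs 人为起火”分类,再用 SHAP 解释各特征的影响。其中的特征名和合成数据均为虚构,论文实际使用的模型、特征与 XAI 设置可能不同。

```python
# Hypothetical sketch: classify lightning- vs human-ignited fires and explain with SHAP.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 2000
X = pd.DataFrame({
    "lightning_density": rng.gamma(2.0, 1.0, n),
    "temperature_c": rng.normal(25, 8, n),
    "humidity_pct": rng.uniform(10, 90, n),
    "vegetation_dryness": rng.uniform(0, 1, n),
})
# Synthetic label: lightning ignitions more likely with dense lightning and dry fuel.
p = 1 / (1 + np.exp(-(1.2 * X["lightning_density"] + 2.0 * X["vegetation_dryness"] - 3.0)))
y = (rng.uniform(size=n) < p).astype(int)

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
print("mean |SHAP| per feature:",
      dict(zip(X.columns, np.abs(shap_values).mean(axis=0).round(3))))
```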

[IR-8] DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval INTERSPEECH2024

链接: https://arxiv.org/abs/2409.10025
作者: Yifei Xin,Xuxin Cheng,Zhihong Zhu,Xusheng Yang,Yuexian Zou
关键词-EN: Existing audio-text retrieval, Existing audio-text, methods are essentially, conditional likelihood, aim to maximize
类目: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: Accepted by Interspeech2024

点击查看摘要

Abstract:Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(query), leading to difficulties in discerning out-of-distribution data. In this work, we attempt to tackle this constraint through a generative perspective and model the relationship between audio and text as their joint probability p(candidates,query). To this end, we present a diffusion-based ATR framework (DiffATR), which models ATR as an iterative procedure that progressively generates the joint distribution from noise. Throughout its training phase, DiffATR is optimized from both generative and discriminative viewpoints: the generator is refined through a generation loss, while the feature extractor benefits from a contrastive loss, thus combining the merits of both methodologies. Experiments on the AudioCaps and Clotho datasets show superior performance and verify the effectiveness of our approach. Notably, without any alterations, our DiffATR consistently exhibits strong performance in out-of-domain retrieval settings.
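摘要描述的训练方式——生成式(去噪)损失与判别式(对比)损失相结合——可以用下面的 PyTorch 片段做一个通用示意。这只是两类目标函数组合方式的泛化演示,并非 DiffATR 的真实网络结构或损失定义;denoiser、权重 alpha 等均为假设。

```python
# Generic sketch: combine a generative (denoising) loss with a contrastive loss,
# mirroring the described generative + discriminative training. Not the DiffATR model.
import torch
import torch.nn.functional as F

def combined_loss(audio_emb, text_emb, denoiser, noise_level=0.1, temperature=0.07, alpha=0.5):
    # Contrastive (discriminative) part: align paired audio/text embeddings in-batch.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    contrastive = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

    # Generative (denoising) part: predict the noise added to the joint representation.
    joint = torch.cat([audio_emb, text_emb], dim=-1)
    noise = noise_level * torch.randn_like(joint)
    generative = F.mse_loss(denoiser(joint + noise), noise)

    return alpha * generative + (1 - alpha) * contrastive

# Tiny usage example with random features and a linear "denoiser".
denoiser = torch.nn.Linear(256, 256)
loss = combined_loss(torch.randn(8, 128), torch.randn(8, 128), denoiser)
loss.backward()
```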

[IR-9] Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search

链接: https://arxiv.org/abs/2409.09913
作者: Jianyang Gao,Yutong Gou,Yuexuan Xu,Yongyi Yang,Cheng Long,Raymond Chi-Wing Wong
关键词-EN: Approximate nearest neighbor, high-dimensional Euclidean space, Approximate nearest, high-dimensional Euclidean, nearest neighbor
类目: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注: Preprint

点击查看摘要

Abstract:Approximate nearest neighbor (ANN) query in high-dimensional Euclidean space is a key operator in database systems. For this query, quantization is a popular family of methods developed for compressing vectors and reducing memory consumption. Recently, a method called RaBitQ achieves the state-of-the-art performance among these methods. It produces better empirical performance in both accuracy and efficiency when using the same compression rate and provides rigorous theoretical guarantees. However, the method is only designed for compressing vectors at high compression rates (32x) and lacks support for achieving higher accuracy by using more space. In this paper, we introduce a new quantization method to address this limitation by extending RaBitQ. The new method inherits the theoretical guarantees of RaBitQ and, as proven in this study, achieves asymptotic optimality in terms of the trade-off between space and error bounds. Additionally, we present efficient implementations of the method, enabling its application to ANN queries to reduce both space and time consumption. Extensive experiments on real-world datasets confirm that our method consistently outperforms the state-of-the-art baselines in both accuracy and efficiency when using the same amount of memory.
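这一类量化方法的基本思路——先对向量做随机旋转,再只保留每维的粗粒度编码——可以用几行代码说明。下面的示意使用 1 比特(符号)编码和基于编码的相似度估计,展示的是该方法所属的量化思想,而非论文提出的多比特扩展方案本身。

```python
# Toy illustration of rotation + 1-bit quantization for approximate distance estimation.
# Not the paper's method; only the general quantization idea it extends.
import numpy as np

rng = np.random.default_rng(42)
d, n = 128, 10000
data = rng.normal(size=(n, d)).astype(np.float32)
query = rng.normal(size=d).astype(np.float32)

# Random orthogonal rotation spreads information evenly across dimensions.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

codes = (data @ Q) > 0                 # 1 bit per dimension for every database vector
q_rot = query @ Q

# Estimate similarity from the quantized codes (dot product with sign codes).
est = (2 * codes.astype(np.float32) - 1) @ q_rot
true = data @ query

# The estimate preserves the ranking reasonably well at a 32x compression rate.
top_est = set(np.argsort(-est)[:100])
top_true = set(np.argsort(-true)[:100])
print("overlap of top-100:", len(top_est & top_true))
```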

[IR-10] Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank RECSYS2024

链接: https://arxiv.org/abs/2409.09881
作者: Shashank Gupta,Harrie Oosterhuis,Maarten de Rijke
关键词-EN: Counterfactual learning, CLTR, produce sub-optimal models, produce sub-optimal, PRPO
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted at the CONSEQUENCES 2024 workshop, co-located with ACM RecSys 2024

点击查看摘要

Abstract:Counterfactual learning to rank (CLTR) can be risky and, in various circumstances, can produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods, cannot handle trust bias, and relies on specific assumptions about user behavior. We propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit on how much learned models can degrade performance metrics, without relying on any specific user assumptions. Our experiments show that PRPO provides higher performance than the existing safe inverse propensity scoring approach. PRPO always maintains safety, even in maximally adversarial situations. By avoiding assumptions, PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.
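摘要提到“去除学习与安全模型差异过大的排序行为的激励”。一种便于理解的类比(这是示意性的类比,并非论文的精确目标函数)是 PPO 式的比值裁剪:新策略与安全策略对展示排序的倾向比值一旦超出阈值,梯度即消失。

```python
# Hypothetical sketch of a proximal/clipped ranking objective: gradients vanish once the
# learned policy's propensity drifts too far from the safe policy's. Not PRPO's exact form.
import torch

def clipped_ranking_objective(log_prob_new, log_prob_safe, reward, epsilon=0.2):
    """log_prob_*: log propensities of the displayed rankings; reward: observed utility."""
    ratio = torch.exp(log_prob_new - log_prob_safe)
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # Take the pessimistic branch so there is no incentive to move outside the clip range.
    return torch.minimum(ratio * reward, clipped * reward).mean()

log_prob_new = torch.tensor([-1.2, -0.4, -2.0], requires_grad=True)
log_prob_safe = torch.tensor([-1.0, -1.0, -1.0])
reward = torch.tensor([1.0, 0.0, 2.0])
objective = clipped_ranking_objective(log_prob_new, log_prob_safe, reward)
(-objective).backward()   # maximize the objective by minimizing its negative
```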

[IR-11] CROSS-JEM: Accurate and Efficient Cross-encoders for Short-text Ranking Tasks

链接: https://arxiv.org/abs/2409.09795
作者: Bhawna Paliwal,Deepak Saini,Mudit Dhawan,Siddarth Asokan,Nagarajan Natarajan,Surbhi Aggarwal,Pankaj Malhotra,Jian Jiao,Manik Varma
关键词-EN: core problem, Ranking, Joint Efficient Modeling, score multiple items, items
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Ranking a set of items based on their relevance to a given query is a core problem in search and recommendation. Transformer-based ranking models are the state-of-the-art approaches for such tasks, but they score each query-item pair independently, ignoring the joint context of other relevant items. This leads to sub-optimal ranking accuracy and high computational costs. In response, we propose Cross-encoders with Joint Efficient Modeling (CROSS-JEM), a novel ranking approach that enables transformer-based models to jointly score multiple items for a query, maximizing parameter utilization. CROSS-JEM leverages (a) redundancies and token overlaps to jointly score multiple items, which are typically short-text phrases arising in search and recommendations, and (b) a novel training objective that models ranking probabilities. CROSS-JEM achieves state-of-the-art accuracy and over 4x lower ranking latency than standard cross-encoders. Our contributions are threefold: (i) we highlight the gap between the ranking application’s need for scoring thousands of items per query and the limited capabilities of current cross-encoders; (ii) we introduce CROSS-JEM for joint efficient scoring of multiple items per query; and (iii) we demonstrate state-of-the-art accuracy on standard public datasets and a proprietary dataset. CROSS-JEM opens up new directions for designing tailored early-attention-based ranking models that incorporate strict production constraints such as item multiplicity and latency.
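摘要中“建模排序概率的训练目标”可以用一个 listwise softmax 目标来直观理解:对同一查询下联合打出的全部候选分数做 softmax,再与标签分布对齐。下面只演示该目标本身,联合编码多个候选的 cross-encoder 结构此处不复现,输入分数为任意示例。

```python
# Listwise ranking-probability objective over jointly scored candidates for one query.
# Illustration of the general objective only, not the CROSS-JEM encoder itself.
import torch
import torch.nn.functional as F

def listwise_loss(joint_scores, relevance):
    """joint_scores: (num_items,) scores from one joint forward pass;
    relevance: (num_items,) graded relevance labels for the same items."""
    log_p_model = F.log_softmax(joint_scores, dim=0)   # model's ranking probabilities
    p_target = F.softmax(relevance.float(), dim=0)     # target distribution from labels
    return F.kl_div(log_p_model, p_target, reduction="sum")

scores = torch.tensor([2.1, 0.3, -1.0, 1.7], requires_grad=True)
labels = torch.tensor([2, 0, 0, 1])
loss = listwise_loss(scores, labels)
loss.backward()
```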

[IR-12] Measuring Recency Bias In Sequential Recommendation Systems RECSYS’24

链接: https://arxiv.org/abs/2409.09722
作者: Jeonglyul Oh,Sungzoon Cho
关键词-EN: overly high emphasis, sequential recommendation system, recommendation system refers, Recency bias, recent items
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Recency bias in a sequential recommendation system refers to the overly high emphasis placed on recent items within a user session. This bias can diminish the serendipity of recommendations and hinder the system’s ability to capture users’ long-term interests, leading to user disengagement. We propose a simple yet effective novel metric specifically designed to quantify recency bias. Our findings also demonstrate that high recency bias, as measured by our proposed metric, adversely impacts recommendation performance, and that mitigating it improves recommendation performance across all models evaluated in our experiments, highlighting the importance of measuring recency bias.
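直观理解这类度量的一种方式(下面是假设性的演示,并非论文定义的指标):统计模型推荐权重中落在会话最近 k 个交互物品上的比例。

```python
# Hypothetical recency-bias style measurement: how much recommendation weight lands on
# the most recent items of the session. Not the metric defined in the paper.
import numpy as np

def recency_share(session, recommended_scores, k=3):
    """session: interaction history (oldest -> newest); recommended_scores: item -> score."""
    recent = set(session[-k:])
    total = sum(recommended_scores.values())
    on_recent = sum(s for item, s in recommended_scores.items() if item in recent)
    return on_recent / total if total > 0 else 0.0

session = ["A", "B", "C", "D", "E"]
scores = {"E": 0.5, "D": 0.3, "A": 0.1, "Z": 0.1}   # model's top recommendations
print(round(recency_share(session, scores), 2))      # 0.8 -> strong focus on recent items
```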

[IR-13] AlpaPICO: Extraction of PICO Frames from Clinical Trial Documents Using LLMs

链接: https://arxiv.org/abs/2409.09704
作者: Madhusudan Ghosh,Shrimon Mukherjee,Asmit Ganguly,Partha Basuchowdhuri,Sudip Kumar Naskar,Debasis Ganguly
关键词-EN: clinical trial reports, clinical trial, conduct systematic reviews, scrutinizing systematic reviews, systematic reviews
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at Methods

点击查看摘要

Abstract:In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting Population, Intervention, Comparator, and Outcome (PICO) from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches to PICO frame extraction involve supervised methods that rely on manually annotated data points in the form of BIO label tagging. Recent approaches, such as In-Context Learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, still require the use of labeled examples. In this work, we adopt an ICL strategy that employs the pretrained knowledge of Large Language Models (LLMs), gathered during the pretraining phase of an LLM, to automatically extract PICO-related terminologies from clinical trial documents in an unsupervised setup, bypassing the need for a large number of annotated data instances. Additionally, to showcase the effectiveness of LLMs in an oracle scenario where a large number of annotated samples is available, we adopt an instruction tuning strategy that employs Low-Rank Adaptation (LoRA) to train the gigantic model in a low-resource environment for the PICO frame extraction task. Our empirical results show that our proposed ICL-based framework produces comparable results on all versions of the EBM-NLP dataset, and the proposed instruction-tuned version of our framework produces state-of-the-art results on all the different EBM-NLP datasets. Our project is available at this https URL.
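这里的 in-context learning 设定可以用一个少样本提示来示意:让通用 LLM 对新的试验摘要补全 PICO 要素。下面的示例摘要与 `call_llm` 函数均为占位符,并非论文使用的提示词或 EBM-NLP 数据。

```python
# Sketch of an in-context-learning prompt for PICO extraction; the demonstration text
# and the call_llm() helper are placeholders, not the paper's prompts or pipeline.
FEW_SHOT = """Extract PICO elements from the clinical trial abstract.

Abstract: Adults with type 2 diabetes received drug X or placebo; HbA1c was measured at 12 weeks.
Population: adults with type 2 diabetes
Intervention: drug X
Comparator: placebo
Outcome: HbA1c at 12 weeks

Abstract: {abstract}
Population:"""

def build_prompt(abstract: str) -> str:
    return FEW_SHOT.format(abstract=abstract)

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM endpoint is used (API call or local model)."""
    raise NotImplementedError

new_abstract = ("Children with asthma were randomised to inhaler Y or standard care; "
                "exacerbation rate over 6 months was recorded.")
prompt = build_prompt(new_abstract)
# completion = call_llm(prompt)  # the model continues with Population/Intervention/...
```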

[IR-14] Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator

链接: https://arxiv.org/abs/2409.09253
作者: Jun Yin,Zhengxin Zeng,Mingzheng Li,Hao Yan,Chaozhuo Li,Weihao Han,Jianjin Zhang,Ruochen Liu,Allen Sun,Denvy Deng,Feng Sun,Qi Zhang,Shirui Pan,Senzhang Wang
关键词-EN: large language models, pre-trained large language, shown fantastic potential, next-generation recommender systems, semantic index
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Owing to the unprecedented capability in semantic understanding and logical reasoning, the pre-trained large language models (LLMs) have shown fantastic potential in developing the next-generation recommender systems (RSs). However, the static index paradigm adopted by current methods greatly restricts the utilization of LLMs’ capacity for recommendation, leading to not only the insufficient alignment between semantic and collaborative knowledge, but also the neglect of high-order user-item interaction patterns. In this paper, we propose Twin-Tower Dynamic Semantic Recommender (TTDS), the first generative RS which adopts a dynamic semantic index paradigm, aiming to resolve the above problems simultaneously. To be more specific, we contrive, for the first time, a dynamic knowledge fusion framework which integrates a twin-tower semantic token generator into the LLM-based recommender, hierarchically allocating meaningful semantic indices for items and users, and accordingly predicting the semantic index of the target item. Furthermore, a dual-modality variational auto-encoder is proposed to facilitate multi-grained alignment between semantic and collaborative knowledge. Eventually, a series of novel tuning tasks specially customized for capturing high-order user-item interaction patterns are proposed to take advantage of users’ historical behavior. Extensive experiments across three public datasets demonstrate the superiority of the proposed methodology in developing LLM-based generative RSs. The proposed TTDS recommender achieves an average improvement of 19.41% in Hit-Rate and 20.84% in NDCG metric, compared with the leading baseline methods.
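实现“层次化语义索引”的一种常见做法是对物品向量做残差式逐层量化:每一层 k-means 的码字编号构成语义 ID 的一个 token。下面示意的是这种通用构造(基于 sklearn 的 KMeans),并非论文中的 twin-tower 语义 token 生成器;物品向量为随机示例。

```python
# Generic hierarchical semantic-ID construction via residual k-means quantization;
# an illustration of semantic indexing, not the TTDS twin-tower generator itself.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(1000, 64)).astype(np.float32)

def semantic_ids(embeddings, levels=3, codebook_size=16, seed=0):
    """Return one integer token per level for every item (its hierarchical semantic ID)."""
    residual = embeddings.copy()
    ids = []
    for level in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed + level)
        tokens = km.fit_predict(residual)
        ids.append(tokens)
        residual = residual - km.cluster_centers_[tokens]   # quantize, then keep the residual
    return np.stack(ids, axis=1)                             # shape (num_items, levels)

ids = semantic_ids(item_embeddings)
print(ids[:3])   # token sequences an LLM-based recommender could generate as item indices
```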

[IR-15] HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications EMNLP2024

链接: https://arxiv.org/abs/2409.09046
作者: Rishi Kalra,Zekun Wu,Ayesha Gulley,Airlie Hilliard,Xin Guan,Adriano Koshiyama,Philip Treleaven
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review for the EMNLP 2024 Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual

点击查看摘要

[IR-16] A Comparative Study on Enhancing Prediction in Social Network Advertisement through Data Augmentation

链接: https://arxiv.org/abs/2404.13812
作者: Qikai Yang,Panfeng Li,Xinhe Xu,Zhicheng Ding,Wenjing Zhou,Yi Nian
关键词-EN:
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by 2024 4th International Conference on Machine Learning and Intelligent Systems Engineering (MLISE)

点击查看摘要

附件下载

点击下载今日全部论文列表