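arXiv Daily Papers | 2024-09-11

This post lists the latest papers retrieved from arXiv.org on 2024-09-11. The list updates automatically and is grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data comes from arXiv.org and is refreshed automatically at about 10:30 each morning. The email digest is sent at about the same time.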

Link: https://arxiv.org/abs/2409.06691
Authors: Hiroki Furuta, Kuang-Huei Lee, Shixiang Shane Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
Keywords: human preferences assume, human preferences, Direct Preference Optimization, soft preference labels, algorithms for aligning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. However, it is reasonable to think that they can vary with different individuals, and thus should be distributional to reflect the fine-grained relationship between the responses. In this work, we introduce the distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function. In doing so, the scale of learning loss is adjusted based on the soft labels, and the loss with equally preferred responses would be close to zero. This simple modification can be easily applied to any DPO family and helps the models escape from the over-optimization and objective mismatch prior works suffer from. In our experiments, we simulate the soft preference labels with AI feedback from LLMs and demonstrate that geometric averaging consistently improves performance on standard benchmarks for alignment research. In particular, we observe more preferable responses than binary labels and significant improvements with data where modestly-confident labels are in the majority.
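The abstract above describes weighting the DPO loss by a geometric average of output likelihoods under a soft preference label. The following is a minimal sketch of one way to read that idea, not the authors' implementation: the function names, the `beta` default, and the exact form of the loss are assumptions made for illustration. In log space, a weighted geometric average becomes a weighted sum, so the preference margin is scaled by `2 * p_soft - 1` and vanishes when both responses are equally preferred.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def soft_label_dpo_loss(logratio_1: float, logratio_2: float,
                        p_soft: float, beta: float = 0.1) -> float:
    """Hypothetical soft-label variant of the DPO loss.

    logratio_i is log pi(y_i | x) - log pi_ref(y_i | x) for response y_i.
    p_soft is the (assumed) probability that y_1 is preferred over y_2.
    """
    # Weighted geometric averages of the likelihood ratios, written in log space.
    winner = p_soft * logratio_1 + (1.0 - p_soft) * logratio_2
    loser = (1.0 - p_soft) * logratio_1 + p_soft * logratio_2
    # Algebraically, margin == (2 * p_soft - 1) * (logratio_1 - logratio_2).
    margin = winner - loser
    return -log_sigmoid(beta * margin)
```

With `p_soft = 1.0` this reduces to the usual binary DPO loss. With `p_soft = 0.5` the margin is zero, so the loss becomes a constant and contributes no gradient for that pair.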

[NLP-1] E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Link: https://arxiv.org/abs/2409.06679
Authors: Zihan Liao, Jun Wang, Hang Yu, Lingxiao Wei, Jianguo Li, Jun Wang, Wei Zhang
Keywords: Large Language Models, Elongated Large Language, Large Language, Language Models, Encoder Elongated Large
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures

Abstract:In the realm of Large Language Models (LLMs), the ability to process long contexts is increasingly crucial for tasks such as multi-round dialogues, code generation, and document summarization. This paper addresses the challenges of enhancing the long-context performance, reducing computational complexity, and leveraging pretrained models collectively termed the “impossible triangle.” We introduce E2LLM (Encoder Elongated Large Language Models), a novel approach that effectively navigates this paradox. The method involves splitting long contexts into chunks, compressing each into embedding vectors via a pretrained text encoder, and utilizing an adapter to align these representations with a decoder-only LLM. Two training objectives, focusing on reconstruction of the encoder output and long-context instruction fine-tuning, are employed to facilitate the understanding of soft prompts by the LLM. Experimental results demonstrate that E2LLM achieves superior performance in long-context scenarios while balancing efficiency, performance, and compatibility with pretrained models. Our framework thus represents a significant advancement in the field, contributing to effective long-text modeling.

[NLP-2] LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Link: https://arxiv.org/abs/2409.06666
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
Keywords: significantly enhancing user, enhancing user experience, enable real-time interaction, traditional text-based interaction, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint. Project: this https URL

Abstract:Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.

[NLP-3] TeXBLEU: Automatic Metric for Evaluate LaTeX Format

Link: https://arxiv.org/abs/2409.06639
Authors: Kyudan Jung, Nam-Joon Kim, Hyongon Ryu, Sieun Hyeon, Seung-jun Lee, Hyeok-jae Lee
Keywords: computer science, special formatting, highly suited, suited to creating, creating documents
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 3 figures

Abstract:LaTeX is highly suited to creating documents with special formatting, particularly in the fields of science, technology, mathematics, and computer science. Despite the increasing use of mathematical expressions in LaTeX format with language models, there are no evaluation metrics for evaluating them. In this study, we propose TeXBLEU, an evaluation metric tailored for mathematical expressions in LaTeX format, based on the n-gram-based BLEU metric that is widely used for translation tasks. The proposed TeXBLEU includes a predefined tokenizer trained on the arXiv paper dataset and a finetuned embedding model. It also considers the positional embedding of tokens. Simultaneously, TeXBLEU compares tokens based on n-grams and computes the score using exponentiation of a logarithmic sum, similar to the original BLEU. Experimental results show that TeXBLEU outperformed traditional evaluation metrics such as BLEU, Rouge, CER, and WER when compared to human evaluation data on the test dataset of the MathBridge dataset, which contains 1,000 data points. The average correlation coefficient with human evaluation was 0.71, which is an improvement of 87% compared with BLEU, which had the highest correlation with human evaluation data among the existing metrics. The code is available at this https URL.
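The abstract above says the metric compares tokens by n-grams and computes the score by exponentiating a logarithmic sum, as the original BLEU does. As a rough illustration of that scoring step only, here is a plain BLEU-style sketch. It is not TeXBLEU: the paper's trained tokenizer, fine-tuned embedding model, and positional embeddings are omitted, and the function names and the `max_n` default are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu_like(candidate, reference, max_n=2):
    """BLEU-style score: exp of the averaged log n-gram precisions."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if overlap == 0:
            return 0.0
        log_sum += math.log(overlap / total) / max_n
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(candidate))
    return penalty * math.exp(log_sum)
```

For example, `bleu_like(["\\frac", "{", "a", "}", "{", "b", "}"], ["\\frac", "{", "a", "}", "{", "c", "}"])` scores a fraction whose denominator differs from the reference. Exact token matching like this is the part the paper replaces with embedding-based comparison.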

[NLP-4] MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

Link: https://arxiv.org/abs/2409.06635
Authors: Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
Keywords: language processing capabilities, natural language processing, enhanced natural language, inputs alongside text, large language models
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.

[NLP-5] A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Link: https://arxiv.org/abs/2409.06624
Authors: Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji
Keywords: Large Language Models, unfamiliar language skill, Continual Pre-Trained, Language Mixture Ratio, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures

Abstract:Large Language Models (LLM) often needs to be Continual Pre-Trained (CPT) to obtain the unfamiliar language skill or adapt into new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study which bridge the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size which directly indicate the optimal experimental set up. By thorough choice of hyper-parameter, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark, but also some specific domains including math, coding and emotional intelligence. We deploy the final 70B version of LLM on an real-life chat system which obtain satisfying performance.

[NLP-6] Exploring Italian sentence embeddings properties through multi-tasking

Link: https://arxiv.org/abs/2409.06622
Authors: Vivi Nastase, Giuseppe Samo, Chunyang Jiang, Paola Merlo
Keywords: degree existing LLMs, Blackbird Language Matrices, existing LLMs encode, multi-task setting, degree existing
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 9 figures, 3 tables

Abstract:We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale – several Blackbird Language Matrices (BLMs) problems in Italian – and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure – in terms of sequence of phrases/chunks – and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles does not seem to be present in the pretrained sentence embeddings.

[NLP-7] Alleviating Hallucinations in Large Language Models with Scepticism Modeling

Link: https://arxiv.org/abs/2409.06601
Authors: Yetao Wu, Yihong Wang, Teng Chen, Chenxi Liu, Ningyuan Xi, Qingqing Gu, Hongyang Lei, Zhonglin Jiang, Yong Chen, Luo Ji
Keywords: large language models, prevents adoption, diverse fields, major challenge, challenge for large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures

Abstract:Hallucinations is a major challenge for large language models (LLMs), prevents adoption in diverse fields. Uncertainty estimation could be used for alleviating the damages of hallucinations. The skeptical emotion of human could be useful for enhancing the ability of self estimation. Inspirited by this observation, we proposed a new approach called Skepticism Modeling (SM). This approach is formalized by combining the information of token and logits for self estimation. We construct the doubt emotion aware data, perform continual pre-training, and then fine-tune the LLMs, improve their ability of self estimation. Experimental results demonstrate this new approach effectively enhances a model’s ability to estimate their uncertainty, and validate its generalization ability of other tasks by out-of-domain experiments.

[NLP-8] GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

Link: https://arxiv.org/abs/2409.06595
Authors: Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud
Keywords: Retrieval-Augmented Generation, Large Language Models, Large Language, knowledge bases, alongside private
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4’s judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4’s reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4’s evaluations and calibration on reference situations.

[NLP-9] Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement

Link: https://arxiv.org/abs/2409.06567
Authors: Vivi Nastase, Chunyang Jiang, Giuseppe Samo, Paola Merlo
Keywords: cross-linguistically valid abstract, pretrained language models, capture cross-linguistically valid, models capture cross-linguistically, valid abstract linguistic
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 5 tables, 5 figures

Abstract:In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.

[NLP-10] From LIMA to DeepLIMA: following a new path of interoperability

Link: https://arxiv.org/abs/2409.06550
Authors: Victor Bocharov, Romaric Besançon, Gaël de Chalendar, Olivier Ferret, Nasredine Semmar
Keywords: Libre Multilingual Analyzer, Libre Multilingual, Multilingual Analyzer, text analysis modules, analysis modules based
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures, submitted to Language Resources and Evaluation

Abstract:In this article, we describe the architecture of the LIMA (Libre Multilingual Analyzer) framework and its recent evolution with the addition of new text analysis modules based on deep neural networks. We extended the functionality of LIMA in terms of the number of supported languages while preserving existing configurable architecture and the availability of previously developed rule-based and statistical analysis components. Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, WikiNer corpora, and CoNLL-03 dataset. Universal Dependencies allowed us to increase the number of supported languages and to generate models that could be integrated into other platforms. This integration of ubiquitous Deep Learning Natural Language Processing models and the use of standard annotated collections using Universal Dependencies can be viewed as a new path of interoperability, through the normalization of models and data, that are complementary to a more standard technical interoperability, implemented in LIMA through services available in Docker containers on Docker Hub.

[NLP-11] Mapping News Narratives Using LLMs and Narrative-Structured Text Embeddings

Link: https://arxiv.org/abs/2409.06540
Authors: Jan Elfes
Keywords: international politics, development over time, profound impact, personal identities, identities to international
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 13 figures, 4 tables

Abstract:Given the profound impact of narratives across various societal levels, from personal identities to international politics, it is crucial to understand their distribution and development over time. This is particularly important in online spaces. On the Web, narratives can spread rapidly and intensify societal divides and conflicts. While many qualitative approaches exist, quantifying narratives remains a significant challenge. Computational narrative analysis lacks frameworks that are both comprehensive and generalizable. To address this gap, we introduce a numerical narrative representation grounded in structuralist linguistic theory. Chiefly, Greimas’ Actantial Model represents a narrative through a constellation of six functional character roles. These so-called actants are genre-agnostic, making the model highly generalizable. We extract the actants using an open-source LLM and integrate them into a Narrative-Structured Text Embedding that captures both the semantics and narrative structure of a text. We demonstrate the analytical insights of the method on the example of 5000 full-text news articles from Al Jazeera and The Washington Post on the Israel-Palestine conflict. Our method successfully distinguishes articles that cover the same topics but differ in narrative structure.

[NLP-12] Questioning Internal Knowledge Structure of Large Language Models Through the Lens of the Olympic Games

Link: https://arxiv.org/abs/2409.06518
Authors: Juhwan Choi, YoungBin Kim
Keywords: Large language models, natural language processing, internal knowledge structures, remain largely unexplored, Large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have become a dominant approach in natural language processing, yet their internal knowledge structures remain largely unexplored. In this paper, we analyze the internal knowledge structures of LLMs using historical medal tallies from the Olympic Games. We task the models with providing the medal counts for each team and identifying which teams achieved specific rankings. Our results reveal that while state-of-the-art LLMs perform remarkably well in reporting medal counts for individual teams, they struggle significantly with questions about specific rankings. This suggests that the internal knowledge structures of LLMs are fundamentally different from those of humans, who can easily infer rankings from known medal counts. To support further research, we publicly release our code, dataset, and model outputs.
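The abstract above contrasts LLMs with humans, who can infer rankings directly from known medal counts. That inference is a plain sort, as the sketch below shows. The gold-first ordering, the tie handling, and the function name are assumptions for illustration; the paper may rank teams differently.

```python
def rank_teams(medals):
    """Rank teams from medal counts.

    medals maps a team name to a (gold, silver, bronze) tuple. Teams are
    ordered by gold, then silver, then bronze (an assumed convention), and
    teams with identical counts share a rank.
    """
    ordered = sorted(medals.items(), key=lambda item: item[1], reverse=True)
    ranked = []
    for index, (team, counts) in enumerate(ordered):
        if index > 0 and counts == ordered[index - 1][1]:
            rank = ranked[-1][0]  # tie: reuse the previous team's rank
        else:
            rank = index + 1
        ranked.append((rank, team))
    return ranked
```

Given made-up counts such as `{"Team A": (1, 0, 0), "Team B": (3, 1, 0)}`, the function places Team B first. Answering "which team finished second?" is then a lookup in the returned list.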

[NLP-13] An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition

Link: https://arxiv.org/abs/2409.06468
Authors: Yi-Cheng Wang, Li-Ting Pai, Bi-Cheng Yan, Hsin-Wei Wang, Chi-Han Lin, Berlin Chen
Keywords: automatic speech recognition, ASR models, context list, ASR, automatic speech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by SLT 2024

Abstract:End-to-end (E2E) automatic speech recognition (ASR) models have become standard practice for various commercial applications. However, in real-world scenarios, the long-tailed nature of word distribution often leads E2E ASR models to perform well on common words but fall short in recognizing uncommon ones. Recently, the notion of a contextual adapter (CA) was proposed to infuse external knowledge represented by a context word list into E2E ASR models. Although CA can improve recognition performance on rare words, two crucial data imbalance problems remain. First, when using low-frequency words as context words during training, since these words rarely occur in the utterance, CA becomes prone to overfit on attending to the no-context token due to higher-frequency words not being present in the context list. Second, the long-tailed distribution within the context list itself still causes the model to perform poorly on low-frequency context words. In light of this, we explore in-depth the impact of altering the context list to have words with different frequency distributions on model performance, and meanwhile extend CA with a simple yet effective context-balanced learning objective. A series of experiments conducted on the AISHELL-1 benchmark dataset suggests that using all vocabulary words from the training corpus as the context list and pairing them with our balanced objective yields the best performance, demonstrating a significant reduction in character error rate (CER) by up to 1.21% and a more pronounced 9.44% reduction in the error rate of zero-shot words.
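The results above are reported as character error rate (CER). As a reminder of the usual definition, CER is the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below implements that definition. The function names are illustrative, and the paper's own scoring tool may normalize text differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the distance is computed per character, the same function applies to Chinese transcripts such as those in AISHELL-1 without word segmentation.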

[NLP-14] HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data

Link: https://arxiv.org/abs/2409.06446
Authors: Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz
Keywords: shown great potential, GitHub Copilot, Large language models, Large language, shown great
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 24 pages, 16 tables, 8 figures

Abstract:Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that many LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by training models that generate secure code, these attempts remain constrained by limited access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes by automatically synthesizing secure codes, which reduces the effort of finding suitable training data. HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation. The data synthesis pipeline generates pairs of vulnerable and fixed codes for specific Common Weakness Enumeration (CWE) types by utilizing a state-of-the-art LLM for repairing vulnerable code. A security oracle identifies vulnerabilities, and a state-of-the-art LLM repairs them by extending and/or editing the codes, creating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA) method. Each example of our fine-tuning dataset includes the necessary security-related libraries and code that form the basis of our novel two-step generation approach. This allows the model to integrate security-relevant libraries before generating the main code, significantly reducing the number of generated vulnerable codes by up to 85% compared to the baseline methods. We perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness. 

[NLP-15] Length Desensitization in Directed Preference Optimization
[NLP-15] 定向偏好优化中的长度脱敏

链接: https://arxiv.org/abs/2409.06411
作者: Wei Liu,Yang Bai,Chengcheng Han,Rongxiang Weng,Jun Xu,Xuezhi Cao,Jingang Wang,Xunliang Cai
关键词-EN: Large Language Models, align Large Language, Large Language, Human Feedback, Direct Preference Optimization
关键词-ZH: 大型语言模型、对齐大型语言、大型语言、人类反馈、直接偏好优化
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 21 pages, 9 figures

点击查看摘要

Abstract:Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO’s optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during the DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We utilized two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conducted in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with human-real preferences.
摘要：直接偏好优化(DPO)被广泛应用于人类反馈强化学习(RLHF)阶段，以使大语言模型(LLM)与人类偏好对齐，从而提高其无害性和有效性。但是，已经观察到DPO倾向于针对冗长进行过度优化，这可能会对性能和用户体验造成不利影响。本文对DPO的优化目标进行了深入的理论分析，发现其隐式奖励与数据长度之间存在很强的相关性。这种相关性误导了优化方向，导致DPO训练期间对长度敏感，并最终导致输出冗长。为了解决这个问题，我们提出了一种DPO的长度脱敏改进方法，称为LD-DPO。该方法旨在将相对不重要的显式长度偏好与其他隐式偏好解耦，使DPO对数据长度不敏感，从而更有效地学习内在偏好。我们使用Llama2-13B、Llama3-8B和Qwen2-7B的两种设置(Base和Instruct)在MT-Bench和AlpacaEval 2等不同的基准测试上进行了实验验证。实验结果表明，LD-DPO的性能始终优于DPO和其他基线方法，获得了更简洁的响应，与DPO相比，长度减少了10-40%。我们进行了深入的实验分析，以证明LD-DPO确实可以实现长度脱敏，并使模型更接近人类的真实偏好。
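
For context, the standard DPO objective that this abstract analyzes can be written as below. This is the commonly used formulation of DPO, not the LD-DPO modification, whose exact form the abstract does not give. The symbols (policy \pi_\theta, reference model \pi_{\mathrm{ref}}, preferred and dispreferred responses y_w and y_l, temperature \beta) are the usual notation and are not taken from the text above.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right],
\qquad
\log\pi(y\mid x)=\sum_{t=1}^{|y|}\log\pi(y_t\mid x,\,y_{<t}).
```

Because each sequence log-likelihood is a sum over |y| tokens, the implicit reward \beta\log(\pi_\theta/\pi_{\mathrm{ref}}) accumulates one term per token. That is one plausible reading of how response length enters the objective, and it is consistent with the correlation between implicit reward and data length that the abstract reports.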

[NLP-16] Coarse-Grained Sense Inventories Based on Semantic Matching between English Dictionaries
[NLP-16] 基于英语词典语义匹配的粗粒度意义清单

链接: https://arxiv.org/abs/2409.06386
作者: Masato Kikuchi,Masatsugu Ono,Toshioki Soga,Tetsu Tanabe,Tadachika Ozono
关键词-EN: largest handcrafted concept, visualizing word connections, handcrafted concept dictionaries, concept dictionaries visualizing, dictionaries visualizing word
关键词-ZH: 最大的手工概念，可视化单词连接，手工概念词典，概念词典可视化，词典可视化
类目: Computation and Language (cs.CL)
备注: The 11th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA 2024)

点击查看摘要

Abstract:WordNet is one of the largest handcrafted concept dictionaries visualizing word connections through semantic relationships. It is widely used as a word sense inventory in natural language processing tasks. However, WordNet’s fine-grained senses have been criticized for limiting its usability. In this paper, we semantically match sense definitions from Cambridge dictionaries and WordNet and develop new coarse-grained sense inventories. We verify the effectiveness of our inventories by comparing their semantic coherences with that of Coarse Sense Inventory. The advantages of the proposed inventories include their low dependency on large-scale resources, better aggregation of closely related senses, CEFR-level assignments, and ease of expansion and improvement.
摘要：WordNet是最大的人工构建的概念词典之一，通过语义关系将词语之间的联系可视化。它被广泛用作自然语言处理任务中的词义清单。然而，WordNet的细粒度词义因限制其可用性而受到批评。在本文中，我们对剑桥词典和WordNet的词义定义进行了语义匹配，并开发了新的粗粒度词义清单。我们通过将这些清单的语义连贯性与Coarse Sense Inventory进行比较来验证其有效性。所提出清单的优势包括对大规模资源的依赖性低、能更好地聚合密切相关的词义、带有CEFR等级标注，以及易于扩展和改进。

[NLP-17] Enhancing Sequential Recommendations through Multi-Perspective Reflections and Iteration
[NLP-17] 通过多视角反思和迭代增强顺序推荐

链接: https://arxiv.org/abs/2409.06377
作者: Weicong Qin,Yi Xu,Weijie Yu,Chenglei Shen,Xiao Zhang,Ming He,Jianping Fan,Jun Xu
关键词-EN: understanding user intentions, Sequence recommendation, collaborative filtering information, aims to predict, leveraging collaborative filtering
关键词-ZH: 了解用户意图、序列推荐、协同过滤信息、旨在预测、利用协同过滤
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: First 3 authors contributes equally to this work

点击查看摘要

Abstract:Sequence recommendation (SeqRec) aims to predict the next item a user will interact with by understanding user intentions and leveraging collaborative filtering information. Large language models (LLMs) have shown great promise in recommendation tasks through prompt-based, fixed reflection libraries, and fine-tuning techniques. However, these methods face challenges, including lack of supervision, inability to optimize reflection sources, inflexibility to diverse user needs, and high computational costs. Despite promising results, current studies primarily focus on reflections of users’ explicit preferences (e.g., item titles) while neglecting implicit preferences (e.g., brands) and collaborative filtering information. This oversight hinders the capture of preference shifts and dynamic user behaviors. Additionally, existing approaches lack mechanisms for reflection evaluation and iteration, often leading to suboptimal recommendations. To address these issues, we propose the Mixture of REflectors (MoRE) framework, designed to model and learn dynamic user preferences in SeqRec. Specifically, MoRE introduces three reflectors for generating LLM-based reflections on explicit preferences, implicit preferences, and collaborative signals. Each reflector incorporates a self-improving strategy, termed refining-and-iteration, to evaluate and iteratively update reflections. Furthermore, a meta-reflector employs a contextual bandit algorithm to select the most suitable expert and corresponding reflections for each user’s recommendation, effectively capturing dynamic preferences. Extensive experiments on three real-world datasets demonstrate that MoRE consistently outperforms state-of-the-art methods, requiring less training time and GPU memory compared to other LLM-based approaches in SeqRec.
摘要：序列推荐(SeqRec)旨在通过理解用户意图和利用协同过滤信息来预测用户将要交互的下一个物品。大型语言模型(LLM)通过基于提示的方法、固定反思库和微调技术，在推荐任务中显示出了巨大的前景。然而，这些方法面临着挑战，包括缺乏监督、无法优化反思来源、难以灵活适应多样的用户需求，以及较高的计算成本。尽管结果令人振奋，但目前的研究主要集中在对用户显式偏好(例如物品标题)的反思，而忽略了隐式偏好(例如品牌)和协同过滤信息。这种疏忽阻碍了对偏好变化和动态用户行为的捕获。此外，现有的方法缺乏反思评估和迭代的机制，往往导致次优的推荐。为了解决这些问题，我们提出了混合反思器(Mixture of REflectors, MoRE)框架，旨在对SeqRec中的动态用户偏好进行建模和学习。具体地说，MoRE引入了三个反思器，用于生成基于LLM的关于显式偏好、隐式偏好和协同信号的反思。每个反思器都包含一种称为“精炼与迭代”(refining-and-iteration)的自我改进策略，以评估并迭代更新反思。此外，元反思器使用上下文赌博机(contextual bandit)算法为每个用户的推荐选择最合适的专家和相应的反思，有效地捕获动态偏好。在三个真实世界数据集上的广泛实验表明，MoRE的性能始终优于最先进的方法，并且与SeqRec中其他基于LLM的方法相比，所需的训练时间和GPU内存更少。

[NLP-18] SpeechTaxi: On Multilingual Semantic Speech Classification
[NLP-18] SpeechTaxi：多语言语音语义分类

链接: https://arxiv.org/abs/2409.06372
作者: Lennart Keller,Goran Glavaš
关键词-EN: Recent advancements, semantic speech classification, multilingual speech encoding, raise the question, speech classification
关键词-ZH: 最近的进展，语义语音分类，多语言语音编码，提出了问题，语音分类
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent advancements in multilingual speech encoding as well as transcription raise the question of the most effective approach to semantic speech classification. Concretely, can (1) end-to-end (E2E) classifiers obtained by fine-tuning state-of-the-art multilingual speech encoders (MSEs) match or surpass the performance of (2) cascading (CA), where speech is first transcribed into text and classification is delegated to a text-based classifier. To answer this, we first construct SpeechTaxi, an 80-hour multilingual dataset for semantic speech classification of Bible verses, covering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide range of experiments comparing E2E and CA in monolingual semantic speech classification as well as in cross-lingual transfer. We find that E2E based on MSEs outperforms CA in monolingual setups, i.e., when trained on in-language data. However, MSEs seem to have poor cross-lingual transfer abilities, with E2E substantially lagging CA both in (1) zero-shot transfer to languages unseen in training and (2) multilingual training, i.e., joint training on multiple languages. Finally, we devise a novel CA approach based on transcription to Romanized text as a language-agnostic intermediate representation and show that it represents a robust solution for languages without native ASR support. Our SpeechTaxi dataset is publicly available at: this https URL datasets/LennartKeller/SpeechTaxi/.
摘要：多语种语音编码和转录的最新进展提出了一个问题：语义语音分类的最有效方法是什么。具体地说，(1)通过微调最先进的多语言语音编码器(MSE)获得的端到端(E2E)分类器，能否达到或超过(2)级联(CA)方法的性能？在级联方法中，语音首先被转录成文本，分类则交给基于文本的分类器完成。为了回答这个问题，我们首先构建了SpeechTaxi，这是一个涵盖28种不同语言、时长80小时的多语种数据集，用于对圣经经文进行语义语音分类。然后，我们利用SpeechTaxi进行了大量的实验，比较了E2E和CA在单语语义语音分类以及跨语言迁移方面的表现。我们发现，基于MSE的E2E在单语言设置下(即在目标语言数据上训练时)优于CA。然而，MSE的跨语言迁移能力似乎较差，E2E在以下两个方面明显落后于CA：(1)零样本迁移到训练中未见过的语言；(2)多语言训练，即在多种语言上联合训练。最后，我们设计了一种新的CA方法，以转录得到的罗马化文本作为与语言无关的中间表示，并证明了它对于没有原生ASR支持的语言是一个稳健的解决方案。我们的SpeechTaxi数据集可在以下网址公开获得：this https URL datasets/LennartKeller/SpeechTaxi/。

[NLP-19] Retrieval Or Holistic Understanding? Dolce: Differentiate Our Long Context Evaluation Tasks
[NLP-19] 检索还是整体理解？Dolce：区分我们的长期背景评估任务

链接: https://arxiv.org/abs/2409.06338
作者: Zi Yang
关键词-EN: major distinct capabilities, holistic understanding focused, holistic understanding, long context understanding, long context capabilities
关键词-ZH: 主要独特的能力，以整体理解为重点，整体理解，长背景理解，长背景能力
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We argue that there are two major distinct capabilities in long context understanding: retrieval and holistic understanding. Understanding and further improving LLMs’ long context capabilities would not be possible without knowing the tasks’ focus categories. We aim to automatically identify retrieval focused and holistic understanding focused problems from suites of benchmarks and quantitatively measure the difficulty within each focus. In this paper, we present the Dolce framework, which parameterizes each problem by \lambda (complexity) and k (redundancy) and assigns to one of five predefined focus categories. We propose to sample short contexts from the full context and estimate the probability an LLM solves the problem using the sampled spans. To find the \lambda and k for each problem, we further propose a mixture model of a non-parametric background noise component and a parametric/non-parametric hybrid oracle component, where we derive the probability functions parameterized by \lambda and k for both the correct-or-wrong (COW) scenario and the partial-point-in-grading (PIG) scenario. Our proposed methods can identify 0% to 67% of the problems are retrieval focused and 0% to 90% of the problems are holistic understanding focused across 44 existing long context evaluation tasks.
摘要：我们认为长上下文理解有两种主要的不同能力：检索和整体理解。如果不知道任务的侧重类别，就不可能理解并进一步提高LLM的长上下文能力。我们的目标是从一系列基准中自动识别侧重检索和侧重整体理解的问题，并定量衡量每种侧重类别内的难度。在本文中，我们提出了Dolce框架，它通过\lambda(复杂度)和k(冗余度)来参数化每个问题，并将其分配到五个预定义的侧重类别之一。我们建议从完整的上下文中采样短上下文，并估计LLM使用采样片段解决问题的概率。为了找出每个问题的\lambda和k，我们进一步提出了一个由非参数背景噪声分量和参数/非参数混合oracle分量组成的混合模型，并针对对错(COW)场景和部分得分(PIG)场景分别推导了以\lambda和k为参数的概率函数。在44个现有的长上下文评估任务中，我们提出的方法识别出0%到67%的问题侧重检索，0%到90%的问题侧重整体理解。

[NLP-20] Extracting Paragraphs from LLM Token Activations
[NLP-20] 从LLM词元激活中提取段落

链接: https://arxiv.org/abs/2409.06328
作者: Nicholas Pochinkov,Angelo Benoit,Lovkush Agarwal,Zainab Ali Majid,Lucile Ter-Minassian
关键词-EN: Generative large language, language processing tasks, natural language processing, workings remain underexplored, large language models
关键词-ZH: 生成性大型语言、语言处理任务、自然语言处理、工作方式仍然未充分探索、大型语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative large language models (LLMs) excel in natural language processing tasks, yet their inner workings remain underexplored beyond token-level predictions. This study investigates the degree to which these models decide the content of a paragraph at its onset, shedding light on their contextual understanding. By examining the information encoded in single-token activations, specifically the “\n\n” double newline token, we demonstrate that patching these activations can transfer significant information about the context of the following paragraph, providing further insights into the model’s capacity to plan ahead.
摘要：生成式大型语言模型（LLM）在自然语言处理任务中表现出色，但除词元级预测之外，它们的内部工作原理仍未得到充分探索。这项研究调查了这些模型在段落开头在多大程度上就已决定了该段落的内容，从而揭示它们对上下文的理解。通过检查单个词元激活中编码的信息，特别是“\n\n”双换行词元，我们证明修补(patching)这些激活可以传递有关后续段落上下文的重要信息，从而进一步深入了解模型提前规划的能力。

[NLP-21] Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking
[NLP-21] 关键字感知的ASR错误增强以实现稳健的对话状态跟踪

链接: https://arxiv.org/abs/2409.06263
作者: Jihyun Lee,Solee Im,Wonjun Lee,Gary Geunbae Lee
关键词-EN: Dialogue State Tracking, State Tracking, identifying important information, Automatic Speech Recognition, task-oriented dialogue systems
关键词-ZH: 对话状态跟踪、状态跟踪、识别重要信息、自动语音识别、面向任务的对话系统
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generated sufficient error patterns on keywords, leading to improved accuracy in noised and low-accuracy ASR environments.
摘要：对话状态跟踪（DST）是面向任务的对话系统的关键部分，用于识别对话中的重要信息。然而，由于自动语音识别（ASR）系统的命名实体错误，其在口语对话环境中的准确性显著下降。我们引入了一种简单而有效的数据增强方法，该方法针对这些实体，以提高DST模型的鲁棒性。我们的新方法可以使用关键字高亮的提示来控制错误出现的位置，同时引入语音上相似的错误。因此，我们的方法在关键词上生成了足够的错误模式，从而提高了在含噪声和低准确率的ASR环境中的准确性。

[NLP-22] Inference is All You Need: Self Example Retriever for Cross-domain Dialogue State Tracking with ChatGPT
[NLP-22] 推理即可：使用ChatGPT进行跨域对话状态跟踪的Self Example检索器

链接: https://arxiv.org/abs/2409.06243
作者: Jihyun Lee,Gary Geunbae Lee
关键词-EN: Traditional dialogue state, approaches heavily rely, extensive training data, tracking approaches heavily, dialogue state tracking
关键词-ZH: 传统对话状态、严重依赖方法、广泛的训练数据、严重跟踪方法、对话状态跟踪
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traditional dialogue state tracking approaches heavily rely on extensive training data and handcrafted features, limiting their scalability and adaptability to new domains. In this paper, we propose a novel method that leverages inference and in-context learning with ChatGPT for domain transfer in dialogue state tracking, without any parameter updates. By guiding ChatGPT’s chain of thought, we enable it to retrieve relevant examples and generalize knowledge to accurately infer dialogue states, solely through inference. Experimental results on the MultiWOZ dataset demonstrate competitive performance and promising generalization across domains. Our parameter-free approach offers a scalable and adaptable solution, opening new research directions in domain transfer learning.
摘要：传统的对话状态跟踪方法严重依赖于大量的训练数据和手工制作的特征，限制了其对新领域的可扩展性和适应性。在本文中，我们提出了一种新颖的方法，该方法利用ChatGPT的推理和上下文学习进行对话状态跟踪中的域转移，无需任何参数更新。通过引导ChatGPT的思想链，我们使其能够检索相关示例并概括知识，以仅通过推理准确地推断对话状态。MultiWOZ数据集的实验结果证明了具有竞争力的性能和有前途的跨领域通用性。我们的无参数方法提供了可扩展和适应性强的解决方案，为领域迁移学习开辟了新的研究方向。

[NLP-23] NLP-Powered Repository and Search Engine for Academic Papers: A Case Study on Cyber Risk Literature with CyLit
[NLP-23] NLP支持的学术论文知识库和搜索引擎：CyLit网络风险文献案例研究

链接: https://arxiv.org/abs/2409.06226
作者: Linfeng Zhang,Changyue Hu,Zhiyu Quan
关键词-EN: face increasing difficulties, researchers face increasing, academic literature continues, academic literature, continues to grow
关键词-ZH: 面临越来越多的困难，研究人员面临越来越多，学术文献继续，学术文献继续增长
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Risk Management (q-fin.RM)
备注:

点击查看摘要

Abstract:As the body of academic literature continues to grow, researchers face increasing difficulties in effectively searching for relevant resources. Existing databases and search engines often fall short of providing a comprehensive and contextually relevant collection of academic literature. To address this issue, we propose a novel framework that leverages Natural Language Processing (NLP) techniques. This framework automates the retrieval, summarization, and clustering of academic literature within a specific research domain. To demonstrate the effectiveness of our approach, we introduce CyLit, an NLP-powered repository specifically designed for the cyber risk literature. CyLit empowers researchers by providing access to context-specific resources and enabling the tracking of trends in the dynamic and rapidly evolving field of cyber risk. Through the automatic processing of large volumes of data, our NLP-powered solution significantly enhances the efficiency and specificity of academic literature searches. We compare the literature categorization results of CyLit to those presented in survey papers or generated by ChatGPT, highlighting the distinctive insights this tool provides into cyber risk research literature. Using NLP techniques, we aim to revolutionize the way researchers discover, analyze, and utilize academic resources, ultimately fostering advancements in various domains of knowledge.
摘要：随着学术文献量的不断增长，研究人员在有效地搜索相关资源方面面临着越来越大的困难。现有的数据库和搜索引擎往往不能提供全面的、与背景相关的学术文献集合。为了解决这个问题，我们提出了一个利用自然语言处理(NLP)技术的新框架。该框架自动化了对特定研究领域内的学术文献的检索、摘要和聚集。为了证明我们方法的有效性，我们引入了CyLit，这是一个专门为网络风险文献设计的NLP支持的存储库。CyLit通过提供对特定背景资源的访问并使其能够跟踪动态和快速发展的网络风险领域的趋势，从而增强研究人员的能力。通过自动处理大量数据，我们的NLP支持的解决方案显著提高了学术文献搜索的效率和专门性。我们将CyLit的文献分类结果与调查论文中提供的结果或由ChatGPT生成的结果进行了比较，突出了该工具为网络风险研究文献提供的独特见解。使用NLP技术，我们的目标是彻底改变研究人员发现、分析和利用学术资源的方式，最终促进各个知识领域的进步。

[NLP-24] Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
[NLP-24] 增强大型音频语言模型音频问题回答中的时间理解

链接: https://arxiv.org/abs/2409.06223
作者: Arvind Krishna Sridhar,Yinyi Guo,Erik Visser
关键词-EN: Audio Question Answering, Large Audio Language, Audio Language Models, Large Language Models, Question Answering
关键词-ZH: 音频问题解答，大型音频语言，音频语言模型，大型语言模型，问题解答
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:The Audio Question Answering task includes audio event classification, audio captioning, and open ended reasoning. Recently, Audio Question Answering has garnered attention due to the advent of Large Audio Language Models. Current literature focuses on constructing LALMs by integrating audio encoders with text only Large Language Models through a projection module. While Large Audio Language Models excel in general audio understanding, they are limited in temporal reasoning which may hinder their commercial applications and on device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we propose a continued finetuning curriculum learning strategy to specialize in temporal reasoning without compromising performance on finetuned tasks. Finally, we develop a reliable and transparent automated metric, assisted by an LLM, to measure the correlation between Large Audio Language Model responses and ground truth data intelligently. We demonstrate the effectiveness of our proposed techniques using SOTA LALMs on public audio benchmark datasets.
摘要：音频问答任务包括音频事件分类、音频字幕和开放式推理。最近，由于大型音频语言模型(LALM)的出现，音频问答引起了人们的关注。目前的文献集中于通过投影模块将音频编码器与纯文本的大型语言模型相集成来构建LALM。虽然大型音频语言模型在一般音频理解方面表现出色，但它们在时间推理方面存在局限性，这可能会阻碍它们的商业应用和端侧设备部署。本文讨论了音频时间推理中的这些挑战和限制。首先，我们介绍了一种数据增强技术，用于使用LLM生成可靠的音频时间问答。其次，我们提出了一种持续微调的课程学习策略，在不影响已微调任务性能的情况下，专注于时间推理。最后，我们开发了一个可靠和透明的自动化度量，在LLM的辅助下，智能地度量大型音频语言模型的响应与真实标注数据之间的相关性。我们使用SOTA LALM在公共音频基准数据集上演示了我们所提出的技术的有效性。

[NLP-25] Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings
[NLP-25] 利用多语言语义嵌入推进广播语音的主题分割

链接: https://arxiv.org/abs/2409.06222
作者: Sakshi Deo Shukla,Pavel Denisov,Tugtekin Turan
关键词-EN: Recent advancements, capture semantic representations, pretrained speech encoders, semantic representations directly, speech-based topic segmentation
关键词-ZH: 最新进展、捕获语义表示、预训练的语音编码器、直接语义表示、基于语音的主题分割
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent advancements in speech-based topic segmentation have highlighted the potential of pretrained speech encoders to capture semantic representations directly from speech. Traditionally, topic segmentation has relied on a pipeline approach in which transcripts of the automatic speech recognition systems are generated, followed by text-based segmentation algorithms. In this paper, we introduce an end-to-end scheme that bypasses this conventional two-step process by directly employing semantic speech encoders for segmentation. Focused on the broadcasted news domain, which poses unique challenges due to the diversity of speakers and topics within single recordings, we address the challenge of accessing topic change points efficiently in an end-to-end manner. Furthermore, we propose a new benchmark for spoken news topic segmentation by utilizing a dataset featuring approximately 1000 hours of publicly available recordings across six European languages and including an evaluation set in Hindi to test the model’s cross-domain performance in a cross-lingual, zero-shot scenario. This setup reflects real-world diversity and the need for models adapting to various linguistic settings. Our results demonstrate that while the traditional pipeline approach achieves a state-of-the-art P_k score of 0.2431 for English, our end-to-end model delivers a competitive P_k score of 0.2564. When trained multilingually, these scores further improve to 0.1988 and 0.2370, respectively. To support further research, we release our model along with data preparation scripts, facilitating open research on multilingual spoken news topic segmentation.
摘要：基于语音的主题分割的最新进展突出了预训练语音编码器直接从语音中获取语义表示的潜力。传统上，主题分割依赖于流水线方法：先由自动语音识别系统生成转录文本，再应用基于文本的分割算法。在本文中，我们介绍了一种端到端的方案，通过直接使用语义语音编码器进行分割，绕过了传统的两步过程。广播新闻领域由于单个录音中说话人和主题的多样性而带来了独特的挑战，我们聚焦于这一领域，解决了以端到端的方式高效获取主题转换点的挑战。此外，我们提出了一个新的口语新闻主题分割基准，它使用一个包含六种欧洲语言约1000小时公开可用录音的数据集，并包括一个印地语评估集，用于测试模型在跨语言、零样本场景中的跨域性能。这种设置反映了现实世界的多样性，以及对模型适应各种语言环境的需求。我们的结果表明，传统的流水线方法在英语上达到了最先进的P_k得分0.2431，而我们的端到端模型取得了具有竞争力的P_k得分0.2564。当进行多语种训练时，这些分数分别进一步改善到0.1988和0.2370。为了支持进一步的研究，我们发布了我们的模型和数据准备脚本，以促进多语言口语新闻主题分割的开放研究。
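
The P_k scores quoted above are error rates, so lower is better, which is why 0.1988 counts as an improvement over 0.2431. A minimal sketch of a simplified P_k computation follows. It assumes each unit (for example, a sentence) carries a segment id, and it defaults the window size k to half the mean reference segment length. The function name `pk` and the input encoding are illustrative choices, not the paper's evaluation code.

```python
def pk(reference, hypothesis, k=None):
    """Simplified P_k segmentation error (lower is better).

    reference, hypothesis: equal-length lists of segment ids, one per unit.
    k: probe distance; defaults to half the mean reference segment length.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    n = len(reference)
    if k is None:
        # Count reference segments as 1 + number of boundaries.
        num_segments = 1 + sum(a != b for a, b in zip(reference, reference[1:]))
        k = max(1, round(n / num_segments / 2))
    windows = n - k
    if windows <= 0:
        raise ValueError("sequence too short for window size k")
    # A probe is an error when reference and hypothesis disagree on whether
    # the two units k apart belong to the same segment.
    errors = sum(
        (reference[i] == reference[i + k]) != (hypothesis[i] == hypothesis[i + k])
        for i in range(windows)
    )
    return errors / windows


ref = [0, 0, 0, 0, 1, 1, 1, 1]      # one boundary in the middle
print(pk(ref, ref))                 # perfect segmentation
print(pk(ref, [0] * 8))             # hypothesis misses the boundary
```

Published implementations differ in details such as boundary encoding and window counting, so scores from this sketch are not directly comparable to the numbers in the abstract.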

[NLP-26] SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization
[NLP-26] SubRegWeigh：利用子词正则化实现有效且高效的标注加权

链接: https://arxiv.org/abs/2409.06216
作者: Kohei Tsuji,Tatsuya Hiraoka,Yuchang Cheng,Tomoya Iwakura
关键词-EN: natural language processing, language processing, natural language, NLP, include annotation errors
关键词-ZH: 自然语言处理、语言处理、自然语言、NLP，包括注释错误
类目: Computation and Language (cs.CL)
备注: 14 pages, 1 figures, 10 tables

点击查看摘要

Abstract:Many datasets of natural language processing (NLP) sometimes include annotation errors. Researchers have attempted to develop methods to reduce the adverse effect of errors in datasets automatically. However, an existing method is time-consuming because it requires many trained models to detect errors. We propose a novel method to reduce the time of error detection. Specifically, we use a tokenization technique called subword regularization to create pseudo-multiple models which are used to detect errors. Our proposed method, SubRegWeigh, can perform annotation weighting four to five times faster than the existing method. Additionally, SubRegWeigh improved performance in both document classification and named entity recognition tasks. In experiments with pseudo-incorrect labels, pseudo-incorrect labels were adequately detected.
摘要：许多自然语言处理（NLP）数据集有时会包含标注错误。研究人员试图开发自动减少数据集中错误的不利影响的方法。然而，现有的方法很耗时，因为它需要许多经过训练的模型来检测错误。我们提出了一种新颖的方法来减少错误检测的时间。具体来说，我们使用一种称为子词正则化(subword regularization)的分词技术来创建用于检测错误的伪多模型。我们提出的方法SubRegWeigh执行标注加权的速度比现有方法快四到五倍。此外，SubRegWeigh还提高了文档分类和命名实体识别任务的性能。在使用伪错误标签的实验中，伪错误标签被充分检测到。
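
The idea of using several tokenizations of one input as "pseudo-multiple models" can be sketched as follows. The abstract does not give the actual weighting formula, so everything here is an assumption made for illustration: `sample_segmentation` is a toy stand-in for a real subword-regularization sampler, `predict` is a hypothetical single trained classifier, and the weight is simply the fraction of sampled tokenizations whose prediction agrees with the annotated label.

```python
import random


def sample_segmentation(word, rng, split_prob=0.3):
    """Toy stand-in for subword regularization: split a word at random points."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if rng.random() < split_prob:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


def annotation_weight(text, label, predict, k=8, seed=0):
    """Assumed scheme: weight = share of k tokenizations that agree with the label."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(k):
        tokens = [p for w in text.split() for p in sample_segmentation(w, rng)]
        agree += predict(tokens) == label
    return agree / k


# Hypothetical classifier that always answers "pos", for demonstration only.
always_pos = lambda tokens: "pos"
print(annotation_weight("good movie", "pos", always_pos))
print(annotation_weight("good movie", "neg", always_pos))
```

With a real model, predictions would vary across tokenizations, and a low weight would flag an annotation the model rarely reproduces. Only one trained model is queried k times, which matches the abstract's motivation of avoiding many separately trained models.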

[NLP-27] STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
[NLP-27] STUN：结构化后非结构化修剪，实现可扩展的MoE修剪

链接: https://arxiv.org/abs/2409.06211
作者: Jaeseong Lee,seung-won hwang,Aurick Qiao,Daniel F Campos,Zhewei Yao,Yuxiong He
关键词-EN: Large language models, reducing inference costs, Large language, sparsely activating experts, pruning
关键词-ZH: 大型语言模型，降低推理成本，大型语言，稀疏激活专家，修剪
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in Large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this, by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring O(\frac{k^n}{\sqrt{n}}) forward passes for n experts, cannot scale for recent MoEs, we propose a scalable alternative with O(1) complexity, yet outperforming the more expensive methods. The key idea is leveraging a latent structure between experts, based on behavior similarity, such that the greedy decision of whether to prune closely captures the joint pruning effect. Ours is highly effective – for Snowflake Arctic, a 480B-sized MoE with 128 experts, our method needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art unstructured pruning fails to. The code will be made publicly available.
摘要：专家混合(MoE)通过稀疏地激活大型语言模型(LLM)中的专家来降低推理成本。尽管如此，MoE中数量庞大的专家仍然使其部署服务的成本很高。在本文中，我们研究如何通过对MoE进行剪枝来解决这个问题。在剪枝方法中，已知在给定的剪枝比率下，非结构化剪枝相比结构化剪枝能获得最高的性能，因为后者对稀疏化结构施加了约束。这是符合直觉的，因为非结构化剪枝的解空间包含了结构化剪枝的解空间。然而，我们反直觉的发现表明，专家剪枝(结构化剪枝的一种形式)实际上可以先于非结构化剪枝进行，从而优于仅使用非结构化剪枝。由于现有的专家剪枝对n个专家需要O(\frac{k^n}{\sqrt{n}})次前向传播，无法扩展到最近的MoE，我们提出了一种复杂度为O(1)的可扩展替代方案，其性能还优于那些更昂贵的方法。其关键思想是利用专家之间基于行为相似性的潜在结构，使得是否剪枝的贪心决策能够近似地反映联合剪枝的效果。我们的方法非常有效：对于拥有128个专家、规模为480B的MoE模型Snowflake Arctic，我们的方法只需要一块H100和两个小时，就能在40%稀疏度下实现几乎无损的性能，即使是在GSM8K这样的生成式任务中也是如此，而最先进的非结构化剪枝在这类任务上无法做到这一点。代码将公开发布。

[NLP-28] SHAPE-IT: Exploring Text-to-Shape-Display for Generative Shape-Changing Behaviors with LLMs
[NLP-28] SHAPE-IT：利用LLM探索文本到形状显示以实现生成性形状改变行为

链接: https://arxiv.org/abs/2409.06205
作者: Wanli Qian,Chenfeng Gao,Anup Sathya,Ryo Suzuki,Ken Nakagaki
关键词-EN: generating dynamic shape, natural language commands, pin-based shape displays, paper introduces, generating dynamic
关键词-ZH: 生成动态形状、自然语言命令、基于pin的形状显示、论文介绍、生成动态
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Accepted for ACM UIST 2024

点击查看摘要

Abstract:This paper introduces text-to-shape-display, a novel approach to generating dynamic shape changes in pin-based shape displays through natural language commands. By leveraging large language models (LLMs) and AI-chaining, our approach allows users to author shape-changing behaviors on demand through text prompts without programming. We describe the foundational aspects necessary for such a system, including the identification of key generative elements (primitive, animation, and interaction) and design requirements to enhance user interaction, based on formative exploration and iterative design processes. Based on these insights, we develop SHAPE-IT, an LLM-based authoring tool for a 24 x 24 shape display, which translates the user’s textual command into executable code and allows for quick exploration through a web-based control interface. We evaluate the effectiveness of SHAPE-IT in two ways: 1) performance evaluation and 2) user evaluation (N= 10). The study conclusions highlight the ability to facilitate rapid ideation of a wide range of shape-changing behaviors with AI. However, the findings also expose accuracy-related challenges and limitations, prompting further exploration into refining the framework for leveraging AI to better suit the unique requirements of shape-changing systems.
摘要：本文介绍了一种通过自然语言命令在基于引脚的形状显示中生成动态形状变化的新方法–文本到形状显示。通过利用大型语言模型(LLM)和人工智能链接，我们的方法允许用户在无需编程的情况下，通过文本提示按需编写形状改变行为。我们描述了这样一个系统所必需的基本方面，包括识别关键的生成元素(基元、动画和交互)，以及基于形成性探索和迭代设计过程来增强用户交互的设计要求。基于这些见解，我们开发了SHAPE-IT，这是一个基于LLM的创作工具，用于24 x 24形状显示，它将用户的文本命令转换为可执行代码，并允许通过基于Web的控制界面进行快速浏览。我们从两个方面对SHAPE-IT的有效性进行了评估：1)性能评估和2)用户评估(N=10)。研究结论强调了利用人工智能促进快速构思广泛的形状改变行为的能力。然而，这些发现也暴露了与精度相关的挑战和限制，促使人们进一步探索完善利用人工智能的框架，以更好地适应形状变化系统的独特要求。

[NLP-29] NOVI : Chatbot System for University Novice with BERT and LLMs
[NLP-29] NOVI：适合大学新手的聊天机器人系统，具有BERT和LLM

链接: https://arxiv.org/abs/2409.06192
作者: Yoonji Nam,TaeWoong Seo,Gyeongcheol Shin,Sangji Lee,JaeEun Im
关键词-EN: chatbot system based, mitigate the difficulties, university life, chatbot system, system based
关键词-ZH: 基于聊天机器人系统，缓解困难，大学生活，聊天机器人系统，基于系统
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To mitigate the difficulties of university freshmen in adapting to university life, we developed NOVI, a chatbot system based on GPT-4o. This system utilizes post and comment data from SKKU ‘Everytime’, a university community site. Developed using LangChain, NOVI’s performance has been evaluated with a BLEU score, Perplexity score, ROUGE-1 score, ROUGE-2 score, ROUGE-L score and METEOR score. This approach is not only limited to help university freshmen but is also expected to help various people adapting to new environments with different data. This research explores the development and potential application of new educational technology tools, contributing to easier social adaptation for beginners and settling a foundation for future advancement in LLM studies.
摘要：为了缓解大学新生适应大学生活的困难，我们开发了基于GPT-4o的聊天机器人系统NOVI。该系统利用来自大学社区网站SKKU“Everytime”的帖子和评论数据。NOVI使用LangChain开发，其性能已通过BLEU评分、Perplexity评分、ROUGE-1评分、ROUGE-2评分、ROUGE-L评分和METEOR评分进行评估。这种方法不仅限于帮助大学新生，还有望借助不同的数据帮助各类人群适应新环境。这项研究探索了新教育技术工具的开发和潜在应用，有助于初学者更容易适应社会，并为LLM研究的未来发展奠定基础。

[NLP-30] Can Large Language Models Unlock Novel Scientific Research Ideas?
[NLP-30] 大型语言模型能否解锁新颖的科学研究想法？

链接: https://arxiv.org/abs/2409.06185
作者: Sandeep Kumar,Tirthankar Ghosal,Vinayak Goyal,Asif Ekbal
关键词-EN: future research ideas, research ideas, Artificial Intelligence, Large Language Models, future research
关键词-ZH: 未来的研究思路，研究思路，人工智能，大型语言模型，未来的研究
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 24 pages, 12 figures, 6 tables

点击查看摘要

Abstract:“An idea is nothing more nor less than a new combination of old elements” (Young, J.W.). The widespread adoption of Large Language Models (LLMs) and publicly available ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people’s everyday lives. This study explores the capability of LLMs in generating novel research ideas based on information from research papers. We conduct a thorough examination of 4 LLMs in five domains (e.g., Chemistry, Computer, Economics, Medical, and Physics). We found that the future research ideas generated by Claude-2 and GPT-4 are more aligned with the author’s perspective than GPT-3.5 and Gemini. We also found that Claude-2 generates more diverse future research ideas than GPT-4, GPT-3.5, and Gemini 1.0. We further performed a human evaluation of the novelty, relevancy, and feasibility of the generated future research ideas. This investigation offers insights into the evolving role of LLMs in idea generation, highlighting both its capability and limitations. Our work contributes to the ongoing efforts in evaluating and utilizing language models for generating future research ideas. We make our datasets and codes publicly available.
摘要：“一个想法只不过是旧元素的新组合”(Young，J.W.)。大型语言模型(LLM)的广泛采用和公开可用的ChatGPT标志着人工智能(AI)融入人们日常生活的一个重要转折点。这项研究探索了LLM基于研究论文的信息产生新的研究想法的能力。我们对五个领域(例如，化学、计算机、经济、医学和物理)中的4个LLM进行了全面的考察。我们发现，Claude-2和GPT-4产生的未来研究思路比GPT-3.5和Gemini更符合作者的视角。我们还发现，Claude-2比GPT-4、GPT-3.5和Gemini 1.0产生了更多样化的未来研究想法。我们进一步对所产生的未来研究想法的新颖性、相关性和可行性进行了人工评估。这项研究提供了对LLM在创意生成中不断演变的角色的洞察，突出了它的能力和局限性。我们的工作有助于评估和利用语言模型来产生未来的研究想法。我们公开我们的数据集和代码。

[NLP-31] SQLucid: Grounding Natural Language Database Queries with Interactive Explanations
[NLP-31] SQLucid：通过交互式解释为自然语言数据库查询提供依据

链接: https://arxiv.org/abs/2409.06178
作者: Yuan Tian,Jonathan K. Kummerfeld,Toby Jia-Jun Li,Tianyi Zhang
关键词-EN: systems remain limited, remain limited, high-stakes domains, recent advances, advances in machine
关键词-ZH: 系统仍然有限，仍然有限，高风险领域，最新进展，机器的进步
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Accepted to UIST’24

点击查看摘要

Abstract:Though recent advances in machine learning have led to significant improvements in natural language interfaces for databases, the accuracy and reliability of these systems remain limited, especially in high-stakes domains. This paper introduces SQLucid, a novel user interface that bridges the gap between non-expert users and complex database querying processes. SQLucid addresses existing limitations by integrating visual correspondence, intermediate query results, and editable step-by-step SQL explanations in natural language to facilitate user understanding and engagement. This unique blend of features empowers users to understand and refine SQL queries easily and precisely. Two user studies and one quantitative experiment were conducted to validate SQLucid’s effectiveness, showing significant improvement in task completion accuracy and user confidence compared to existing interfaces. Our code is available at this https URL.
摘要：尽管机器学习的最新进展使数据库自然语言接口有了显著改进，但这些系统的准确性和可靠性仍然有限，尤其是在高风险领域。本文介绍了SQLucid，这是一种新型用户界面，可以弥合非专家用户和复杂数据库查询流程之间的差距。SQLucid通过集成视觉对应关系、中间查询结果和可编辑的自然语言分步SQL解释来解决现有的局限性，以促进用户理解和参与。这种独特的功能组合使用户能够轻松、准确地理解和改进SQL查询。我们进行了两项用户研究和一项定量实验来验证SQLucid的有效性，结果显示与现有界面相比，任务完成准确性和用户信心有显著提高。我们的代码可在此https URL上找到。

[NLP-32] Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks
[NLP-32] 更大的语言模型不在乎你如何思考：为什么思维链提示在主观任务中失败

链接: https://arxiv.org/abs/2409.06173
作者: Georgios Chochlakis,Niyantha Maruthu Pandiyan,Kristina Lerman,Shrikanth Narayanan
关键词-EN: Large Language Models, performing natural language, natural language tasks, Large Language, gradient-based methods
关键词-ZH: 大型语言模型，执行自然语言，自然语言任务，大型语言，基于梯度的方法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures, 1 table

点击查看摘要

Abstract:In-Context Learning (ICL) in Large Language Models (LLM) has emerged as the dominant technique for performing natural language tasks, as it does not require updating the model parameters with gradient-based methods. ICL promises to “adapt” the LLM to perform the present task at a competitive or state-of-the-art level at a fraction of the computational cost. ICL can be augmented by incorporating the reasoning process to arrive at the final label explicitly in the prompt, a technique called Chain-of-Thought (CoT) prompting. However, recent work has found that ICL relies mostly on the retrieval of task priors and less so on “learning” to perform tasks, especially for complex subjective domains like emotion and morality, where priors ossify posterior predictions. In this work, we examine whether “enabling” reasoning also creates the same behavior in LLMs, wherein the format of CoT retrieves reasoning priors that remain relatively unchanged despite the evidence in the prompt. We find that, surprisingly, CoT indeed suffers from the same posterior collapse as ICL for larger language models. Code is available at this https URL.
摘要：大型语言模型(LLM)中的上下文学习(ICL)由于不需要使用基于梯度的方法来更新模型参数，已成为执行自然语言任务的主要技术。ICL有望“调整”LLM，使其以一小部分计算成本在具有竞争力或最先进的水平上执行当前任务。ICL可以通过在提示中显式加入得出最终标签的推理过程来增强，这种技术称为思维链(CoT)提示。然而，最近的研究发现，ICL主要依赖于对任务先验的检索，而较少依赖“学习”来执行任务，特别是对于情绪和道德等复杂的主观领域，在这些领域中先验会使后验预测僵化。在这项工作中，我们研究了“启用”推理是否也会在LLM中产生相同的行为，即CoT的格式检索到的推理先验在提示中存在证据的情况下仍相对保持不变。我们发现，令人惊讶的是，对于较大的语言模型，CoT确实遭受了与ICL相同的后验崩溃。代码可在此https URL上获得。

[NLP-33] Deep Learning and Large Language Models for Audio and Text Analysis in Predicting Suicidal Acts in Chinese Psychological Support Hotlines
[NLP-33] 深度学习和大型语言模型用于音频和文本分析，预测中国心理支持热线中的自杀行为

链接: https://arxiv.org/abs/2409.06164
作者: Yining Chen,Jianqiang Li,Changwei Song,Qing Zhao,Yongsheng Tong,Guanghui Fu
关键词-EN: pressing global issue, effective preventive interventions, global issue, demanding urgent, pressing global
关键词-ZH: 紧迫的全球问题，有效的预防干预措施，全球问题，要求紧急，紧迫的全球性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Suicide is a pressing global issue, demanding urgent and effective preventive interventions. Among the various strategies in place, psychological support hotlines have proven to be a potent intervention method. Approximately two million people in China attempt suicide annually, with many individuals making multiple attempts. Prompt identification and intervention for high-risk individuals are crucial to preventing tragedies. With the rapid advancement of artificial intelligence (AI), especially the development of large-scale language models (LLMs), new technological tools have been introduced to the field of mental health. This study included 1284 subjects, and was designed to validate whether deep learning models and LLMs, using audio and transcribed text from support hotlines, can effectively predict suicide risk. We proposed a simple LLM-based pipeline that first summarizes transcribed text from approximately one hour of speech to extract key features, and then predicts suicidal behaviours in the future. We compared our LLM-based method with the traditional manual scale approach in a clinical setting and with five advanced deep learning models. Surprisingly, the proposed simple LLM pipeline achieved strong performance on a test set of 46 subjects, with an F1 score of 76% when combined with manual scale rating. This is 7% higher than the best speech-based deep learning models and represents a 27.82 percentage point improvement in F1 score compared to using the manual scale approach alone. Our study explores new applications of LLMs and demonstrates their potential for future use in suicide prevention efforts.
摘要：自杀是一个紧迫的全球性问题，迫切需要有效的预防性干预。在现有的各种策略中，心理支持热线已被证明是一种有效的干预方法。中国每年约有200万人试图自杀，其中许多人多次试图自杀。及时识别和干预高危人群是防止悲剧发生的关键。随着人工智能(AI)的快速发展，特别是大规模语言模型(LLM)的发展，心理健康领域引入了新的技术工具。这项研究包括1284名受试者，旨在验证深度学习模型和LLM能否利用来自支持热线的音频和转录文本有效预测自杀风险。我们提出了一个简单的基于LLM的流水线，它首先对大约一个小时语音的转录文本进行总结以提取关键特征，然后预测未来的自杀行为。我们将基于LLM的方法与临床环境中传统的人工量表方法以及五个先进的深度学习模型进行了比较。令人惊讶的是，所提出的简单LLM流水线在包含46名受试者的测试集上取得了强劲的表现，当与人工量表评分相结合时，F1得分为76%。这比最好的基于语音的深度学习模型高出7%，与仅使用人工量表方法相比，F1分数提高了27.82个百分点。我们的研究探索了LLM的新应用，并展示了它们在未来自杀预防工作中的潜力。

[NLP-34] Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn Focus and Review
[NLP-34] 通过LFR教学法加速大型语言模型预训练：学习、聚焦与复习

链接: https://arxiv.org/abs/2409.06131
作者: Neha Prakriya,Jui-Nan Yen,Cho-Jui Hsieh,Jason Cong
关键词-EN: Large Language Model, randomly sampled data, pretraining traditionally relies, Large Language, sampled data blocks
关键词-ZH: 大型语言模型、随机抽样数据、传统依赖的预训练、大型语言、抽样数据块
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) pretraining traditionally relies on autoregressive language modeling on randomly sampled data blocks from web-scale datasets. We take inspiration from human learning techniques like spaced repetition to hypothesize that random data sampling for LLMs leads to high training cost and low quality models which tend to forget data. In order to effectively commit web-scale information to long-term memory, we propose the LFR (Learn, Focus, and Review) pedagogy, a new dynamic training paradigm which focuses and repeatedly reviews complex data blocks at systematic intervals based on the model’s learning pace and progress. LFR records the model perplexities for different data blocks and frequently revisits blocks with higher perplexity which are more likely to be forgotten. We pretrain the GPT-2 models (124M - 1.5B) from scratch on the OpenWebText dataset using LFR. We test on downstream tasks from the language modeling, question answering, translation, and problem solving domains to achieve consistently lower perplexity and higher accuracy than the baseline OpenAI models, while obtaining a 20x pretraining speed-up.
摘要：大型语言模型(LLM)的预训练传统上依赖于在从网络规模数据集中随机抽样的数据块上进行自回归语言建模。我们从间隔重复等人类学习技术中获得灵感，假设LLM的随机数据采样会导致训练成本高、模型质量低且容易遗忘数据。为了有效地将网络规模的信息存入长期记忆，我们提出了LFR(学习、聚焦和复习)教学法，这是一种新的动态训练范式，它基于模型的学习速度和进度，以系统的间隔聚焦并反复复习复杂的数据块。LFR记录模型在不同数据块上的困惑度，并频繁地重新访问困惑度较高、更容易被遗忘的块。我们使用LFR在OpenWebText数据集上从头开始对GPT-2模型(124M-1.5B)进行预训练。我们在语言建模、问答、翻译和问题求解领域的下游任务上进行了测试，取得了比基线OpenAI模型始终更低的困惑度和更高的准确率，同时获得20倍的预训练加速。
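
The LFR abstract above describes recording per-block perplexity and revisiting the blocks with the highest values. A minimal sketch of that selection step follows. The function names, the `keep_fraction` parameter, and the toy log-probabilities are all assumptions made for illustration; the abstract does not specify the paper's actual schedule or thresholds.

```python
import math

def perplexity(token_log_probs):
    """Perplexity is exp of the negative mean token log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def select_blocks_for_review(block_perplexities, keep_fraction=0.5):
    """Return ids of the highest-perplexity blocks (hypothetical rule)."""
    n_keep = max(1, int(len(block_perplexities) * keep_fraction))
    ranked = sorted(block_perplexities, key=block_perplexities.get, reverse=True)
    return ranked[:n_keep]

# Toy per-token log-probabilities standing in for a real model's output.
recorded = {
    "block_a": perplexity([-0.1, -0.2, -0.1]),
    "block_b": perplexity([-2.0, -1.5, -2.5]),
    "block_c": perplexity([-0.9, -1.1, -1.0]),
    "block_d": perplexity([-0.3, -0.4, -0.2]),
}
review_ids = select_blocks_for_review(recorded, keep_fraction=0.5)
```

Here the two hardest blocks (`block_b`, then `block_c`) would be scheduled for another pass, while the easier blocks are skipped in this round.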

[NLP-35] Estimating the Completeness of Discrete Speech Units
[NLP-35] 估计离散语音单元的完整性

链接: https://arxiv.org/abs/2409.06109
作者: Sung-Lin Yeh,Hao Tang
关键词-EN: Representing speech, speech generation, speech codec, discrete units, information
关键词-ZH: 表示语音、语音生成、语音编解码器、离散单元、信息
类目: Computation and Language (cs.CL)
备注: SLT2024

点击查看摘要

Abstract:Representing speech with discrete units has been widely used in speech codec and speech generation. However, there are several unverified claims about self-supervised discrete units, such as disentangling phonetic and speaker information with k-means, or assuming information loss after k-means. In this work, we take an information-theoretic perspective to answer how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement. Our results offer a comprehensive assessment on the choice of discrete units, and suggest that a lot more information in the residual should be mined rather than discarded.
摘要：用离散单元表示语音在语音编解码和语音生成中得到了广泛的应用。然而，关于自监督离散单元存在一些未经证实的说法，例如用k-means可以解耦音素信息和说话人信息，或者假设k-means之后存在信息丢失。在这项工作中，我们从信息论的角度来回答残差矢量量化前后存在多少信息(信息完备性)和有多少信息是可访问的(信息可访问性)。我们给出了信息完备性的一个下界，并估计了残差矢量量化后离散化的HuBERT表示的完备性。我们发现，说话人信息在HuBERT离散单元中充分存在，而音素信息在残差中充分存在，这表明矢量量化并不能实现解耦。我们的结果对离散单元的选择提供了一个全面的评估，并建议应该挖掘而不是丢弃残差中的大量信息。
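
For readers unfamiliar with residual vector quantization, the sketch below shows the generic mechanism the abstract refers to: each stage picks the nearest codeword and passes what is left over (the residual) to the next stage. The codebooks and the input vector are made-up toy values, not HuBERT features, and the helper names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_vector_quantize(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Whatever the chosen codeword fails to capture moves to the next stage.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

# Two toy codebooks: a coarse one and a finer one for the leftover signal.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
codes, residual = residual_vector_quantize([1.25, 0.75], codebooks)
```

After a single stage the residual still holds `[0.25, -0.25]`, which is the kind of leftover information the abstract argues should be mined rather than discarded.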

[NLP-36] Doppelgänger's Watch: A Split Objective Approach to Large Language Models
[NLP-36] Doppelgänger的守望：大型语言模型的分离目标方法

链接: https://arxiv.org/abs/2409.06107
作者: Shervin Ghasemlou,Ashish Katiyar,Aparajita Saraf,Seungwhan Moon,Mangesh Pujari,Pinar Donmez,Babak Damavandi,Anuj Kumar
关键词-EN: large language models, separate supervision signals, underlying language model, core capability, investigate the problem
关键词-ZH: 大型语言模型、单独的监督信号、底层语言模型、核心能力、调查问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we investigate the problem of “generation supervision” in large language models, and present a novel bicameral architecture to separate supervision signals from their core capability, helpfulness. Doppelgänger, a new module parallel to the underlying language model, supervises the generation of each token, and learns to concurrently predict the supervision score(s) of the sequences up to and including each token. In this work, we present the theoretical findings, and leave the report on experimental results to a forthcoming publication.
摘要：本文研究了大型语言模型中的“生成监督”问题，并提出了一种新颖的两院制架构，将监督信号与其核心能力（帮助性）分开。Doppelgänger是一个与底层语言模型并行的新模块，它监督每个令牌的生成，并学习并发预测直到每个令牌的序列的监督分数。在这项工作中，我们介绍了理论发现，并将实验结果报告留给即将出版的出版物。

[NLP-37] ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog
[NLP-37] ClarQ-LLM：在面向任务的对话中澄清和请求信息的模型基准

链接: https://arxiv.org/abs/2409.06097
作者: Yujian Gan,Changling Li,Jinxia Xie,Luou Wen,Matthew Purver,Massimo Poesio
关键词-EN: evaluation framework consisting, bilingual English-Chinese conversation, English-Chinese conversation tasks, assessing agents’ ability, evaluation metrics
关键词-ZH: 评估框架，包括双语对话、中英对话任务、评估代理人能力、评估指标
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce ClarQ-LLM, an evaluation framework consisting of bilingual English-Chinese conversation tasks, conversational agents and evaluation metrics, designed to serve as a strong benchmark for assessing agents’ ability to ask clarification questions in task-oriented dialogues. The benchmark includes 31 different task types, each with 10 unique dialogue scenarios between information seeker and provider agents. The scenarios require the seeker to ask questions to resolve uncertainty and gather necessary information to complete tasks. Unlike traditional benchmarks that evaluate agents based on fixed dialogue content, ClarQ-LLM includes a provider conversational agent to replicate the original human provider in the benchmark. This allows both current and future seeker agents to test their ability to complete information gathering tasks through dialogue by directly interacting with our provider agent. In tests, LLAMA3.1 405B seeker agent managed a maximum success rate of only 60.05%, showing that ClarQ-LLM presents a strong challenge for future research.
摘要：我们介绍了ClarQ-LLM，这是一个由英汉双语对话任务、对话代理和评估指标组成的评估框架，旨在作为评估代理在任务型对话中提出澄清问题能力的有力基准。该基准包括31种不同的任务类型，每种类型都有10个独特的信息寻求者代理与提供者代理之间的对话场景。这些场景要求寻求者提出问题来解决不确定性，并收集必要的信息来完成任务。与基于固定对话内容评估代理的传统基准不同，ClarQ-LLM包括一个提供者对话代理，用于在基准中复现原始的人类提供者。这使得当前和未来的寻求者代理都可以通过与我们的提供者代理直接交互，来测试其通过对话完成信息收集任务的能力。在测试中，LLAMA3.1 405B寻求者代理的最高成功率仅为60.05%，这表明ClarQ-LLM对未来的研究提出了强有力的挑战。

[NLP-38] DetoxBench: Benchmarking Large Language Models for Multitask Fraud Abuse Detection
[NLP-38] DetoxBench：针对多任务欺诈滥用检测的大型语言模型进行基准测试

链接: https://arxiv.org/abs/2409.06072
作者: Joymallya Chakraborty,Wei Xia,Anirban Majumder,Dan Ma,Walid Chaabene,Naveed Janvekar
关键词-EN: Large language models, demonstrated remarkable capabilities, natural language processing, Large language, demonstrated remarkable
关键词-ZH: 大型语言模型，表现出非凡的能力，自然语言处理，大型语言，表现出非凡的能力
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks. However, their practical application in high-stake domains, such as fraud and abuse detection, remains an area that requires further exploration. The existing applications often narrowly focus on specific tasks like toxicity or hate speech detection. In this paper, we present a comprehensive benchmark suite designed to assess the performance of LLMs in identifying and mitigating fraudulent and abusive language across various real-world scenarios. Our benchmark encompasses a diverse set of tasks, including detecting spam emails, hate speech, misogynistic language, and more. We evaluated several state-of-the-art LLMs, including models from Anthropic, Mistral AI, and the AI21 family, to provide a comprehensive assessment of their capabilities in this critical domain. The results indicate that while LLMs exhibit proficient baseline performance in individual fraud and abuse detection tasks, their performance varies considerably across tasks, particularly struggling with tasks that demand nuanced pragmatic reasoning, such as identifying diverse forms of misogynistic language. These findings have important implications for the responsible development and deployment of LLMs in high-risk applications. Our benchmark suite can serve as a tool for researchers and practitioners to systematically evaluate LLMs for multi-task fraud detection and drive the creation of more robust, trustworthy, and ethically-aligned systems for fraud and abuse detection.
摘要：大型语言模型(LLM)在自然语言处理任务中表现出了卓越的能力。然而，它们在欺诈和滥用检测等高风险领域的实际应用仍然是一个需要进一步探索的领域。现有的应用往往狭隘地专注于特定的任务，如毒性或仇恨言论检测。在本文中，我们提出了一个全面的基准测试套件，旨在评估LLM在各种真实世界场景中识别和减少欺诈性和滥用性语言的性能。我们的基准包括一系列不同的任务，包括检测垃圾电子邮件、仇恨言论、厌女语言等。我们评估了几个最先进的LLM，包括来自Anthropic、Mistral AI和AI21系列的模型，以全面评估它们在这一关键领域的能力。结果表明，尽管LLM在单项欺诈和滥用检测任务中表现出良好的基线性能，但其性能在不同任务之间差异很大，尤其难以应对需要细致语用推理的任务，例如识别各种形式的厌女语言。这些发现对于在高风险应用中负责任地开发和部署LLM具有重要意义。我们的基准套件可以作为研究人员和从业者的工具，用于系统地评估LLM的多任务欺诈检测能力，并推动创建更稳健、更值得信赖和符合伦理的欺诈和滥用检测系统。

[NLP-39] Identifying the sources of ideological bias in GPT models through linguistic variation in output
[NLP-39] 通过输出的语言差异识别GPT模型中意识形态偏见的来源

链接: https://arxiv.org/abs/2409.06043
作者: Christina Walker,Joan C. Timoneda
关键词-EN: Extant work shows, perpetuate social stereotypes, Extant work, perpetuate social, work shows
关键词-ZH: 现有研究表明，延续社会刻板印象，现有研究，延续社会，研究表明
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Extant work shows that generative AI models such as GPT-3.5 and 4 perpetuate social stereotypes and biases. One concerning but less explored source of bias is ideology. Do GPT models take ideological stances on politically sensitive topics? In this article, we provide an original approach to identifying ideological bias in generative models, showing that bias can stem from both the training data and the filtering algorithm. We leverage linguistic variation in countries with contrasting political attitudes to evaluate bias in average GPT responses to sensitive political topics in those languages. First, we find that GPT output is more conservative in languages that map well onto conservative societies (i.e., Polish), and more liberal in languages used uniquely in liberal societies (i.e., Swedish). This result provides strong evidence of training data bias in GPT models. Second, differences across languages observed in GPT-3.5 persist in GPT-4, even though GPT-4 is significantly more liberal due to OpenAI’s filtering policy. Our main takeaway is that generative model training must focus on high-quality, curated datasets to reduce bias, even if it entails a compromise in training data size. Filtering responses after training only introduces new biases and does not remove the underlying training biases.
摘要：现有的研究表明，GPT-3.5和4等生成性人工智能模型会使社会刻板印象和偏见永久化。一个令人担忧但较少被探讨的偏见来源是意识形态。GPT模型在政治敏感话题上采取意识形态立场吗？在这篇文章中，我们提供了一种新颖的方法来识别生成模型中的意识形态偏差，表明偏差可以源于训练数据和过滤算法。我们利用政治态度不同的国家的语言差异来评估GPT对这些语言敏感政治话题的平均回答的偏见。首先，我们发现，GPT输出在映射到保守社会的语言(即波兰语)中更保守，在自由社会中唯一使用的语言(即瑞典语)中更自由。这一结果为GPT模型中的训练数据偏差提供了强有力的证据。其次，在GPT-3.5中观察到的跨语言差异在GPT-4中仍然存在，尽管由于OpenAI的过滤策略，GPT-4明显更加自由。我们的主要结论是，生成性模型训练必须专注于高质量的精选数据集，以减少偏见，即使它需要在训练数据大小方面做出妥协。训练后过滤回答只会带来新的偏差，而不会消除潜在的训练偏差。

[NLP-40] Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features
[NLP-40] 调查因果线索：利用人类可辨别的语言特征加强欺骗音频检测

链接: https://arxiv.org/abs/2409.06033
作者: Zahra Khanjani,Tolulope Ale,Jianwu Wang,Lavon Davis,Christine Mallinson,Vandana P. Janeja
关键词-EN: created societal challenges, spoofed audio, Defined Linguistic Features, Expert Defined Linguistic, discern spoofed audio
关键词-ZH: 创造了社会挑战、欺骗音频、定义的语言特征、专家定义的语言、辨别欺骗音频
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity. Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with Expert Defined Linguistic Features (EDLFs) that can be discerned by the human ear: pitch, pause, word-initial and word-final release bursts of consonant stops, audible intake or outtake of breath, and overall audio quality. It is established that there is an improvement in several deepfake detection algorithms when they augmented the traditional and common features of audio data with these EDLFs. In this paper, using a hybrid dataset comprised of multiple types of spoofed audio augmented with sociolinguistic annotations, we investigate causal discovery and inferences between the discernible linguistic features and the label in the audio clips, comparing the findings of the causal models with the expert ground truth validation labeling process. Our findings suggest that the causal models indicate the utility of incorporating linguistic features to help discern spoofed audio, as well as the overall need and opportunity to incorporate human knowledge into models and techniques for strengthening AI models. The causal discovery and inference can be used as a foundation of training humans to discern spoofed audio as well as automating EDLFs labeling for the purpose of performance improvement of the common AI-based spoofed audio detectors.
摘要：模仿、重放攻击和深度伪造等几种类型的欺骗音频给信息完整性带来了社会挑战。最近，研究人员与社会语言学专家合作，用人耳可以辨别的专家定义语言特征(EDLF)来标注欺骗音频样本，这些特征包括：音高、停顿、塞辅音的词首和词尾除阻爆破、可听见的吸气或呼气，以及整体音频质量。已有研究表明，在音频数据的传统常用特征之外加入这些EDLF后，几种深度伪造检测算法的性能有所提升。在本文中，我们使用一个由多种类型的欺骗音频组成并附有社会语言学标注的混合数据集，研究可辨别的语言特征与音频片段标签之间的因果发现和推断，并将因果模型的结果与专家真值验证标注过程进行比较。我们的发现表明，因果模型显示了结合语言特征来帮助辨别欺骗音频的效用，以及将人类知识纳入模型和技术以加强人工智能模型的总体需求和机会。这种因果发现和推断可以作为训练人类辨别欺骗音频以及自动化EDLF标注的基础，目的是提高常见的基于人工智能的欺骗音频检测器的性能。

[NLP-41] Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings
[NLP-41] 在真实低资源环境中改进视觉提示关键词定位

链接: https://arxiv.org/abs/2409.06013
作者: Leanne Nortje,Dan Oneata,Herman Kamper
关键词-EN: prompted keyword localisation, visually prompted keyword, keyword localisation, aims to find, prompted keyword
关键词-ZH: 提示关键词定位，视觉提示关键词，关键词定位，旨在查找，提示关键词
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experiments were done on English. Moreover, transcriptions were used to get positive and negative pairs for the contrastive loss. This paper introduces a few-shot learning scheme to mine pairs automatically without transcriptions. On English, this results in only a small drop in performance. We also - for the first time - consider VPKL on a real low-resource language, Yoruba. While scores are reasonable, here we see a bigger drop in performance compared to using ground truth pairs because the mining is less accurate in Yoruba.
摘要：给定图像查询，视觉提示关键词定位（VPKL）旨在查找语音集合中所描绘单词的出现位置。当低资源语言没有转录文本（例如该语言没有书面文字）时，这可能很有用。之前的工作表明，VPKL可以使用在配对图像和未标注语音上训练的视觉基础语音模型来执行。但所有实验都是在英语上完成的。此外，之前的工作使用转录文本来获得对比损失所需的正样本对和负样本对。本文引入了一种少样本学习方案，可以在无需转录文本的情况下自动挖掘样本对。在英语上，这只会导致性能略有下降。我们还首次在真正的低资源语言约鲁巴语上考虑VPKL。虽然分数是合理的，但与使用真值样本对相比，我们看到性能下降更大，因为在约鲁巴语上的挖掘不太准确。

[NLP-42] TransformerRanker: A Tool for Efficiently Finding the Best-Suited Language Models for Downstream Classification Tasks
[NLP-42] TransformerRanker：一种用于有效寻找最适合下游分类任务的语言模型的工具

链接: https://arxiv.org/abs/2409.05997
作者: Lukas Garbas,Max Ploner,Alan Akbik
关键词-EN: pre-trained language model, NLP are typically, model hub, language model, typically addressed
关键词-ZH: 预训练的语言模型，NLP通常是模型中心，语言模型，通常被解决
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Classification tasks in NLP are typically addressed by selecting a pre-trained language model (PLM) from a model hub, and fine-tuning it for the task at hand. However, given the very large number of PLMs that are currently available, a practical challenge is to determine which of them will perform best for a specific downstream task. With this paper, we introduce TransformerRanker, a lightweight library that efficiently ranks PLMs for classification tasks without the need for computationally costly fine-tuning. Our library implements current approaches for transferability estimation (LogME, H-Score, kNN), in combination with layer aggregation options, which we empirically showed to yield state-of-the-art rankings of PLMs (Garbas et al., 2024). We designed the interface to be lightweight and easy to use, allowing users to directly connect to the HuggingFace Transformers and Dataset libraries. Users need only select a downstream classification task and a list of PLMs to create a ranking of likely best-suited PLMs for their task. We make TransformerRanker available as a pip-installable open-source library this https URL.
摘要：自然语言处理中的分类任务通常是通过从模型中心选择一个预训练语言模型(PLM)，并针对手头的任务进行微调来解决的。然而，鉴于目前可用的PLM数量非常多，一个实际的挑战是确定其中哪一个在特定的下游任务上表现最好。在本文中，我们介绍了TransformerRanker，这是一个轻量级的库，它可以高效地为分类任务对PLM进行排序，而不需要计算代价高昂的微调。我们的库实现了当前的可迁移性估计方法(LogME、H-Score、kNN)，并结合层聚合选项，我们已通过实验表明这些方法能产生最先进的PLM排名(Garbas等人，2024年)。我们将接口设计为轻量级且易于使用，允许用户直接连接到HuggingFace Transformers和Dataset库。用户只需选择下游分类任务和PLM列表，即可为其任务创建可能最适合的PLM的排名。我们将TransformerRanker作为可通过pip安装的开源库发布，见此https URL。
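
The abstract lists kNN among the transferability estimators. As a rough illustration of the general idea only (this is not the TransformerRanker API, and every name below is hypothetical), one can score each candidate model by leave-one-out nearest-neighbour accuracy on its frozen embeddings and rank the models by that score. The embeddings here are toy 2-D points standing in for real model outputs.

```python
from collections import Counter

def knn_transferability(embeddings, labels, k=1):
    """Leave-one-out k-NN accuracy on frozen embeddings (no fine-tuning)."""
    correct = 0
    for i, query in enumerate(embeddings):
        dists = []
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # never let a point vote for itself
            d = sum((a - b) ** 2 for a, b in zip(query, other))
            dists.append((d, j))
        dists.sort()
        votes = Counter(labels[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(embeddings)

labels = [0, 0, 1, 1]
# Toy embeddings from two imaginary models for the same four examples.
model_a = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0]]  # classes well separated
model_b = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [1.1, 1.1]]  # classes interleaved
scores = {
    "model_a": knn_transferability(model_a, labels),
    "model_b": knn_transferability(model_b, labels),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A model whose embeddings already cluster by class scores higher, which is the intuition behind ranking models without fine-tuning them.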

[NLP-43] MessIRve: A Large-Scale Spanish Information Retrieval Dataset
[NLP-43] MessIRve：大规模西班牙语信息检索数据集

链接: https://arxiv.org/abs/2409.05994
作者: Francisco Valentini,Viviana Cotik,Damián Furman,Ivan Bercovich,Edgar Altszyler,Juan Manuel Pérez
关键词-EN: user query, finding relevant documents, Spanish, task of finding, Information retrieval
关键词-ZH: 用户查询、查找相关文档、西班牙语、查找任务、信息检索
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
摘要：信息检索（IR）是响应用户查询查找相关文档的任务。尽管西班牙语是母语使用人数第二多的语言，但当前的IR基准缺乏西班牙语数据，阻碍了面向西班牙语使用者的信息访问工具的开发。我们介绍MessIRve，这是一个大规模西班牙语IR数据集，包含来自Google自动完成API的约73万个查询以及来自维基百科的相关文档。MessIRve的查询反映了不同的西班牙语地区，这与其他从英语翻译而来或不考虑方言差异的数据集不同。与较小的数据集不同，该数据集的大规模使其能够涵盖广泛的主题。我们提供了数据集的全面描述、与现有数据集的比较以及主流IR模型的基线评估。我们的贡献旨在推进西班牙语IR研究并改善西班牙语使用者的信息获取。

[NLP-44] AI for Mathematics Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean4
[NLP-44] 数学人工智能Lean 4中不同领域的数学形式化问题求解和定理证明

链接: https://arxiv.org/abs/2409.05977
作者: Xichen Tang
关键词-EN: verifiable formal languages, significant impact, prove mathematical theorems, computerized verifiable formal, Large Language Models
关键词-ZH: 可验证形式语言，重大影响，证明数学定理，计算机化可验证形式，大型语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Using computerized verifiable formal languages like Lean 4 to prove mathematical theorems has a significant impact on mathematical formalization. Lean 4 offers prominent potential for advancing mathematical reasoning. However, existing efforts are limited to mathematical formalization languages in substantial online corpora and are dedicated to keeping pace with rapidly evolving languages. To bridge the gap between the traditional and computerized proof, my approach to formalizing theorem proving involves generating formal steps and complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. The method is to introduce the basic structure and tactics in general, determine how AI can assist the mathematical formalization process to improve its performance, and give examples of solving problems in Lean 4 comparing to NL, mainly in IMO, and a sample theorem proving in abstract algebra.
摘要：使用像Lean 4这样的计算机可验证形式语言来证明数学定理对数学形式化有显著影响。Lean 4为推进数学推理提供了巨大的潜力。然而，现有的工作仅限于拥有大量在线语料的数学形式化语言，并致力于跟上快速发展的语言的步伐。为了弥合传统证明和计算机证明之间的差距，我形式化定理证明的方法包括使用大型语言模型（LLM）基于自然语言（NL）证明生成形式化步骤和完整证明。该方法是总体介绍基本结构和策略，确定人工智能如何协助数学形式化过程以提高其性能，并给出在Lean 4中解决问题并与NL对比的例子（主要是IMO题目），以及抽象代数中的一个示例定理证明。

[NLP-45] A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets
[NLP-45] NLP小额索赔法庭：用小数据集判断法律文本分类策略

链接: https://arxiv.org/abs/2409.05972
作者: Mariana Yukari Noguti,Edduardo Vellasques,Luiz Eduardo Soares Oliveira
关键词-EN: Recent advances, text classification tasks, labelled data, modelling has significantly, significantly decreased
关键词-ZH: 最近的进展、文本分类任务、标签数据、建模都显着减少
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in language modelling have significantly decreased the need for labelled data in text classification tasks. Transformer-based models, pre-trained on unlabeled data, can outmatch the performance of models trained from scratch for each task. However, the amount of labelled data needed to fine-tune this type of model is still considerably high for domains requiring expert-level annotators, like the legal domain. This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data and performs a classification task in the legal area with 50 predefined topics. More specifically, we use the records of demands to a Brazilian Public Prosecutor’s Office aiming to assign the descriptions to one of the subjects, which currently demands deep legal knowledge for manual filling. The task of optimizing the performance of classifiers in this scenario is especially challenging, given the low amount of resources available regarding the Portuguese language, especially in the legal domain. Our results demonstrate that classic supervised models such as logistic regression and SVM and the ensembles random forest and gradient boosting achieve better performance along with embeddings extracted with word2vec when compared to the BERT language model. The latter demonstrates superior performance in association with the architecture of the model itself as a classifier, having surpassed all previous models in that regard. The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and strategies of semi-supervised learning, with an accuracy of 80.7% in the aforementioned task.
摘要：语言建模的最新进展大大减少了文本分类任务中对标注数据的需求。基于Transformer的模型在未标注的数据上进行预训练，其性能可以超过针对每项任务从头开始训练的模型。然而，对于需要专家级标注者的领域，如法律领域，微调这类模型所需的标注数据量仍然相当高。本文研究了优化使用少量已标注数据和大量未标注数据的最佳策略，并在法律领域执行了一项具有50个预定义主题的分类任务。更具体地说，我们使用向巴西某检察官办公室提交的诉求记录，目的是将其描述归入其中一个主题，目前这项工作需要深厚的法律知识才能手动填写。鉴于葡萄牙语的可用资源很少，特别是在法律领域，在这种情况下优化分类器性能的任务尤其具有挑战性。我们的结果表明，与BERT语言模型相比，逻辑回归和SVM等经典监督模型以及随机森林和梯度提升等集成模型在配合word2vec提取的嵌入时取得了更好的性能。而BERT在将模型自身架构用作分类器时表现出更优的性能，在这方面超过了之前的所有模型。最好的结果是通过无监督数据增强(UDA)获得的，它联合使用了BERT、数据增强和半监督学习策略，在上述任务中的准确率为80.7%。
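
The abstract credits its best result to UDA, which combines a supervised loss on the small labelled set with a consistency term on unlabelled data and its augmented copies. The sketch below shows that general loss shape on hand-written probability vectors. The function names, the weight `lam`, and the numbers are illustrative assumptions; the authors' exact configuration is not given in the abstract.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def semi_supervised_loss(labeled, unlabeled_pairs, lam=1.0):
    """Supervised cross-entropy plus a weighted consistency term.

    labeled: list of (predicted_probs, true_label)
    unlabeled_pairs: list of (probs_on_original, probs_on_augmented)
    """
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    consistency = sum(kl_divergence(p, q) for p, q in unlabeled_pairs) / len(unlabeled_pairs)
    return supervised + lam * consistency

labeled = [([0.5, 0.5], 0)]
consistent = [([0.8, 0.2], [0.8, 0.2])]    # prediction survives augmentation
inconsistent = [([0.8, 0.2], [0.2, 0.8])]  # prediction flips under augmentation
loss_consistent = semi_supervised_loss(labeled, consistent)
loss_inconsistent = semi_supervised_loss(labeled, inconsistent)
```

The consistency term is zero when the model predicts the same distribution for an unlabelled example and its augmented version, so only unstable predictions are penalised.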

[NLP-46] Assessing SPARQL capabilities of Large Language Models
[NLP-46] 评估大型语言模型的SPARQL能力

链接: https://arxiv.org/abs/2409.05925
作者: Lars-Peter Meyer,Johannes Frey,Felix Brei,Natanael Arndt
关键词-EN: Large Language Models, offers significant synergistic, significant synergistic potential, SPARQL SELECT queries, Large Language
关键词-ZH: 大型语言模型，提供显着的协同、显着的协同潜力、SPARQL SELECT查询、大型语言
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: peer reviewed publication at NLP4KGc @ Semantics 2024, see this https URL

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with Knowledge Graphs (KGs) offers significant synergistic potential for knowledge-driven applications. One possible integration is the interpretation and generation of formal languages, such as those used in the Semantic Web, with SPARQL being a core technology for accessing KGs. In this paper, we focus on measuring out-of-the box capabilities of LLMs to work with SPARQL and more specifically with SPARQL SELECT queries applying a quantitative approach. We implemented various benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation with several LLMs. The tasks assess capabilities along the dimensions of syntax, semantic read, semantic create, and the role of knowledge graph prompt inclusion. With this new benchmarking tasks, we evaluated a selection of GPT, Gemini, and Claude models. Our findings indicate that working with SPARQL SELECT queries is still challenging for LLMs and heavily depends on the specific LLM as well as the complexity of the task. While fixing basic syntax errors seems to pose no problems for the best of the current LLMs evaluated, creating semantically correct SPARQL SELECT queries is difficult in several cases.
摘要：大型语言模型(LLM)与知识图(KG)的集成为知识驱动的应用提供了巨大的协同潜力。一种可能的集成是形式语言的解释和生成，例如语义Web中使用的语言，其中SPARQL是访问KGS的核心技术。在本文中，我们将重点测量LLM的开箱即用能力，以使用SPARQL，更具体地说，使用量化方法使用SPARQL SELECT查询。我们在LLM-KG-BASE框架中实现了各种基准测试任务，以实现自动化执行和使用多个LLM进行评估。这些任务根据句法、语义阅读、语义创建和知识图谱提示包含的角色来评估能力。通过这个新的基准任务，我们评估了精选的GPT、Gemini和Claude模型。我们的发现表明，使用SPARQL SELECT查询对于LLM来说仍然是具有挑战性的，并且在很大程度上取决于特定的LLM以及任务的复杂性。尽管修复基本的语法错误对于当前评估的LLM来说似乎不会带来问题，但在某些情况下创建语义正确的SPARQL SELECT查询是困难的。评论：同行评议发表在NLP4KGc@Semantics 2024年，请参阅以下HTTPS URL主题：数据库(cs.DB)；人工智能(cs.AI)；计算和语言(cs.CL)；信息检索(cs.IR)引用为：arxiv：2409.05925cs.db https://doi.org/10.48550/arXiv.2409.05925 Focus通过DataCite了解更多arxiv发布的文档

[NLP-47] Programming Refusal with Conditional Activation Steering

Link: https://arxiv.org/abs/2409.05907
Authors: Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar
Keywords: shown remarkable capabilities, behavior remains challenging, remarkable capabilities, remains challenging, activation steering
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:LLMs have shown remarkable capabilities, but precisely controlling their response behavior remains challenging. Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. In this paper, we propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Our method is based on the observation that different categories of prompts activate distinct patterns in the model’s hidden states. Using CAST, one can systematically control LLM behavior with rules like “if input is about hate speech or adult content, then refuse” or “if input is not about legal advice, then refuse.” This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization. We release an open-source implementation of our framework.
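
A rule such as "if input is about X, then refuse" can be pictured as a gate on the hidden state. The following is a minimal sketch and not the authors' released implementation: it assumes the condition is detected by cosine similarity between a hidden-state vector and a hypothetical `condition_vec`, and that steering means adding a hypothetical `steer_vec`. The threshold and strength values are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def conditional_steer(hidden, condition_vec, steer_vec, threshold=0.5, strength=1.0):
    """Add steer_vec to hidden only when hidden aligns with condition_vec.

    All names and default values here are hypothetical placeholders.
    """
    if cosine(hidden, condition_vec) > threshold:
        # Condition met: apply the steering direction.
        return [h + strength * s for h, s in zip(hidden, steer_vec)]
    # Condition not met: leave the activation untouched.
    return list(hidden)
```

Because the gate is evaluated per input, prompts that do not match the condition pass through unchanged, which is the selective behaviour the abstract describes.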

[NLP-48] Generative User-Experience Research for Developing Domain-specific Natural Language Processing Applications

Link: https://arxiv.org/abs/2306.16143
Authors: Anastasia Zhukova, Lukas von Sperl, Christian E. Matt, Bela Gipp
Keywords: human-computer interaction, increasing intuitiveness, part of human-computer, NLP, focuses on increasing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:User experience (UX) is a part of human-computer interaction (HCI) research and focuses on increasing intuitiveness, transparency, simplicity, and trust for the system users. Most UX research for machine learning (ML) or natural language processing (NLP) focuses on a data-driven methodology. It engages domain users mainly for usability evaluation. Moreover, more typical UX methods tailor the systems towards user usability rather than learning about the user needs first. This paper proposes a new methodology for integrating generative UX research into developing domain NLP applications. Generative UX research employs domain users at the initial stages of prototype development, i.e., ideation and concept evaluation, and the last stage for evaluating system usefulness and user utility. The methodology emerged from and is evaluated on a case study about the full-cycle prototype development of a domain-specific semantic search for daily operations in the process industry. A key finding of our case study is that involving domain experts increases their interest and trust in the final NLP application. The combined UX+NLP research of the proposed method efficiently considers data- and user-driven opportunities and constraints, which can be crucial for developing NLP applications.

[NLP-49] Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

Link: https://arxiv.org/abs/2409.06656
Authors: Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
Keywords: unconventional objectives compared, compared to existing, Sort Loss, diarization, PIL
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Abstract:We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
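
The contrast between a permutation invariant loss and a sorted objective can be sketched on toy data. This is an illustration under stated assumptions, not the Sortformer code: targets are frame-by-speaker 0/1 activity matrices, the loss is mean binary cross-entropy, and the sort key is assumed to be each speaker's first active frame (the abstract only says the objectives are "sorted"). All function names are hypothetical.

```python
import math
from itertools import permutations


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two frames-by-speakers matrices."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def permute_columns(matrix, order):
    """Reorder the speaker columns of a matrix."""
    return [[row[j] for j in order] for row in matrix]


def permutation_invariant_loss(pred, target):
    """PIL: try every speaker ordering and keep the lowest loss."""
    n_spk = len(target[0])
    return min(bce(pred, permute_columns(target, order))
               for order in permutations(range(n_spk)))


def arrival_order(target):
    """Speaker indices sorted by first active frame (assumed sort key)."""
    n_frames, n_spk = len(target), len(target[0])

    def first_active(j):
        for t in range(n_frames):
            if target[t][j] > 0.5:
                return t
        return n_frames  # never-active speakers go last

    return sorted(range(n_spk), key=first_active)


def sort_loss(pred, target):
    """Sorted objective: compare against one fixed, sorted speaker ordering."""
    return bce(pred, permute_columns(target, arrival_order(target)))
```

The sorted variant evaluates a single ordering instead of searching all of them, so the model itself has to learn which output slot each speaker belongs to.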

Artificial Intelligence

[AI-0] Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2409.06702
Authors: Kairui Ding, Boyuan Chen, Yuchen Su, Huan-ang Gao, Bu Jin, Chonghao Sima, Wuqiang Zhang, Xiaohui Li, Paul Barsch, Hongyang Li, Hao Zhao
Keywords: impeding human-AI trust, architectures in autonomous, face a significant, impeding human-AI, human-AI trust
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CoRL 2024, Project Page: this https URL

Abstract:End-to-end architectures in autonomous driving (AD) face a significant challenge in interpretability, impeding human-AI trust. Human-friendly natural language has been explored for tasks such as driving explanation and 3D captioning. However, previous works primarily focused on the paradigm of declarative interpretability, where the natural language interpretations are not grounded in the intermediate outputs of AD systems, making the interpretations only declarative. In contrast, aligned interpretability establishes a connection between language and the intermediate outputs of AD systems. Here we introduce Hint-AD, an integrated AD-language system that generates language aligned with the holistic perception-prediction-planning outputs of the AD model. By incorporating the intermediate outputs and a holistic token mixer sub-network for effective feature adaptation, Hint-AD achieves desirable accuracy, achieving state-of-the-art results in driving language tasks including driving explanation, 3D dense captioning, and command prediction. To facilitate further study on driving explanation task on nuScenes, we also introduce a human-labeled dataset, Nu-X. Codes, dataset, and models will be publicly available.

[AI-1] HybridFC: A Hybrid Fact-Checking Approach for Knowledge Graphs

Link: https://arxiv.org/abs/2409.06692
Authors: Umair Qudus, Michael Roeder, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo
Keywords: knowledge graphs, aim to predict, predict the veracity, veracity of assertions, fact-checking approaches
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)

Abstract:We consider fact-checking approaches that aim to predict the veracity of assertions in knowledge graphs. Five main categories of fact-checking approaches for knowledge graphs have been proposed in the recent literature, of which each is subject to partially overlapping limitations. In particular, current text-based approaches are limited by manual feature engineering. Path-based and rule-based approaches are limited by their exclusive use of knowledge graphs as background knowledge, and embedding-based approaches suffer from low accuracy scores on current fact-checking tasks. We propose a hybrid approach – dubbed HybridFC – that exploits the diversity of existing categories of fact-checking approaches within an ensemble learning setting to achieve a significantly better prediction performance. In particular, our approach outperforms the state of the art by 0.14 to 0.27 in terms of Area Under the Receiver Operating Characteristic curve on the FactBench dataset. Our code is open-source and can be found at this https URL.

[AI-2] Geometric-Averaged Preference Optimization for Soft Preference Labels

Link: https://arxiv.org/abs/2409.06691
Authors: Hiroki Furuta, Kuang-Huei Lee, Shixiang Shane Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
Keywords: human preferences assume, human preferences, Direct Preference Optimization, soft preference labels, algorithms for aligning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[AI-3] Benchmarking Sub-Genre Classification For Mainstage Dance Music ICASSP2025

Link: https://arxiv.org/abs/2409.06690
Authors: Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
Keywords: music information retrieval, information retrieval, wide range, prominent tasks, Music
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Submitted to ICASSP 2025

Abstract:Music classification, with a wide range of applications, is one of the most prominent tasks in music information retrieval. To address the absence of comprehensive datasets and high-performing methods in the classification of mainstage dance music, this work introduces a novel benchmark comprising a new dataset and a baseline. Our dataset extends the number of sub-genres to cover the most recent mainstage live sets by top DJs worldwide in music festivals. A continuous soft labeling approach is employed to account for tracks that span multiple sub-genres, preserving the inherent sophistication. For the baseline, we developed deep learning models that outperform current state-of-the-art multimodal language models, which struggle to identify house music sub-genres, emphasizing the need for specialized models trained on fine-grained datasets. Our benchmark can serve application scenarios such as music recommendation, DJ set curation, and interactive multimedia, for which we also provide video demos. Our code is available at https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/.

[AI-4] Liability and Insurance for Catastrophic Losses: the Nuclear Power Precedent and Lessons for AI ICML2024

Link: https://arxiv.org/abs/2409.06673
Authors: Cristian Trout
Keywords: causing catastrophic losses, potentially causing catastrophic, caused catastrophic losses, catastrophic losses, autonomous and capable
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to Generative AI and Law Workshop at the International Conference on Machine Learning (ICML 2024)

Abstract:As AI systems become more autonomous and capable, experts warn of them potentially causing catastrophic losses. Drawing on the successful precedent set by the nuclear power industry, this paper argues that developers of frontier AI models should be assigned limited, strict, and exclusive third party liability for harms resulting from Critical AI Occurrences (CAIOs) - events that cause or easily could have caused catastrophic losses. Mandatory insurance for CAIO liability is recommended to overcome developers’ judgment-proofness, mitigate winner’s curse dynamics, and leverage insurers’ quasi-regulatory abilities. Based on theoretical arguments and observations from the analogous nuclear power context, insurers are expected to engage in a mix of causal risk-modeling, monitoring, lobbying for stricter regulation, and providing loss prevention guidance in the context of insuring against heavy-tail risks from AI. While not a substitute for regulation, clear liability assignment and mandatory insurance can help efficiently allocate resources to risk-modeling and safe design, facilitating future regulatory efforts.

[AI-5] Insuring Uninsurable Risks from AI: The State as Insurer of Last Resort ICML2024

Link: https://arxiv.org/abs/2409.06672
Authors: Cristian Trout
Keywords: pose uninsurable risks, systems will sooner, pose uninsurable, including existential risks, Quadratic Financing
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to Generative AI and Law Workshop at the International Conference on Machine Learning (ICML 2024)

Abstract:Many experts believe that AI systems will sooner or later pose uninsurable risks, including existential risks. This creates an extreme judgment-proof problem: few if any parties can be held accountable ex post in the event of such a catastrophe. This paper proposes a novel solution: a government-provided, mandatory indemnification program for AI developers. The program uses risk-priced indemnity fees to induce socially optimal levels of care. Risk-estimates are determined by surveying experts, including indemnified developers. The Bayesian Truth Serum mechanism is employed to incent honest and effortful responses. Compared to alternatives, this approach arguably better leverages all private information, and provides a clearer signal to indemnified developers regarding what risks they must mitigate to lower their fees. It’s recommended that collected fees be used to help fund the safety research developers need, employing a fund matching mechanism (Quadratic Financing) to induce an optimal supply of this public good. Under Quadratic Financing, safety research projects would compete for private contributions from developers, signaling how much each is to be supplemented with public funds.
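
To make the fund matching mechanism concrete, here is a minimal sketch of the standard quadratic financing rule, under which a project's total funding is the square of the sum of the square roots of its individual contributions. Whether the paper uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def quadratic_funding(contributions):
    """Return (total_funding, public_subsidy) under the standard QF rule.

    total_funding = (sum of sqrt(c_i)) ** 2
    public_subsidy = total_funding - sum of c_i
    """
    if any(c < 0 for c in contributions):
        raise ValueError("contributions must be non-negative")
    total = sum(math.sqrt(c) for c in contributions) ** 2
    private = sum(contributions)
    return total, total - private
```

Under this rule, four developers contributing 1 unit each attract a larger public top-up than a single developer contributing 4 units, which is how broad private support signals where public funds should go.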

[AI-6] LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Link: https://arxiv.org/abs/2409.06666
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
Keywords: significantly enhancing user, enhancing user experience, enable real-time interaction, traditional text-based interaction, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint. Project: this https URL

[AI-7] World-Grounded Human Motion Recovery via Gravity-View Coordinates SIGGRAPH

Link: https://arxiv.org/abs/2409.06662
Authors: Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, Xiaowei Zhou
Keywords: world coordinate system, recovering world-grounded human, coordinate system, world coordinate, monocular video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at SIGGRAPH Asia 2024 (Conference Track). Project page: this https URL

Abstract:We present a novel method for recovering world-grounded human motion from monocular video. The main challenge lies in the ambiguity of defining the world coordinate system, which varies between sequences. Previous approaches attempt to alleviate this issue by predicting relative motion in an autoregressive manner, but are prone to accumulating errors. Instead, we propose estimating human poses in a novel Gravity-View (GV) coordinate system, which is defined by the world gravity and the camera view direction. The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, largely reducing the ambiguity of learning image-pose mapping. The estimated poses can be transformed back to the world coordinate system using camera rotations, forming a global motion sequence. Additionally, the per-frame estimation avoids error accumulation in the autoregressive methods. Experiments on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both the camera space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed. The code is available at this https URL.
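
A frame defined by gravity and the camera view direction can be built with two cross products. The sketch below is one plausible construction and not necessarily the paper's: it assumes both vectors are given in camera coordinates and picks an axis convention (y up, z along the horizontal part of the view direction) purely for illustration. Function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("zero-length vector")
    return [x / n for x in v]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def gravity_view_frame(gravity, view_dir):
    """Return orthonormal axes (x, y, z) from gravity and view direction.

    Assumed convention: y points up (opposite to gravity), z is the
    horizontal component of the view direction, and x completes a
    right-handed frame.
    """
    y = normalize([-g for g in gravity])
    x = normalize(cross(y, view_dir))  # raises if view is parallel to gravity
    z = cross(x, y)
    return x, y, z
```

The frame is undefined when the camera looks straight up or down, since the view direction then has no horizontal component; the sketch raises `ValueError` in that case.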

[AI-8] EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

Link: https://arxiv.org/abs/2409.06644
Authors: Danli Shi, Weiyi Zhang, Jiancheng Yang, Siyu Huang, Xiaolan Chen, Mayinuer Yusufu, Kai Jin, Shan Lin, Shunming Liu, Qing Zhang, Mingguang He
Keywords: preventing vision loss, Early detection, macular degeneration, vision loss, diabetic retinopathy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While artificial intelligence (AI) foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduced a pretraining strategy that combines self-supervised reconstructions, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Through evaluation using 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially showcasing few-shot, even zero-shot capabilities in real-world long-tail scenarios.

[AI-9] MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

Link: https://arxiv.org/abs/2409.06635
Authors: Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
Keywords: language processing capabilities, natural language processing, enhanced natural language, inputs alongside text, large language models
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

[AI-10] A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Link: https://arxiv.org/abs/2409.06624
Authors: Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji
Keywords: Large Language Models, unfamiliar language skill, Continual Pre-Trained, Language Mixture Ratio, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures

[AI-11] One-Shot Imitation under Mismatched Execution

Link: https://arxiv.org/abs/2409.06615
Authors: Kushal Kedia, Prithwish Dan, Sanjiban Choudhury
Keywords: long-horizon manipulation tasks, Human demonstrations, demonstrations, manipulation tasks, long-horizon manipulation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, directly translating such demonstrations into robot-executable actions poses significant challenges due to execution mismatches, such as different movement styles and physical capabilities. Existing methods either rely on robot-demonstrator paired data, which is infeasible to scale, or overly rely on frame-level visual similarities, which fail to hold. To address these challenges, we propose RHyME, a novel framework that automatically establishes task execution correspondences between the robot and the demonstrator by using optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human demonstrations by retrieving and composing similar short-horizon human clips, facilitating effective policy training without the need for paired data. We show that RHyME outperforms a range of baselines across various cross-embodiment datasets on all degrees of mismatches. Through detailed analysis, we uncover insights for learning and leveraging cross-embodiment visual representations.

[AI-12] Label-free Monitoring of Self-Supervised Learning Progress

Link: https://arxiv.org/abs/2409.06612
Authors: Isaac Xu, Scott Lowe, Thomas Trappenberg
Keywords: high-level embedding space, exploiting unlabelled data, downstream tasks, Self-supervised learning, learn a high-level
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Self-supervised learning (SSL) is an effective method for exploiting unlabelled data to learn a high-level embedding space that can be used for various downstream tasks. However, existing methods to monitor the quality of the encoder (either during training for one model or to compare several trained models) still rely on access to annotated data. When SSL methodologies are applied to new data domains, a sufficiently large labelled dataset may not always be available. In this study, we propose several evaluation metrics which can be applied on the embeddings of unlabelled data and investigate their viability by comparing them to linear probe accuracy (a common metric which utilizes an annotated dataset). In particular, we apply k-means clustering and measure the clustering quality with the silhouette score and clustering agreement. We also measure the entropy of the embedding distribution. We find that while the clusters did correspond better to the ground truth annotations as training of the network progressed, label-free clustering metrics correlated with the linear probe accuracy only when training with SSL methods SimCLR and MoCo-v2, but not with SimSiam. Additionally, although entropy did not always have strong correlations with LP accuracy, this appears to be due to instability arising from early training, with the metric stabilizing and becoming more reliable at later stages of learning. Furthermore, while entropy generally decreases as learning progresses, this trend reverses for SimSiam. More research is required to establish the cause for this unexpected behaviour. Lastly, we find that while clustering based approaches are likely only viable for same-architecture comparisons, entropy may be architecture-independent.
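
The silhouette score mentioned above needs only embeddings and cluster assignments, no annotations. Here is a minimal pure-Python sketch of the standard definition, assuming Euclidean distance and cluster labels that were already produced (for example by k-means); it is not the authors' evaluation code, and the function name is hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance.

    For each point: a = mean distance to its own cluster (excluding itself),
    b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
    Points in singleton clusters score 0 by convention.
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Scores near 1 indicate compact, well-separated clusters and negative scores indicate points that sit closer to another cluster than to their own, so the value can be tracked across training checkpoints without labels.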

[AI-13] Simulation-based Scenario Generation for Robust Hybrid AI for Autonomy

Link: https://arxiv.org/abs/2409.06608
Authors: Hambisa Keno, Nicholas J. Pioch, Christopher Guagliano, Timothy H. Chung
Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, emergency management, search and rescue
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures, 1 table

Abstract:Application of Unmanned Aerial Vehicles (UAVs) in search and rescue, emergency management, and law enforcement has gained traction with the advent of low-cost platforms and sensor payloads. The emergence of hybrid neural and symbolic AI approaches for complex reasoning is expected to further push the boundaries of these applications with decreasing levels of human intervention. However, current UAV simulation environments lack semantic context suited to this hybrid approach. To address this gap, HAMERITT (Hybrid Ai Mission Environment for RapId Training and Testing) provides a simulation-based autonomy software framework that supports the training, testing and assurance of neuro-symbolic algorithms for autonomous maneuver and perception reasoning. HAMERITT includes scenario generation capabilities that offer mission-relevant contextual symbolic information in addition to raw sensor data. Scenarios include symbolic descriptions for entities of interest and their relations to scene elements, as well as spatial-temporal constraints in the form of time-bounded areas of interest with prior probabilities and restricted zones within those areas. HAMERITT also features support for training distinct algorithm threads for maneuver vs. perception within an end-to-end mission run. Future work includes improving scenario realism and scaling symbolic context generation through automated workflow.

[AI-14] An Ontology-based Approach Towards Traceable Behavior Specifications in Automated Driving

Link: https://arxiv.org/abs/2409.06607
Authors: Nayel Fabian Salem, Marcus Nolte, Veronica Haber, Till Menzel, Hans Steege, Robert Graubohm, Markus Maurer
Keywords: Automated Driving Systems, Semantic Norm Behavior, Norm Behavior Analysis, Automated Driving, number of expectations
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 22 pages, 12 figures, submitted for publication

Abstract:Vehicles in public traffic that are equipped with Automated Driving Systems are subject to a number of expectations: Among other aspects, their behavior should be safe, conform to the rules of the road, and provide mobility to their users. This poses challenges for the developers of such systems: Developers are responsible for specifying this behavior, for example, in terms of requirements at system design time. As we will discuss in the article, this specification always involves the need for assumptions and trade-offs. As a result, insufficiencies in such a behavior specification can occur that can potentially lead to unsafe system behavior. In order to support the identification of specification insufficiencies, requirements and respective assumptions need to be made explicit. In this article, we propose the Semantic Norm Behavior Analysis as an ontology-based approach to specify the behavior for an Automated Driving System equipped vehicle. We use ontologies to formally represent specified behavior for a targeted operational environment, and to establish traceability between specified behavior and the addressed stakeholder needs. Furthermore, we illustrate the application of the Semantic Norm Behavior Analysis in two example scenarios and evaluate our results.

[AI-15] Developing the Temporal Graph Convolutional Neural Network Model to Predict Hip Replacement using Electronic Health Records ICMLA

Link: https://arxiv.org/abs/2409.06585
Authors: Zoe Hancox, Sarah R. Kingsbury, Andrew Clegg, Philip G. Conaghan, Samuel D. Relton
Keywords: Hip replacement, predicts hip replacement, replacement procedures improve, hip replacement risk, Hip replacement procedures
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the 2024 International Conference on Machine Learning and Applications (ICMLA). 8 pages, 3 figures, 7 tables

Abstract:Background: Hip replacement procedures improve patient lives by relieving pain and restoring mobility. Predicting hip replacement in advance could reduce pain by enabling timely interventions, prioritising individuals for surgery or rehabilitation, and utilising physiotherapy to potentially delay the need for joint replacement. This study predicts hip replacement a year in advance to enhance quality of life and health service efficiency. Methods: Adapting previous work using Temporal Graph Convolutional Neural Network (TG-CNN) models, we construct temporal graphs from primary care medical event codes, sourced from ResearchOne EHRs of 40-75-year-old patients, to predict hip replacement risk. We match hip replacement cases to controls by age, sex, and Index of Multiple Deprivation. The model, trained on 9,187 cases and 9,187 controls, predicts hip replacement one year in advance. We validate the model on two unseen datasets, recalibrating for class imbalance. Additionally, we conduct an ablation study and compare against four baseline models. Results: Our best model predicts hip replacement risk one year in advance with an AUROC of 0.724 (95% CI: 0.715-0.733) and an AUPRC of 0.185 (95% CI: 0.160-0.209), achieving a calibration slope of 1.107 (95% CI: 1.074-1.139) after recalibration. Conclusions: The TG-CNN model effectively predicts hip replacement risk by identifying patterns in patient trajectories, potentially improving understanding and management of hip-related conditions.
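
For readers unfamiliar with the headline metric, AUROC can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The sketch below computes it directly from that definition, with ties counted as half; it is a generic illustration, not the study's evaluation pipeline, and the function name is hypothetical.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cases (positives), 0 for controls (negatives).
    Quadratic in the number of samples, which is fine for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported 0.724.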

[AI-16] Quantifying and Enabling the Interpretability of CLIP-like Models

Link: https://arxiv.org/abs/2409.06579
Authors: Avinash Madasu, Yossi Gandelsman, Vasudev Lal, Phillip Howard
Keywords: popular foundational models, CLIP, vision-language tasks, CLIP models, popular foundational
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. To bridge this gap, we propose a study to quantify the interpretability of CLIP-like models. We conduct this study on six different CLIP models from OpenAI and OpenCLIP which vary by size, type of pre-training data and patch size. Our approach begins with using the TEXTSPAN algorithm and in-context learning to break down individual attention heads into specific properties. We then evaluate how easily these heads can be interpreted using new metrics which measure property consistency within heads and property disentanglement across heads. Our findings reveal that larger CLIP models are generally more interpretable than their smaller counterparts. To further assist users in understanding the inner workings of CLIP models, we introduce CLIP-InterpreT, a tool designed for interpretability analysis. CLIP-InterpreT offers five types of analyses: property-based nearest neighbor search, per-head topic segmentation, contrastive segmentation, per-head nearest neighbors of an image, and per-head nearest neighbors of text.

[AI-17] Indirect Dynamic Negotiation in the Nash Demand Game

Link: https://arxiv.org/abs/2409.06566
Authors: Tatiana V. Guy, Jitka Homolová, Aleksej Gaj
Keywords: sequential bilateral bargaining, incomplete information, addresses a problem, problem of sequential, sequential bilateral
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Appears in IEEE Access

Abstract:The paper addresses a problem of sequential bilateral bargaining with incomplete information. We propose a decision model that helps agents bargain successfully by performing indirect negotiation and learning the opponent’s model. Methodologically, the paper casts heuristically-motivated bargaining of a self-interested independent player into a framework of Bayesian learning and Markov decision processes. The special form of the reward implicitly motivates the players to negotiate indirectly, via closed-loop interaction. We illustrate the approach by applying our model to the Nash demand game, which is an abstract model of bargaining. The results indicate that the established negotiation: i) leads to coordinating players’ actions; ii) maximises the success rate of the game; and iii) brings more individual profit to the players.

[AI-18] ChatGPT's Potential in Cryptography Misuse Detection: A Comparative Analysis with Static Analysis Tools

Link: https://arxiv.org/abs/2409.06561
Authors: Ehsan Firouzi, Mohammad Ghafari, Mike Ebrahimi
Keywords (EN): widespread API misuse, correct adoption, challenging for mainstream, resulting in widespread, mainstream developers
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: ESEM 2024

Click to view abstract

Abstract:The correct adoption of cryptography APIs is challenging for mainstream developers, often resulting in widespread API misuse. Meanwhile, cryptography misuse detectors have demonstrated inconsistent performance and remain largely inaccessible to most developers. We investigated the extent to which ChatGPT can detect cryptography misuses and compared its performance with that of the state-of-the-art static analysis tools. Our investigation, mainly based on the CryptoAPI-Bench benchmark, demonstrated that ChatGPT is effective in identifying cryptography API misuses, and with the use of prompt engineering, it can even outperform leading static cryptography misuse detectors.

[AI-19] Questioning Internal Knowledge Structure of Large Language Models Through the Lens of the Olympic Games

Link: https://arxiv.org/abs/2409.06518
Authors: Juhwan Choi, YoungBin Kim
Keywords (EN): Large language models, natural language processing, internal knowledge structures, remain largely unexplored, Large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

[AI-20] Sine Transient Noise Neural Modeling of Piano Notes

Link: https://arxiv.org/abs/2409.06513
Authors: Riccardo Simionato, Stefano Fasciani
Keywords (EN): emulating piano sounds, paper introduces, piano sounds, synthesizer replicating piano, replicating piano notes
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:This paper introduces a novel method for emulating piano sounds. We propose to exploit the sine, transient, and noise decomposition to design a differentiable spectral modeling synthesizer replicating piano notes. Three sub-modules learn these components from piano recordings and generate the corresponding harmonic, transient, and noise signals. Splitting the emulation into three independently trainable models reduces the modeling tasks’ complexity. The quasi-harmonic content is produced using a differentiable sinusoidal model guided by physics-derived formulas, whose parameters are automatically estimated from audio recordings. The noise sub-module uses a learnable time-varying filter, and the transients are generated using a deep convolutional network. From singular notes, we emulate the coupling between different keys in trichords with a convolutional-based network. Results show the model matches the partial distribution of the target while predicting the energy in the higher part of the spectrum presents more challenges. The energy distribution in the spectra of the transient and noise components is accurate overall. While the model is more computationally and memory efficient, perceptual tests reveal limitations in accurately modeling the attack phase of notes. Despite this, it generally achieves perceptual accuracy in emulating single notes and trichords.
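To make the "sinusoidal model guided by physics-derived formulas" more concrete, here is a minimal sketch. It assumes the well-known stiff-string inharmonicity relation f_n = n * f0 * sqrt(1 + B * n^2); the abstract does not say which formulas the authors use, so this choice, the function names, the 1/n partial amplitudes, and the single exponential decay are all illustrative assumptions, not the paper's synthesizer.

```python
import math


def partial_frequencies(f0, inharmonicity, n_partials):
    """Partial frequencies of a stiff string: f_n = n * f0 * sqrt(1 + B * n^2).

    With B = 0 the partials are exactly harmonic; B > 0 stretches them upward.
    """
    return [
        n * f0 * math.sqrt(1.0 + inharmonicity * n * n)
        for n in range(1, n_partials + 1)
    ]


def synthesize(f0, inharmonicity, n_partials, duration_s, sample_rate=16000, decay=3.0):
    """Additive synthesis of the quasi-harmonic part only (no transient, no noise).

    Assumed for illustration: partial n has amplitude 1/n and all partials
    share one exponential decay envelope.
    """
    freqs = partial_frequencies(f0, inharmonicity, n_partials)
    n_samples = int(duration_s * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        envelope = math.exp(-decay * t)
        value = sum(
            math.sin(2.0 * math.pi * f * t) / n
            for n, f in enumerate(freqs, start=1)
        )
        samples.append(envelope * value)
    return samples
```

In a differentiable version, quantities such as `inharmonicity` and `decay` would be learnable parameters estimated from recordings rather than fixed constants.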

[AI-21] Aligning Machine and Human Visual Representations across Abstraction Levels

Link: https://arxiv.org/abs/2409.06509
Authors: Lukas Muttenthaler, Klaus Greff, Frieda Born, Bernhard Spitzer, Simon Kornblith, Michael C. Mozer, Klaus-Robert Müller, Thomas Unterthiner, Andrew K. Lampinen
Keywords (EN): Deep neural networks, Deep neural, neural networks, achieved success, neural network training
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 51 pages

Click to view abstract

Abstract:Deep neural networks have achieved success across a wide range of applications, including as models of human behavior in vision tasks. However, neural network training and human learning differ in fundamental ways, and neural networks often fail to generalize as robustly as humans do, raising questions regarding the similarity of their underlying representations. What is missing for modern learning systems to exhibit more human-like behavior? We highlight a key misalignment between vision models and humans: whereas human conceptual knowledge is hierarchically organized from fine- to coarse-scale distinctions, model representations do not accurately capture all these levels of abstraction. To address this misalignment, we first train a teacher model to imitate human judgments, then transfer human-like structure from its representations into pretrained state-of-the-art vision foundation models. These human-aligned models more accurately approximate human behavior and uncertainty across a wide range of similarity tasks, including a new dataset of human judgments spanning multiple levels of semantic abstractions. They also perform better on a diverse set of machine learning tasks, increasing generalization and out-of-distribution robustness. Thus, infusing neural networks with additional human knowledge yields a best-of-both-worlds representation that is both more consistent with human cognition and more practically useful, thus paving the way toward more robust, interpretable, and human-like artificial intelligence systems.

[AI-22] Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2409.06493
Authors: Rohit Jena, Ali Taghibakhshi, Sahil Jain, Gerald Shen, Nima Tajbakhsh, Arash Vahdat
Keywords (EN): generating high-fidelity images, text prompts, prominent tools, tools for generating, generating high-fidelity
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Text-to-image (T2I) diffusion models have become prominent tools for generating high-fidelity images from text prompts. However, when trained on unfiltered internet data, these models can produce unsafe, incorrect, or stylistically undesirable images that are not aligned with human preferences. To address this, recent approaches have incorporated human preference datasets to fine-tune T2I models or to optimize reward functions that capture these preferences. Although effective, these methods are vulnerable to reward hacking, where the model overfits to the reward function, leading to a loss of diversity in the generated images. In this paper, we prove the inevitability of reward hacking and study natural regularization techniques like KL divergence and LoRA scaling, and their limitations for diffusion models. We also introduce Annealed Importance Guidance (AIG), an inference-time regularization inspired by Annealed Importance Sampling, which retains the diversity of the base model while achieving Pareto-Optimal reward-diversity tradeoffs. Our experiments demonstrate the benefits of AIG for Stable Diffusion models, striking the optimal balance between reward optimization and image diversity. Furthermore, a user study confirms that AIG improves diversity and quality of generated images across different model architectures and reward functions.

[AI-23] Superior Computer Chess with Model Predictive Control Reinforcement Learning and Rollout

Link: https://arxiv.org/abs/2409.06477
Authors: Atharva Gundawar, Yuchao Li, Dimitri Bertsekas
Keywords (EN): apply model predictive, model predictive control, methodologies to computer, paper we apply, apply model
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:In this paper we apply model predictive control (MPC), rollout, and reinforcement learning (RL) methodologies to computer chess. We introduce a new architecture for move selection, within which available chess engines are used as components. One engine is used to provide position evaluations in an approximation in value space MPC/RL scheme, while a second engine is used as nominal opponent, to emulate or approximate the moves of the true opponent player. We show that our architecture substantially improves the performance of the position evaluation engine. In other words, our architecture provides an additional layer of intelligence, on top of the intelligence of the engines on which it is based. This is true for any engine, regardless of its strength: top engines such as Stockfish and Komodo Dragon (of varying strengths), as well as weaker engines. Structurally, our basic architecture selects moves by a one-move lookahead search, with an intermediate move generated by a nominal opponent engine, and followed by a position evaluation by another chess engine. Simpler schemes that forgo the use of the nominal opponent also perform better than the position evaluator, but not quite by as much. More complex schemes, involving multistep lookahead, may also be used and generally tend to perform better as the length of the lookahead increases. Theoretically, our methodology relies on generic cost improvement properties and the superlinear convergence framework of Newton’s method, which fundamentally underlies approximation in value space, and related MPC/RL and rollout/policy iteration schemes. A critical requirement of this framework is that the first lookahead step should be executed exactly. This fact has guided our architectural choices, and is apparently an important factor in improving the performance of even the best available chess engines.
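The move-selection structure described in the abstract (one-move lookahead, a reply from a nominal opponent, then a position evaluation) can be sketched generically. The code below is only an illustration under stated assumptions: it uses a toy stone-taking game instead of chess, and `select_move`, `greedy_opponent`, and `evaluate` are hypothetical stand-ins for the chess engines the authors plug in.

```python
def select_move(state, legal_moves, apply_move, nominal_reply, evaluate):
    """One-move lookahead: try each of our moves exactly, let a nominal
    opponent answer, and score the resulting position with an evaluator."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(state):
        after_ours = apply_move(state, move)
        reply = nominal_reply(after_ours)  # None when the game is already over
        after_reply = after_ours if reply is None else apply_move(after_ours, reply)
        score = evaluate(after_reply)
        if score > best_score:
            best_move, best_score = move, score
    return best_move


# Toy game (hypothetical, not chess): players alternately remove 1 to 3 stones
# from a pile, and whoever takes the last stone wins. A state is (pile, our_turn).
def legal_moves(state):
    pile, _ = state
    return [k for k in (1, 2, 3) if k <= pile]


def apply_move(state, k):
    pile, our_turn = state
    return (pile - k, not our_turn)


def greedy_opponent(state):
    """Nominal opponent: always grabs as many stones as allowed."""
    pile, _ = state
    return min(3, pile) if pile > 0 else None


def evaluate(state):
    """Position evaluator from our point of view: +1 is good for us, -1 is bad."""
    pile, our_turn = state
    if pile == 0:
        # The player who just moved took the last stone and won.
        return -1 if our_turn else 1
    # Known result for this toy game: the side to move loses on multiples of 4.
    to_move_wins = pile % 4 != 0
    return 1 if to_move_wins == our_turn else -1
```

For example, `select_move((3, True), legal_moves, apply_move, greedy_opponent, evaluate)` takes all three stones. Note that our own move is enumerated exactly, while the opponent's reply and everything after it are approximated, which mirrors the requirement that the first lookahead step be executed exactly.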

[AI-24] An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition

Link: https://arxiv.org/abs/2409.06468
Authors: Yi-Cheng Wang, Li-Ting Pai, Bi-Cheng Yan, Hsin-Wei Wang, Chi-Han Lin, Berlin Chen
Keywords (EN): automatic speech recognition, ASR models, context list, ASR, automatic speech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by SLT 2024

Click to view abstract

[AI-25] Multimodal Large Language Model Driven Scenario Testing for Autonomous Vehicles

Link: https://arxiv.org/abs/2409.06450
Authors: Qiujing Lu, Xuanhan Wang, Yiwei Jiang, Guangming Zhao, Mingyue Ma, Shuo Feng
Keywords (EN): autonomous vehicles prior, efficiently testing autonomous, road deployment, corner cases, increasingly crucial
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:The generation of corner cases has become increasingly crucial for efficiently testing autonomous vehicles prior to road deployment. However, existing methods struggle to accommodate diverse testing requirements and often lack the ability to generalize to unseen situations, thereby reducing the convenience and usability of the generated scenarios. A method that facilitates easily controllable scenario generation for efficient autonomous vehicle (AV) testing with realistic and challenging situations is greatly needed. To address this, we propose OmniTester: a multimodal Large Language Model (LLM) based framework that fully leverages the extensive world knowledge and reasoning capabilities of LLMs. OmniTester is designed to generate realistic and diverse scenarios within a simulation environment, offering a robust solution for testing and evaluating AVs. In addition to prompt engineering, we employ tools from Simulation of Urban Mobility to simplify the complexity of codes generated by LLMs. Furthermore, we incorporate Retrieval-Augmented Generation and a self-improvement mechanism to enhance the LLM’s understanding of scenarios, thereby increasing its ability to produce more realistic scenes. In the experiments, we demonstrate the controllability and realism of our approach in generating three types of challenging and complex scenarios. Additionally, we showcase its effectiveness in reconstructing new scenarios described in crash reports, driven by the generalization capability of LLMs.

[AI-26] HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data

Link: https://arxiv.org/abs/2409.06446
Authors: Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz
Keywords (EN): shown great potential, GitHub Copilot, Large language models, Large language, shown great
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 24 pages, 16 tables, 8 figures

Click to view abstract

[AI-27] Learning Generative Interactive Environments By Trained Agent Exploration

Link: https://arxiv.org/abs/2409.06445
Authors: Naser Kazemi, Nedko Savov, Danda Paudel, Luc Van Gool
Keywords (EN): World models, increasingly pivotal, pivotal in interpreting, interpreting and simulating, simulating the rules
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:World models are increasingly pivotal in interpreting and simulating the rules and actions of complex environments. Genie, a recent model, excels at learning from visually diverse environments but relies on costly human-collected data. We observe that their alternative method of using random agents is too limited to explore the environment. We propose to improve the model by employing reinforcement learning based agents for data generation. This approach produces diverse datasets that enhance the model’s ability to adapt and perform well across various scenarios and realistic actions within the environment. In this paper, we first release the model GenieRedux - an implementation based on Genie. Additionally, we introduce GenieRedux-G, a variant that uses the agent’s readily available actions to factor out action prediction uncertainty during validation. Our evaluation, including a replication of the Coinrun case study, shows that GenieRedux-G achieves superior visual fidelity and controllability using the trained agent exploration. The proposed approach is reproducible, scalable and adaptable to new types of environments. Our codebase is available at this https URL.

[AI-28] GeMuCo: Generalized Multisensory Correlational Model for Body Schema Learning

Link: https://arxiv.org/abs/2409.06427
Authors: Kento Kawaharazuka, Kei Okada, Masayuki Inaba
Keywords (EN): current robots control, autonomously learn, sensation and motion, move while continuously, continuously adapting
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at IEEE Robotics and Automation Magazine

Click to view abstract

Abstract:Humans can autonomously learn the relationship between sensation and motion in their own bodies, estimate and control their own body states, and move while continuously adapting to the current environment. On the other hand, current robots control their bodies by learning the network structure described by humans from their experiences, making certain assumptions on the relationship between sensors and actuators. In addition, the network model does not adapt to changes in the robot’s body, the tools that are grasped, or the environment, and there is no unified theory, not only for control but also for state estimation, anomaly detection, simulation, and so on. In this study, we propose a Generalized Multisensory Correlational Model (GeMuCo), in which the robot itself acquires a body schema describing the correlation between sensors and actuators from its own experience, including model structures such as network input/output. The robot adapts to the current environment by updating this body schema model online, estimates and controls its body state, and even performs anomaly detection and simulation. We demonstrate the effectiveness of this method by applying it to tool-use considering changes in grasping state for an axis-driven robot, to joint-muscle mapping learning for a musculoskeletal robot, and to full-body tool manipulation for a low-rigidity plastic-made humanoid.

[AI-29] Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes

Link: https://arxiv.org/abs/2409.06416
Authors: Ludvig Lemner, Linnea Wahlgren, Gregory Gay, Nasser Mohammadiha, Jingxiong Liu, Joakim Wennerberg
Keywords (EN): performing test maintenance, test maintenance, software testing process, test, effort required
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under submission to ACM TOSEM

Click to view abstract

Abstract:Much of the cost and effort required during the software testing process is invested in performing test maintenance - the addition, removal, or modification of test cases to keep the test suite in sync with the system-under-test or to otherwise improve its quality. Tool support could reduce the cost - and improve the quality - of test maintenance by automating aspects of the process or by providing guidance and support to developers. In this study, we explore the capabilities and applications of large language models (LLMs) - complex machine learning models adapted to textual analysis - to support test maintenance. We conducted a case study at Ericsson AB where we explored the triggers that indicate the need for test maintenance, the actions that LLMs can take, and the considerations that must be made when deploying LLMs in an industrial setting. We also proposed and demonstrated implementations of two multi-agent architectures that can predict which test cases require maintenance following a change to the source code. Collectively, these contributions advance our theoretical and practical understanding of how LLMs can be deployed to benefit industrial test maintenance processes.

[AI-30] Symmetry Breaking in Neural Network Optimization: Insights from Input Dimension Expansion

Link: https://arxiv.org/abs/2409.06402
Authors: Jun-Jie Zhang, Nan Cheng, Fu-Peng Li, Xiu-Cheng Wang, Jian-Nan Chen, Long-Gang Pang, Deyu Meng
Keywords (EN): symmetry breaking, neural network optimization, symmetry, network optimization, network
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
Comments: 29 pages, 8 figures

Click to view abstract

Abstract:Understanding the mechanisms behind neural network optimization is crucial for improving network design and performance. While various optimization techniques have been developed, a comprehensive understanding of the underlying principles that govern these techniques remains elusive. Specifically, the role of symmetry breaking, a fundamental concept in physics, has not been fully explored in neural network optimization. This gap in knowledge limits our ability to design networks that are both efficient and effective. Here, we propose the symmetry breaking hypothesis to elucidate the significance of symmetry breaking in enhancing neural network optimization. We demonstrate that a simple input expansion can significantly improve network performance across various tasks, and we show that this improvement can be attributed to the underlying symmetry breaking mechanism. We further develop a metric to quantify the degree of symmetry breaking in neural networks, providing a practical approach to evaluate and guide network design. Our findings confirm that symmetry breaking is a fundamental principle that underpins various optimization techniques, including dropout, batch normalization, and equivariance. By quantifying the degree of symmetry breaking, our work offers a practical technique for performance enhancement and a metric to guide network design without the need for complete datasets and extensive training processes.

[AI-31] Distilling Generative-Discriminative Representations for Very Low-Resolution Face Recognition

Link: https://arxiv.org/abs/2409.06371
Authors: Junzheng Zhang, Weijia Guo, Bochao Liu, Ruixin Shi, Yong Li, Shiming Ge
Keywords (EN): low-resolution face recognition, informative facial details, low-resolution face, resolution degradation, challenging due
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Very low-resolution face recognition is challenging due to the serious loss of informative facial details in resolution degradation. In this paper, we propose a generative-discriminative representation distillation approach that combines generative representation with cross-resolution aligned knowledge distillation. This approach facilitates very low-resolution face recognition by jointly distilling generative and discriminative models via two distillation modules. Firstly, the generative representation distillation takes the encoder of a diffusion model pretrained for face super-resolution as the generative teacher to supervise the learning of the student backbone via feature regression, and then freezes the student backbone. After that, the discriminative representation distillation further considers a pretrained face recognizer as the discriminative teacher to supervise the learning of the student head via cross-resolution relational contrastive distillation. In this way, the general backbone representation can be transformed into discriminative head representation, leading to a robust and discriminative student model for very low-resolution face recognition. Our approach improves the recovery of the missing details in very low-resolution faces and achieves better knowledge transfer. Extensive experiments on face datasets demonstrate that our approach enhances the recognition accuracy of very low-resolution faces, showcasing its effectiveness and adaptability.

[AI-32] Texture-AD: An Anomaly Detection Dataset and Benchmark for Real Algorithm Development

Link: https://arxiv.org/abs/2409.06367
Authors: Tianwu Lei, Bohan Wang, Silin Chen, Shurong Cao, Ningmu Zou
Keywords (EN): significant advancements recently, made significant advancements, Anomaly detection, advancements recently, crucial process
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Anomaly detection is a crucial process in industrial manufacturing and has made significant advancements recently. However, there is a large variance between the data used in development and the data collected in the production environment. Therefore, we present the Texture-AD benchmark based on representative texture-based anomaly detection to evaluate the effectiveness of unsupervised anomaly detection algorithms in real-world applications. This dataset includes images of 15 different cloths, 14 semiconductor wafers and 10 metal plates acquired under different optical schemes. In addition, it includes more than 10 different types of defects produced during real manufacturing processes, such as scratches, wrinkles, color variations and point defects, which are often more difficult to detect than those in existing datasets. All anomalous areas are provided with pixel-level annotations to facilitate comprehensive evaluation using anomaly detection models. Specifically, to adapt to diverse products in automated pipelines, we present a new evaluation method and results of baseline algorithms. The experimental results show that Texture-AD is a difficult challenge for state-of-the-art algorithms. To our knowledge, Texture-AD is the first dataset to be devoted to evaluating industrial defect detection algorithms in the real world. The dataset is available at https://XXX.

[AI-33] Connecting Concept Convexity and Human-Machine Alignment in Deep Neural Networks

Link: https://arxiv.org/abs/2409.06362
Authors: Teresa Dorszewski, Lenka Tětková, Lorenz Linhardt, Lars Kai Hansen
Keywords (EN): reliable AI systems, developing more interpretable, interpretable and reliable, human cognitive processes, neural networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: First two authors contributed equally

Click to view abstract

Abstract:Understanding how neural networks align with human cognitive processes is a crucial step toward developing more interpretable and reliable AI systems. Motivated by theories of human cognition, this study examines the relationship between convexity in neural network representations and human-machine alignment based on behavioral data. We identify a correlation between these two dimensions in pretrained and fine-tuned vision transformer models. Our findings suggest that the convex regions formed in latent spaces of neural networks to some extent align with human-defined categories and reflect the similarity relations humans use in cognitive tasks. While optimizing for alignment generally enhances convexity, increasing convexity through fine-tuning yields inconsistent effects on alignment, which suggests a complex relationship between the two. This study presents a first step toward understanding the relationship between the convexity of latent representations and human-machine alignment.

[AI-34] MAGDA: Multi-agent guideline-driven diagnostic assistance

Link: https://arxiv.org/abs/2409.06351
Authors: David Bani-Harouni, Nassir Navab, Matthias Keicher
Keywords (EN): rural hospitals, lack fast image, fast image analysis, emergency departments, developed regions
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In emergency departments, rural hospitals, or clinics in less developed regions, clinicians often lack fast image analysis by trained radiologists, which can have a detrimental effect on patients’ healthcare. Large Language Models (LLMs) have the potential to alleviate some pressure from these clinicians by providing insights that can help them in their decision-making. While these LLMs achieve high test results on medical exams showcasing their great theoretical medical knowledge, they tend not to follow medical guidelines. In this work, we introduce a new approach for zero-shot guideline-driven decision support. We model a system of multiple LLM agents augmented with a contrastive vision-language model that collaborate to reach a patient diagnosis. After providing the agents with simple diagnostic guidelines, they will synthesize prompts and screen the image for findings following these guidelines. Finally, they provide understandable chain-of-thought reasoning for their diagnosis, which is then self-refined to consider inter-dependencies between diseases. As our method is zero-shot, it is adaptable to settings with rare diseases, where training data is limited, but expert-crafted disease descriptions are available. We evaluate our method on two chest X-ray datasets, CheXpert and ChestX-ray 14 Longtail, showcasing performance improvement over existing zero-shot methods and generalizability to rare diseases.

[AI-35] VoiceWukong: Benchmarking Deepfake Voice Detection

Link: https://arxiv.org/abs/2409.06348
Authors: Ziwei Yan, Yanjie Zhao, Haoyu Wang
Keywords (EN): detecting deepfake voices, deepfake voice, increasingly crucial, rapid advancement, advancement of technologies
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio, where different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at this https URL.
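Since the headline result is reported as an equal error rate (EER), a short sketch of how that metric is commonly computed may help. It assumes that higher scores mean "more likely fake" and that label 1 marks a deepfake; the function name and the simple threshold sweep are illustrative choices, not the benchmark's own evaluation code.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER: the operating point where the false positive rate
    (real audio flagged as fake) equals the false negative rate (fake audio
    missed). Sweeps every observed score as a threshold and returns the mean
    of the two rates at the threshold where they are closest."""
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    if not fakes or not reals:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        fpr = sum(s >= threshold for s in reals) / len(reals)
        fnr = sum(s < threshold for s in fakes) / len(fakes)
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer
```

A lower EER is better: a detector that separates the two classes perfectly reaches 0, while one that is no better than chance sits near 0.5.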

[AI-36] Compute-Update Federated Learning: A Lattice Coding Approach

Link: https://arxiv.org/abs/2409.06343
Authors: Seyed Mohammad Azimi-Abarghouyi, Lav R. Varshney
Keywords (EN): joint source-channel coding, source-channel coding scheme, federated learning framework, framework that enables, computation via digital
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Extended version of the preprint available at arXiv:2403.01023

Click to view abstract

Abstract:This paper introduces a federated learning framework that enables over-the-air computation via digital communications, using a new joint source-channel coding scheme. Without relying on channel state information at devices, this scheme employs lattice codes to both quantize model parameters and exploit interference from the devices. We propose a novel receiver structure at the server, designed to reliably decode an integer combination of the quantized model parameters as a lattice point for the purpose of aggregation. We present a mathematical approach to derive a convergence bound for the proposed scheme and offer design remarks. In this context, we suggest an aggregation metric and a corresponding algorithm to determine effective integer coefficients for the aggregation in each communication round. Our results illustrate that, regardless of channel dynamics and data heterogeneity, our scheme consistently delivers superior learning accuracy across various parameters and markedly surpasses other over-the-air methodologies.
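The key algebraic fact behind decoding "an integer combination of the quantized model parameters as a lattice point" is that lattices are closed under integer combinations. The sketch below illustrates only that fact, assuming the simplest possible lattice (a scaled integer grid, step * Z^n) and a noiseless channel; the function names, the step size, and the coefficients are hypothetical, and the paper's actual coding scheme and receiver are not reproduced here.

```python
def quantize(params, step):
    """Map each parameter to the integer index of its nearest point on step * Z."""
    return [round(p / step) for p in params]


def integer_combination(index_vectors, coeffs):
    """Coordinate-wise integer combination sum_k a_k * q_k of lattice indices.

    Because the indices and coefficients are integers, the result is again a
    valid lattice index vector, so it can be decoded as a single lattice point.
    """
    dim = len(index_vectors[0])
    return [
        sum(a * q[i] for a, q in zip(coeffs, index_vectors))
        for i in range(dim)
    ]


def aggregate(index_vectors, coeffs, step):
    """Turn the decoded combination into a weighted average of the
    quantized parameters (weights proportional to the coefficients)."""
    combo = integer_combination(index_vectors, coeffs)
    total = sum(coeffs)
    return [step * c / total for c in combo]


# Two hypothetical devices, each with two model parameters.
device_params = [[0.9, -0.3], [1.6, 0.2]]
indices = [quantize(p, step=0.5) for p in device_params]
combined = integer_combination(indices, coeffs=[1, 2])
```

The server never needs the individual vectors in this picture: it only decodes `combined`, which is why interference between devices can be exploited rather than avoided.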

[AI-37] LAMP: Learnable Meta-Path Guided Adversarial Contrastive Learning for Heterogeneous Graphs

Link: https://arxiv.org/abs/2409.06323
Authors: Siqing Li,Jin-Duk Park,Wei Huang,Xin Cao,Won-Yong Shin,Zhiqiang Xu
Keywords: Heterogeneous graph neural, Heterogeneous graph, Heterogeneous Graph Benchmark, Heterogeneous Graph Contrastive, graph neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures

Abstract:Heterogeneous graph neural networks (HGNNs) have significantly propelled the information retrieval (IR) field. Still, the effectiveness of HGNNs heavily relies on high-quality labels, which are often expensive to acquire. This challenge has shifted attention towards Heterogeneous Graph Contrastive Learning (HGCL), which usually requires pre-defined meta-paths. However, our findings reveal that meta-path combinations significantly affect performance in unsupervised settings, an aspect often overlooked in current literature. Existing HGCL methods have considerable variability in outcomes across different meta-path combinations, thereby challenging the optimization process to achieve consistent and high performance. In response, we introduce LAMP (LearnAble Meta-Path), a novel adversarial contrastive learning approach that integrates various meta-path sub-graphs into a unified and stable structure, leveraging the overlap among these sub-graphs. To address the denseness of this integrated sub-graph, we propose an adversarial training strategy for edge pruning, maintaining sparsity to enhance model performance and robustness. LAMP aims to maximize the difference between meta-path and network schema views for guiding contrastive learning to capture the most meaningful information. Our extensive experimental study conducted on four diverse datasets from the Heterogeneous Graph Benchmark (HGB) demonstrates that LAMP significantly outperforms existing state-of-the-art unsupervised models in terms of accuracy and robustness.

[AI-38] PharmacoMatch: Efficient 3D Pharmacophore Screening through Neural Subgraph Matching

Link: https://arxiv.org/abs/2409.06316
Authors: Daniel Rose,Oliver Wieder,Thomas Seidel,Thierry Langer
Keywords: screening libraries poses, drug discovery, necessitating a re-evaluation, big data, increasing size
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)

Abstract:The increasing size of screening libraries poses a significant challenge for the development of virtual screening methods for drug discovery, necessitating a re-evaluation of traditional approaches in the era of big data. Although 3D pharmacophore screening remains a prevalent technique, its application to very large datasets is limited by the computational cost associated with matching query pharmacophores to database ligands. In this study, we introduce PharmacoMatch, a novel contrastive learning approach based on neural subgraph matching. Our method reinterprets pharmacophore screening as an approximate subgraph matching problem and enables efficient querying of conformational databases by encoding query-target relationships in the embedding space. We conduct comprehensive evaluations of the learned representations and benchmark our method on virtual screening datasets in a zero-shot setting. Our findings demonstrate significantly shorter runtimes for pharmacophore matching, offering a promising speed-up for screening very large datasets.

[AI-39] An End-to-End Approach for Chord-Conditioned Song Generation

Link: https://arxiv.org/abs/2409.06307
Authors: Shuochen Gao,Shun Lei,Fan Zhuo,Hangyu Liu,Feng Liu,Boshi Tang,Qiaochu Huang,Shiyin Kang,Zhiyong Wu
Keywords: synthesize music composed, Generation task aims, aims to synthesize, Song Generation task, Song Generation
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:The Song Generation task aims to synthesize music composed of vocals and accompaniment from given lyrics. While the existing method, Jukebox, has explored this task, its constrained control over the generations often leads to deficiency in music performance. To mitigate the issue, we introduce an important concept from music composition, namely chords, to song generation networks. Chords form the foundation of accompaniment and provide vocal melody with associated harmony. Given the inaccuracy of automatic chord extractors, we devise a robust cross-attention mechanism augmented with dynamic weight sequence to integrate extracted chord information into song generations and reduce frame-level flaws, and propose a novel model termed Chord-Conditioned Song Generator (CSG) based on it. Experimental evidence demonstrates our proposed method outperforms other approaches in terms of musical performance and control precision of generated songs.

[AI-40] Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Link: https://arxiv.org/abs/2409.06299
Authors: Dingxin Cheng,Mingda Li,Jingyu Liu,Yongxin Guo,Bin Jiang,Qingbin Liu,Xi Chen,Bo Zhao
Keywords: integrating visual foundation, attracted widespread attention, visual foundation models, large language models, integrating visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recently, integrating visual foundation models into large language models (LLMs) to form video understanding systems has attracted widespread attention. Most of the existing models compress diverse semantic information within the whole video and feed it into LLMs for content comprehension. While this method excels in short video understanding, it may result in a blend of multiple event information in long videos due to coarse compression, which causes information redundancy. Consequently, the semantics of key events might be obscured within the vast information that hinders the model’s understanding capabilities. To address this issue, we propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos. Firstly, we design a novel adaptive sequence segmentation scheme to divide multiple events within long videos. In this way, we can perform individual memory modeling for each event to establish intra-event contextual connections, thereby reducing information redundancy. Secondly, while modeling current event, we compress and inject the information of the previous event to enhance the long-term inter-event dependencies in videos. Finally, we perform extensive experiments on various video understanding tasks and the results show that our model achieves state-of-the-art performances.

[AI-41] User Preferences for Large Language Model versus Template-Based Explanations of Movie Recommendations: A Pilot Study

Link: https://arxiv.org/abs/2409.06297
Authors: Julien Albert,Martin Balfroid,Miriam Doh,Jeremie Bogaert,Luca La Fisca,Liesbet De Vos,Bryan Renard,Vincent Stragier,Emmanuel Jean
Keywords: streaming platforms, online shopping, shopping to streaming, explanations, Recommender systems
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented to the Dutch-Belgian Workshop on Recommender Systems 2023 (14-15 December, 2023 - Antwerp, Belgium)

Abstract:Recommender systems have become integral to our digital experiences, from online shopping to streaming platforms. Still, the rationale behind their suggestions often remains opaque to users. While some systems employ a graph-based approach, offering inherent explainability through paths associating recommended items and seed items, non-experts could not easily understand these explanations. A popular alternative is to convert graph-based explanations into textual ones using a template and an algorithm, which we denote here as "template-based" explanations. Yet, these can sometimes come across as impersonal or uninspiring. A novel method would be to employ large language models (LLMs) for this purpose, which we denote as "LLM-based". To assess the effectiveness of LLMs in generating more resonant explanations, we conducted a pilot study with 25 participants. They were presented with three explanations: (1) traditional template-based, (2) LLM-based rephrasing of the template output, and (3) purely LLM-based explanations derived from the graph-based explanations. Although subject to high variance, preliminary findings suggest that LLM-based explanations may provide a richer and more engaging user experience, further aligning with user expectations. This study sheds light on the potential limitations of current explanation methods and offers promising directions for leveraging large language models to improve user satisfaction and trust in recommender systems.

[AI-42] Catch Me if You Can: Detecting Unauthorized Data Use in Deep Learning Models

Link: https://arxiv.org/abs/2409.06280
Authors: Zitao Chen,Karthik Pattabiraman
Keywords: deep learning, rise of deep, surging demand, incentivizes the creators, data
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:The rise of deep learning (DL) has led to a surging demand for training data, which incentivizes the creators of DL models to trawl through the Internet for training materials. Meanwhile, users often have limited control over whether their data (e.g., facial images) are used to train DL models without their consent, which has engendered pressing concerns. This work proposes MembershipTracker, a practical data provenance tool that can empower ordinary users to take agency in detecting the unauthorized use of their data in training DL models. We view tracing data provenance through the lens of membership inference (MI). MembershipTracker consists of a lightweight data marking component to mark the target data with small and targeted changes, which can be strongly memorized by the model trained on them; and a specialized MI-based verification process to audit whether the model exhibits strong memorization on the target samples. Overall, MembershipTracker only requires the users to mark a small fraction of data (0.005% to 0.1% in proportion to the training set), and it enables the users to reliably detect the unauthorized use of their data (average 0% FPR@100% TPR). We show that MembershipTracker is highly effective across various settings, including industry-scale training on the full-size ImageNet-1k dataset. We finally evaluate MembershipTracker under multiple classes of countermeasures.

[AI-43] Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models

Link: https://arxiv.org/abs/2409.06277
Authors: Yao Shu,Wenyang Hu,See-Kiong Ng,Bryan Kian Hsiang Low,Fei Richard Yu
Keywords: Large Language Models, Large Language, numerous real-world applications, Language Models, model accuracy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have become indispensable in numerous real-world applications. Unfortunately, fine-tuning these models at scale, especially in federated settings where data privacy and communication efficiency are critical, presents significant challenges. Existing methods often resort to parameter-efficient fine-tuning (PEFT) to mitigate communication overhead, but this typically comes at the cost of model accuracy. To address these limitations, we propose federated full-parameter tuning at scale for LLMs (Ferret), the first first-order method with shared randomness to enable scalable full-parameter tuning of LLMs across decentralized data sources while maintaining competitive model accuracy. Ferret accomplishes this through three aspects: (1) it employs widely applied first-order methods for efficient local updates; (2) it projects these updates into a low-dimensional space to considerably reduce communication overhead; and (3) it reconstructs local updates from this low-dimensional space with shared randomness to facilitate effective full-parameter global aggregation, ensuring fast convergence and competitive final performance. Our rigorous theoretical analyses and insights, along with extensive experiments, show that Ferret significantly enhances the scalability of existing federated full-parameter tuning approaches by achieving high computational efficiency, reduced communication overhead, and fast convergence, all while maintaining competitive model accuracy. Our implementation is available at this https URL.
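
Steps (2) and (3) in the abstract above can be illustrated with a generic random-projection sketch: a client sends only a few scalars plus a seed, and the server regenerates the same random directions from that seed to rebuild an estimate of the update. This is a textbook construction offered under stated assumptions, not Ferret's actual algorithm. The function names, the Gaussian directions, and the averaging estimator are all hypothetical choices.

```python
import random


def random_directions(seed, num_dirs, dim):
    """Regenerate the same Gaussian directions on any machine that knows the seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_dirs)]


def project(update, seed, num_dirs):
    """Client side: compress a dim-sized update into num_dirs inner products."""
    dirs = random_directions(seed, num_dirs, len(update))
    return [sum(u * v for u, v in zip(update, d)) for d in dirs]


def reconstruct(coeffs, seed, dim):
    """Server side: rebuild an estimate as the average of coefficient * direction.

    With standard Gaussian directions this estimator is unbiased, and its
    error shrinks as the number of directions grows.
    """
    dirs = random_directions(seed, len(coeffs), dim)
    estimate = [0.0] * dim
    for c, d in zip(coeffs, dirs):
        for i in range(dim):
            estimate[i] += c * d[i] / len(coeffs)
    return estimate


# Toy round: an 8-dimensional update is sent as 4 scalars plus the shared seed.
local_update = [0.5, -1.0, 0.25, 0.0, 2.0, -0.5, 1.5, 0.75]
message = project(local_update, seed=42, num_dirs=4)
server_estimate = reconstruct(message, seed=42, dim=len(local_update))
print(len(message), len(server_estimate))
```

The design trade-off is visible in `num_dirs`: fewer directions mean less communication but a noisier reconstruction.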

[AI-44] Towards Robust Uncertainty-Aware Incomplete Multi-View Classification

Link: https://arxiv.org/abs/2409.06270
Authors: Mulin Chen,Haojian Huang,Qiang Li
Keywords: Evidential Deep Learning, Handling incomplete data, compromise uncertainty estimation, Progressive Learning Network, Existing Evidential Deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Ongoing work: 9 pages, 6 figures, 2 tables

Abstract:Handling incomplete data in multi-view classification is challenging, especially when traditional imputation methods introduce biases that compromise uncertainty estimation. Existing Evidential Deep Learning (EDL) based approaches attempt to address these issues, but they often struggle with conflicting evidence due to the limitations of the Dempster-Shafer combination rule, leading to unreliable decisions. To address these challenges, we propose the Alternating Progressive Learning Network (APLN), specifically designed to enhance EDL-based methods in incomplete MVC scenarios. Our approach mitigates bias from corrupted observed data by first applying coarse imputation, followed by mapping the data to a latent space. In this latent space, we progressively learn an evidence distribution aligned with the target domain, incorporating uncertainty considerations through EDL. Additionally, we introduce a conflict-aware Dempster-Shafer combination rule (DSCR) to better handle conflicting evidence. By sampling from the learned distribution, we optimize the latent representations of missing views, reducing bias and enhancing decision-making robustness. Extensive experiments demonstrate that APLN, combined with DSCR, significantly outperforms traditional methods, particularly in environments characterized by high uncertainty and conflicting evidence, establishing it as a promising solution for incomplete multi-view classification.
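
To make the "conflicting evidence" issue in the abstract above concrete, here is a minimal sketch of the classical Dempster combination rule for two mass functions. It shows how mass assigned to incompatible hypotheses is discarded and the remainder renormalized. This is the standard textbook rule, not the paper's conflict-aware DSCR variant, whose details the abstract does not give. The function name and the toy masses are illustrative assumptions.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset -> mass).

    Returns the fused mass function and the conflict mass K, i.e. the total
    product mass that fell on empty intersections.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    normalizer = 1.0 - conflict
    return {k: v / normalizer for k, v in combined.items()}, conflict


# Two views that disagree: view 1 leans towards class "a", view 2 towards "b".
A, B, AB = frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})
view1 = {A: 0.6, AB: 0.4}
view2 = {B: 0.7, AB: 0.3}
fused, k = dempster_combine(view1, view2)
print(f"conflict K = {k:.2f}")
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

Here 42% of the joint mass is conflicting and is simply renormalized away, which is the behavior that can make fused decisions unreliable when views disagree strongly.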

[AI-45] Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking

Link: https://arxiv.org/abs/2409.06263
Authors: Jihyun Lee,Solee Im,Wonjun Lee,Gary Geunbae Lee
Keywords: Dialogue State Tracking, State Tracking, identifying important information, Automatic Speech Recognition, task-oriented dialogue systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

[AI-46] DiPT: Enhancing LLM reasoning through diversified perspective-taking

Link: https://arxiv.org/abs/2409.06241
Authors: Hoang Anh Just,Mahavir Dabas,Lifu Huang,Ming Jin,Ruoxi Jia
Keywords: improving language model, reasoning typically explores, prone to errors, work on improving, improving language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: LLM Reasoning with Perspectives, Preprint

Abstract:Existing work on improving language model reasoning typically explores a single solution path, which can be prone to errors. Inspired by perspective-taking in social studies, this paper introduces DiPT, a novel approach that complements current reasoning methods by explicitly incorporating diversified viewpoints. This approach allows the model to gain a deeper understanding of the problem’s context and identify the most effective solution path during the inference stage. Additionally, it provides a general data-centric AI recipe for augmenting existing data to improve their quality for fine-tuning. Our empirical results demonstrate that DiPT can be flexibly integrated into existing methods that focus on a single reasoning approach, enhancing their reasoning performance and stability when presented with paraphrased problems. Furthermore, we illustrate improved context understanding by maintaining the model’s safe outputs against “jailbreaking” prompts intentionally designed to bypass safeguards built into deployed models. Lastly, we show that fine-tuning with data enriched with diverse perspectives can boost the reasoning capabilities of the model compared to fine-tuning with raw data alone.

[AI-47] CerviXpert: A Multi-Structural Convolutional Neural Network for Predicting Cervix Type and Cervical Cell Abnormalities

Link: https://arxiv.org/abs/2409.06220
Authors: Rashik Shahriar Akash,Radiful Islam,S.M. Saiful Islam Badhon,K. S. M. Tozammel Hossain
Keywords: significantly higher survival, higher survival rate, cancer affects millions, diagnosed early, affects millions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Cervical cancer affects millions of women worldwide and has a significantly higher survival rate when diagnosed early. Pap smears and cervical biopsies are vital screening tools for detecting such cancer. However, the success of these screening processes depends on the skills of cytologists. A recent trend in diagnostic cytology is to apply machine-learning-based models to classify cancer using cell images. These automated models have been shown to perform just as well as, or even better than, expert cytologists. Some notable methods for classifying cervix cancers include ResNet50, VGG16, MobileNetV2, and InceptionV3, based on deep convolutional neural networks (CNN). However, these methods are computationally expensive. We present CerviXpert, a multi-structural Convolutional Neural Network, to identify cervix cancer. We perform extensive experiments on a publicly available dataset, SiPaKMeD, to show the efficacy of our method. CerviXpert presents a promising solution for efficient cervical cancer screening and diagnosis by striking a balance between accuracy and practical feasibility.

[AI-48] Towards Generalizable Scene Change Detection

Link: https://arxiv.org/abs/2409.06214
Authors: Jaewoo Kim,Uehwan Kim
Keywords: Scene Change Detection, Generalizable Scene Change, Scene Change, Change Detection, Change Detection Framework
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures

Abstract:Scene Change Detection (SCD) is vital for applications such as visual surveillance and mobile robotics. However, current SCD methods exhibit a bias to the temporal order of training datasets and limited performance on unseen domains; conventional SCD benchmarks are not able to evaluate generalization or temporal consistency. To tackle these limitations, we introduce a Generalizable Scene Change Detection Framework (GeSCF) in this work. The proposed GeSCF leverages localized semantics of a foundation model without any re-training or fine-tuning, for generalization over unseen domains. Specifically, we design an adaptive thresholding of the similarity distribution derived from facets of the pre-trained foundation model to generate an initial pseudo-change mask. We further utilize the Segment Anything Model's (SAM) class-agnostic masks to refine pseudo-masks. Moreover, our proposed framework maintains commutative operations in all settings to ensure complete temporal consistency. Finally, we define new metrics, an evaluation dataset, and an evaluation protocol for Generalizable Scene Change Detection (GeSCD). Extensive experiments demonstrate that GeSCF excels across diverse and challenging environments, establishing a new benchmark for SCD performance.

[AI-49] Adaptive Transformer Modelling of Density Function for Nonparametric Survival Analysis

Link: https://arxiv.org/abs/2409.06209
Authors: Xin Zhang,Deval Mehta,Yanan Hu,Chao Zhu,David Darby,Zhen Yu,Daniel Merlo,Melissa Gresle,Anneke Van Der Walt,Helmut Butzkueven,Zongyuan Ge
Keywords: Survival analysis holds, engineering and healthcare, diverse disciplines, analysis holds, holds a crucial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Survival analysis holds a crucial role across diverse disciplines, such as economics, engineering and healthcare. It empowers researchers to analyze both time-invariant and time-varying data, encompassing phenomena like customer churn, material degradation and various medical outcomes. Given the complexity and heterogeneity of such data, recent endeavors have demonstrated successful integration of deep learning methodologies to address limitations in conventional statistical approaches. However, current methods typically involve cluttered probability distribution functions (PDFs), have lower sensitivity in censoring prediction, only model static datasets, or rely only on recurrent neural networks for dynamic modelling. In this paper, we propose UniSurv, a novel survival regression method capable of producing high-quality unimodal PDFs without any prior distribution assumption, by optimizing a novel Margin-Mean-Variance loss and leveraging the flexibility of the Transformer to handle both temporal and non-temporal data. Extensive experiments on several datasets demonstrate that UniSurv places a significantly higher emphasis on censoring compared to other methods.

[AI-50] NOVI: Chatbot System for University Novice with BERT and LLMs

Link: https://arxiv.org/abs/2409.06192
Authors: Yoonji Nam,TaeWoong Seo,Gyeongcheol Shin,Sangji Lee,JaeEun Im
Keywords: chatbot system based, mitigate the difficulties, university life, chatbot system, system based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

[AI-51] Can Large Language Models Unlock Novel Scientific Research Ideas?

Link: https://arxiv.org/abs/2409.06185
Authors: Sandeep Kumar,Tirthankar Ghosal,Vinayak Goyal,Asif Ekbal
Keywords: future research ideas, research ideas, Artificial Intelligence, Large Language Models, future research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 24 pages, 12 figures, 6 tables

[AI-52] Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks

Link: https://arxiv.org/abs/2409.06173
Authors: Georgios Chochlakis,Niyantha Maruthu Pandiyan,Kristina Lerman,Shrikanth Narayanan
Keywords: Large Language Models, performing natural language, natural language tasks, Large Language, gradient-based methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, 1 table

[AI-53] MCDGLN: Masked Connection-based Dynamic Graph Learning Network for Autism Spectrum Disorder

Link: https://arxiv.org/abs/2409.06163
Authors: Peng Wang,Xin Wen,Ruochen Cao,Chengxin Gao,Yanrong Hao,Rui Cao
Keywords: complex physiological processes, Autism Spectrum Disorder, neurodevelopmental disorder characterized, Autism Spectrum, physiological processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures

Abstract:Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by complex physiological processes. Previous research has predominantly focused on static cerebral interactions, often neglecting the brain’s dynamic nature and the challenges posed by network noise. To address these gaps, we introduce the Masked Connection-based Dynamic Graph Learning Network (MCDGLN). Our approach first segments BOLD signals using sliding temporal windows to capture dynamic brain characteristics. We then employ a specialized weighted edge aggregation (WEA) module, which uses cross convolution with a channel-wise, element-wise convolutional kernel to integrate dynamic functional connectivity and isolate task-relevant connections. This is followed by topological feature extraction via a hierarchical graph convolutional network (HGCN), with key attributes highlighted by a self-attention module. Crucially, we refine static functional connections using a customized task-specific mask, reducing noise and pruning irrelevant links. The attention-based connection encoder (ACE) then enhances critical connections and compresses static features. The combined features are subsequently used for classification. Applied to the Autism Brain Imaging Data Exchange I (ABIDE I) dataset, our framework achieves a 73.3% classification accuracy between ASD and Typical Control (TC) groups among 1,035 subjects. The pivotal roles of WEA and ACE in refining connectivity and enhancing classification accuracy underscore their importance in capturing ASD-specific features, offering new insights into the disorder.
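
The first step described in the abstract above, segmenting BOLD signals with sliding temporal windows, is commonly paired with a per-window correlation matrix to obtain dynamic functional connectivity. The sketch below illustrates that generic preprocessing idea only. The function names, the use of Pearson correlation, the window width, the stride, and the toy signals are assumptions, not details confirmed by the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)  # would divide by zero on a constant segment


def sliding_windows(length, width, stride):
    """Start/end index pairs of every full window that fits in the series."""
    return [(s, s + width) for s in range(0, length - width + 1, stride)]


def dynamic_connectivity(signals, width, stride):
    """One region-by-region correlation matrix per temporal window.

    signals: list of regions, each a list of time samples of equal length.
    """
    matrices = []
    for start, end in sliding_windows(len(signals[0]), width, stride):
        segment = [sig[start:end] for sig in signals]
        matrices.append([[pearson(a, b) for b in segment] for a in segment])
    return matrices


# Toy signals for three hypothetical brain regions, 8 time points each.
toy = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 2, 4, 6, 8, 10, 12, 14],
    [7, 6, 5, 4, 3, 2, 1, 0],
]
windows = dynamic_connectivity(toy, width=4, stride=2)
print(len(windows), [round(v, 2) for v in windows[0][0]])
```

Each matrix in the returned list would play the role of one dynamic connectivity snapshot that downstream modules can aggregate.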

[AI-54] Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Link: https://arxiv.org/abs/2409.06135
Authors: Qi Yang,Binjie Mao,Zili Wang,Xing Nie,Pengfei Gao,Ying Guo,Cheng Zhen,Pengfei Yan,Shiming Xiang
Keywords: daily sound effects, auditory experience, term commonly, addition of daily, effects to silent
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 14 pages, 11 figures

Abstract:Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges encompass maintaining the content consistency between the input video and the generated audio, as well as the alignment of temporal and loudness properties within the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and target video, we introduce the Mask-Attention Module (MAM), which employs masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure the synthesis of sound that aligns with the video in both loudness and temporal dimensions. Furthermore, we have extended a large-scale V2A dataset, named VGGSound-Caption, by annotating caption prompts. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: this https URL.

[AI-55] Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

Link: https://arxiv.org/abs/2409.06131
Authors: Neha Prakriya,Jui-Nan Yen,Cho-Jui Hsieh,Jason Cong
Keywords: Large Language Model, randomly sampled data, pretraining traditionally relies, Large Language, sampled data blocks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

[AI-56] On the Weaknesses of Backdoor-based Model Watermarking: An Information-theoretic Perspective

Link: https://arxiv.org/abs/2409.06130
Authors: Aoting Hu,Yanzhi Chen,Renjie Xie,Adrian Weller
Keywords: Safeguarding the intellectual, machine learning models, machine learning, intellectual property, pressing concern
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Safeguarding the intellectual property of machine learning models has emerged as a pressing concern in AI security. Model watermarking is a powerful technique for protecting ownership of machine learning models, yet its reliability has been challenged by recent watermark removal attacks. In this work, we investigate why existing watermark embedding techniques, particularly those based on backdooring, are vulnerable. Through an information-theoretic analysis, we show that the resilience of watermarking against erasure attacks hinges on the choice of trigger-set samples, where current uses of out-of-distribution trigger sets are inherently vulnerable to white-box adversaries. Based on this discovery, we propose a novel model watermarking scheme, In-distribution Watermark Embedding (IWE), to overcome the limitations of existing methods. To further minimise the gap to clean models, we analyze the role of logits as watermark information carriers and propose a new approach to better conceal watermark information within the logits. Experiments on real-world datasets including CIFAR-100 and Caltech-101 demonstrate that our method robustly defends against various adversaries with negligible accuracy loss (< 0.1%).

[AI-57] Case Study: Leveraging GenAI to Build AI-based Surrogates and Regressors for Modeling Radio Frequency Heating in Fusion Energy Science

Link: https://arxiv.org/abs/2409.06122
Authors: E. Wes Bethel,Vianna Cramer,Alexander del Rio,Lothar Narins,Chris Pestano,Satvik Verma,Erick Arias,Nicola Bertelli,Talita Perciano,Syun’ichi Shiraiwa,Álvaro Sánchez Villar,Greg Wallace,John C. Wright
Keywords: fusion energy research, detailed case study, energy research, work presents, presents a detailed
Categories: Artificial Intelligence (cs.AI)

Abstract:This work presents a detailed case study on using Generative AI (GenAI) to develop AI surrogates for simulation models in fusion energy research. The scope includes the methodology, implementation, and results of using GenAI to assist in model development and optimization, comparing these results with previous manually developed models.

[AI-58] PaRCE: Probabilistic and Reconstruction-Based Competency Estimation for Safe Navigation Under Perception Uncertainty

Link: https://arxiv.org/abs/2409.06111
Authors: Sara Pohland,Claire Tomlin
Keywords: Perception-based navigation systems, unmanned ground vehicle, traditional depth-based navigation, Perception-based navigation, complex terrains
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Abstract:Perception-based navigation systems are useful for unmanned ground vehicle (UGV) navigation in complex terrains, where traditional depth-based navigation schemes are insufficient. However, these data-driven methods are highly dependent on their training data and can fail in surprising and dramatic ways with little warning. To ensure the safety of the vehicle and the surrounding environment, it is imperative that the navigation system is able to recognize the predictive uncertainty of the perception model and respond safely and effectively in the face of uncertainty. In an effort to enable safe navigation under perception uncertainty, we develop a probabilistic and reconstruction-based competency estimation (PaRCE) method to estimate the model’s level of familiarity with an input image as a whole and with specific regions in the image. We find that the overall competency score accurately predicts whether a sample is correctly classified, misclassified, or out-of-distribution (OOD). We also confirm that the regional competency maps can accurately distinguish between familiar and unfamiliar regions across images. We then use this competency information to develop a planning and control scheme that enables effective navigation while maintaining a low probability of error. We find that the competency-aware scheme greatly reduces the number of collisions with unfamiliar obstacles, compared to a baseline controller with no competency awareness. Furthermore, the regional competency information is very valuable in enabling efficient navigation.

[AI-59] Doppelgängers Watch: A Split Objective Approach to Large Language Models

Link: https://arxiv.org/abs/2409.06107
Authors: Shervin Ghasemlou, Ashish Katiyar, Aparajita Saraf, Seungwhan Moon, Mangesh Pujari, Pinar Donmez, Babak Damavandi, Anuj Kumar
Keywords: large language models, separate supervision signals, underlying language model, core capability, investigate the problem
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[AI-60] Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

Link: https://arxiv.org/abs/2409.06096
Authors: Michele Mancusi, Yurii Halychansky, Kin Wai Cheuk, Chieh-Hsin Lai, Stefan Uhlich, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Yuhki Mitsufuji
Keywords: Music timbre transfer, Gaussian prior, Music timbre, Gaussian Flow Bridges, melodic structure
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a Gaussian prior. During inference, a model is designated as the source model to map the input audio to its corresponding Gaussian prior, and another model is designated as the target model to reconstruct the target audio from this Gaussian prior, thereby facilitating timbre transfer. We compare our approach against existing unsupervised timbre transfer models such as VAEGAN and Gaussian Flow Bridges (GFB). Experimental results demonstrate that our method achieves both better Fréchet Audio Distance (FAD) and melody preservation, as reflected by lower pitch distances (DPD) compared to VAEGAN and GFB. Additionally, we discover that the noise level from the Gaussian prior, $\sigma$, can be adjusted to control the degree of melody preservation and amount of timbre transferred.

[AI-61] Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity

Link: https://arxiv.org/abs/2409.06091
Authors: Dongyue Li, Aneesh Sharma, Hongyang R. Zhang
Keywords: Multitask learning, task, widely used paradigm, applications ranging, task affinity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: 16 pages

Abstract:Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm Grad-TAG that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a “base” model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and also empirically verify that on several large models. Then, given the estimated task affinity, we design a semi-definite program for clustering similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG’s performance across seven datasets, including multi-label classification on graphs, and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% distance to the true affinities while needing only 3% of FLOPs in full training. On our largest graph with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% distance to the true affinities, using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches. 
Cite as: arXiv:2409.06091 [cs.LG]. Related DOI: https://doi.org/10.1145/3637528.3671835

[AI-62] MTLSO: A Multi-Task Learning Approach for Logic Synthesis Optimization

Link: https://arxiv.org/abs/2409.06077
Authors: Faezeh Faez, Raika Karimi, Yingxue Zhang, Xing Li, Lei Chen, Mingxuan Yuan, Mahdi Biparva
Keywords: Electronic Design Automation, Electronic Design, Design Automation, key EDA stage, recently benefited
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Electronic Design Automation (EDA) is essential for IC design and has recently benefited from AI-based techniques to improve efficiency. Logic synthesis, a key EDA stage, transforms high-level hardware descriptions into optimized netlists. Recent research has employed machine learning to predict Quality of Results (QoR) for pairs of And-Inverter Graphs (AIGs) and synthesis recipes. However, the severe scarcity of data due to a very limited number of available AIGs results in overfitting, significantly hindering performance. Additionally, the complexity and large number of nodes in AIGs make plain GNNs less effective for learning expressive graph-level representations. To tackle these challenges, we propose MTLSO - a Multi-Task Learning approach for Logic Synthesis Optimization. On one hand, it maximizes the use of limited data by training the model across different tasks. This includes introducing an auxiliary task of binary multi-label graph classification alongside the primary regression task, allowing the model to benefit from diverse supervision sources. On the other hand, we employ a hierarchical graph representation learning strategy to improve the model’s capacity for learning expressive graph-level representations of large AIGs, surpassing traditional plain GNNs. Extensive experiments across multiple datasets and against state-of-the-art baselines demonstrate the superiority of our method, achieving an average performance gain of 8.22% for delay and 5.95% for area.
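
The abstract above pairs a primary regression task (QoR prediction) with an auxiliary binary multi-label classification task. A minimal sketch of how such a joint objective could be combined is given below. The function names, the use of mean squared error and binary cross-entropy, and the `aux_weight` parameter are all assumptions made for illustration; the abstract does not state the actual loss terms or weighting used in MTLSO.

```python
import math

def mse(preds, targets):
    """Mean squared error for the (assumed) primary regression task."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over all labels of the (assumed) auxiliary task."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def multitask_loss(qor_pred, qor_true, label_probs, labels, aux_weight=0.5):
    """Weighted sum of the regression loss and the auxiliary classification loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return mse(qor_pred, qor_true) + aux_weight * binary_cross_entropy(label_probs, labels)

# Example: two QoR predictions and three auxiliary binary labels for one graph.
loss = multitask_loss([0.9, 1.2], [1.0, 1.0], [0.8, 0.3, 0.6], [1, 0, 1])
print(loss)
```

Sharing one encoder across both terms is what lets the auxiliary labels act as extra supervision when the regression data is scarce; the weighting between the two terms would normally be tuned.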

[AI-63] Privacy-Preserving Data Linkage Across Private and Public Datasets for Collaborative Agriculture Research

Link: https://arxiv.org/abs/2409.06069
Authors: Osama Zafar, Rosemarie Santa Gonzalez, Gabriel Wilkins, Alfonso Morales, Erman Ayday
Keywords: enhance crop yield, Digital agriculture leverages, disease resilience, crop yield, soil health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Digital agriculture leverages technology to enhance crop yield, disease resilience, and soil health, playing a critical role in agricultural research. However, it raises privacy concerns such as adverse pricing, price discrimination, higher insurance costs, and manipulation of resources, deterring farm operators from sharing data due to potential misuse. This study introduces a privacy-preserving framework that addresses these risks while allowing secure data sharing for digital agriculture. Our framework enables comprehensive data analysis while protecting privacy. It allows stakeholders to harness research-driven policies that link public and private datasets. The proposed algorithm achieves this by: (1) identifying similar farmers based on private datasets, (2) providing aggregate information like time and location, (3) determining trends in price and product availability, and (4) correlating trends with public policy data, such as food insecurity statistics. We validate the framework with real-world Farmer’s Market datasets, demonstrating its efficacy through machine learning models trained on linked privacy-preserved data. The results support policymakers and researchers in addressing food insecurity and pricing issues. This work significantly contributes to digital agriculture by providing a secure method for integrating and analyzing data, driving advancements in agricultural technology and development.

[AI-64] MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data

Link: https://arxiv.org/abs/2409.06067
Authors: Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
Keywords: multimodal large language, large language models, Assisted Federated Learning, Previous studies, federated learning
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. In light of the recent advances in multimodal large language models (MLLMs), such as GPT-4v and LLaVA, which demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-FL), which employs powerful MLLMs at the server end to address the heterogeneous and long-tailed challenges. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources. Hence, MLLM-FL not only enhances the performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. Initially, prior to local training on local datasets of clients, we conduct global visual-text pretraining of the model. This pretraining is facilitated by utilizing the extensive open-source data available online, with the assistance of multimodal large language models. Subsequently, the pretrained model is distributed among various clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance the performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in the typical scenarios with data heterogeneity and long-tail distribution across different clients in FL.

[AI-65] SongCreator: Lyrics-based Universal Song Generation

Link: https://arxiv.org/abs/2409.06029
Authors: Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng
Keywords: embodying human intelligence, human culture, embodying human, integral part, human intelligence
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: work in progress

Abstract:Music is an integral part of human culture, embodying human intelligence and creativity, of which songs compose an essential part. While various aspects of song generation have been explored by previous works, such as singing voice, vocal composition and instrumental arrangement, etc., generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the application of music generation models in the real world. In this light, we propose SongCreator, a song-generation system designed to tackle this challenge. The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and an additional attention mask strategy for DSLM, which allows our model to understand, generate and edit songs, making it suitable for various song-related generation tasks. Extensive experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks. Notably, it surpasses previous works by a large margin in lyrics-to-song and lyrics-to-vocals. Additionally, it is able to independently control the acoustic conditions of the vocals and accompaniment in the generated song through different prompts, exhibiting its potential applicability. Our samples are available at this https URL.

[AI-66] Deep Generative Model for Mechanical System Configuration Design

Link: https://arxiv.org/abs/2409.06016
Authors: Yasaman Etesam, Hyunmin Cheong, Mohammadmehdi Ataei, Pradeep Kumar Jayaraman
Keywords: made remarkable progress, made remarkable, remarkable progress, progress in addressing, design
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI has made remarkable progress in addressing various design challenges. One prominent area where generative AI could bring significant value is in engineering design. In particular, selecting an optimal set of components and their interfaces to create a mechanical system that meets design requirements is one of the most challenging and time-consuming tasks for engineers. This configuration design task is inherently challenging due to its categorical nature, multiple design requirements a solution must satisfy, and the reliance on physics simulations for evaluating potential solutions. These characteristics entail solving a combinatorial optimization problem with multiple constraints involving black-box functions. To address this challenge, we propose a deep generative model to predict the optimal combination of components and interfaces for a given design problem. To demonstrate our approach, we solve a gear train synthesis problem by first creating a synthetic dataset using a grammar, a parts catalogue, and a physics simulator. We then train a Transformer using this dataset, named GearFormer, which can not only generate quality solutions on its own, but also augment search methods such as an evolutionary algorithm and Monte Carlo tree search. We show that GearFormer outperforms such search methods on their own in terms of satisfying the specified design requirements with orders of magnitude faster generation time. Additionally, we showcase the benefit of hybrid methods that leverage both GearFormer and search methods, which further improve the quality of the solutions.

[AI-67] MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Link: https://arxiv.org/abs/2409.05994
Authors: Francisco Valentini, Viviana Cotik, Damián Furman, Ivan Bercovich, Edgar Altszyler, Juan Manuel Pérez
Keywords: user query, finding relevant documents, Spanish, task of finding, Information retrieval
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[AI-68] A Comprehensive Comparison Between ANNs and KANs For Classifying EEG Alzheimers Data

Link: https://arxiv.org/abs/2409.05989
Authors: Akshay Sunkara, Sriram Sattiraju, Aakarshan Kumar, Zaryab Kanjiani, Himesh Anumala
Keywords: incurable cognitive condition, Alzheimer Disease, Alzheimer, people globally, predicting Alzheimer Disease
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Abstract:Alzheimer’s Disease is an incurable cognitive condition that affects thousands of people globally. While some diagnostic methods exist for Alzheimer’s Disease, many of these methods cannot detect Alzheimer’s in its earlier stages. Recently, researchers have explored the use of Electroencephalogram (EEG) technology for diagnosing Alzheimer’s. EEG is a noninvasive method of recording the brain’s electrical signals, and EEG data has shown distinct differences between patients with and without Alzheimer’s. In the past, Artificial Neural Networks (ANNs) have been used to predict Alzheimer’s from EEG data, but these models sometimes produce false positive diagnoses. This study aims to compare losses between ANNs and Kolmogorov-Arnold Networks (KANs) across multiple types of epochs, learning rates, and nodes. The results show that across these different parameters, ANNs are more accurate in predicting Alzheimer’s Disease from EEG signals.

[AI-69] Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models

Link: https://arxiv.org/abs/2409.05929
Authors: Hongyang Lei, Xiaolong Cheng, Dan Wang, Qi Qin, Huazhen Huang, Yetao Wu, Qingqing Gu, Zhonglin Jiang, Yong Chen, Luo Ji
Keywords: Recent Large Multi-Modal, Recent Large, made significant advancements, Large Multi-Modal Models, Large Multi-Modal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: work in progress

Abstract:Recent Large Multi-Modal Models (LMMs) have made significant advancements in multi-modal alignment by employing lightweight connection modules to facilitate the representation and fusion of knowledge from existing pre-trained uni-modal models. However, these methods still rely on modality-specific and direction-specific connectors, leading to compartmentalized knowledge representations and reduced computational efficiency, which limits the model’s ability to form unified multi-modal representations. To address these issues, we introduce a novel training framework, Alt-MoE, which employs the Mixture of Experts (MoE) as a unified multi-directional connector across modalities, and employs a multi-step sequential alternating unidirectional alignment strategy, which converges to bidirectional alignment over iterations. The extensive empirical studies revealed the following key points: 1) Alt-MoE achieves competitive results by integrating diverse knowledge representations from uni-modal models. This approach seamlessly fuses the specialized expertise of existing high-performance uni-modal models, effectively synthesizing their domain-specific knowledge into a cohesive multi-modal representation. 2) Alt-MoE efficiently scales to new tasks and modalities without altering its model architecture or training strategy. Furthermore, Alt-MoE operates in latent space, supporting vector pre-storage and real-time retrieval via lightweight multi-directional MoE, thereby facilitating massive data processing. Our methodology has been validated on several well-performing uni-modal models (LLAMA3, Qwen2, and DINOv2), achieving competitive results on a wide range of downstream tasks and datasets.

[AI-70] Assessing SPARQL capabilities of Large Language Models

Link: https://arxiv.org/abs/2409.05925
Authors: Lars-Peter Meyer, Johannes Frey, Felix Brei, Natanael Arndt
Keywords: Large Language Models, offers significant synergistic, significant synergistic potential, SPARQL SELECT queries, Large Language
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: peer reviewed publication at NLP4KGc @ Semantics 2024, see this https URL

[AI-71] USCD: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding

Link: https://arxiv.org/abs/2409.05923
Authors: Shuai Wang, Liang Ding, Li Shen, Yong Luo, Zheng He, Wei Yu, Dacheng Tao
Keywords: Large language models, shown remarkable capabilities, Large language, output noise, language models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures

Abstract:Large language models (LLMs) have shown remarkable capabilities in code generation. However, the effects of hallucinations (e.g., output noise) make it particularly challenging for LLMs to generate high-quality code in one pass. In this work, we propose a simple and effective uncertainty-aware selective contrastive decoding (USCD) mechanism to improve the quality of one-pass code generation in LLMs and reduce the impact of output noise. Specifically, we first carefully design a negative prompt (the "lame prompt") that elicits output noise by removing the input-output examples from the standard few-shot prompt. Our preliminary study shows that the Jensen-Shannon divergence (JS divergence) between token distribution uncertainty and the output noise is relatively low (approximately 0.25), indicating their high relevance. Then, we selectively eliminate output noise induced by lame prompts based on the uncertainty of the prediction distribution from the standard prompt. Notably, our proposed plug-and-play mechanism is an inference-only method, enjoying appealing flexibility. Extensive experiments on widely used benchmarks, e.g., HumanEval, MBPP, and MultiPL-E, upon several LLMs (i.e., Incoder-6b, CodeLlama-7b, WizardCoder-15b, StarCoder, and Llama2-7b), demonstrate that our proposed USCD significantly improves one-pass code generation, with an average pass@1 score increase of 16.59%. We will release code and data on GitHub.
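
The abstract above relies on the Jensen-Shannon divergence as its measure of closeness between two distributions. As a reminder of what that quantity is, here is a minimal sketch of the standard definition for discrete distributions. The natural logarithm is an assumption (the abstract does not say which base its value of about 0.25 uses), and the example distributions are made up for illustration; this is not the authors' evaluation code.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions over a 3-token vocabulary.
standard_prompt = [0.7, 0.2, 0.1]
lame_prompt = [0.4, 0.4, 0.2]
print(js_divergence(standard_prompt, lame_prompt))
```

With the natural logarithm the value lies between 0 (identical distributions) and ln 2 (disjoint support), and unlike KL divergence it is symmetric in its arguments.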

[AI-72] STLLM-DF: A Spatial-Temporal Large Language Model with Diffusion for Enhanced Multi-Mode Traffic System Forecasting

Link: https://arxiv.org/abs/2409.05921
Authors: Zhiqi Shao, Haoning Xi, Haohui Lu, Ze Wang, Michael G.H. Bell, Junbin Gao
Keywords: Intelligent Transportation Systems, advancement of Intelligent, handling diverse sequential, Intelligent Transportation, Large Language Model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 11 figures

Abstract:The rapid advancement of Intelligent Transportation Systems (ITS) presents challenges, particularly with missing data in multi-modal transportation and the complexity of handling diverse sequential tasks within a centralized framework. To address these issues, we propose the Spatial-Temporal Large Language Model Diffusion (STLLM-DF), an innovative model that leverages Denoising Diffusion Probabilistic Models (DDPMs) and Large Language Models (LLMs) to improve multi-task transportation prediction. The DDPM’s robust denoising capabilities enable it to recover underlying data patterns from noisy inputs, making it particularly effective in complex transportation systems. Meanwhile, the non-pretrained LLM dynamically adapts to spatial-temporal relationships within multi-modal networks, allowing the system to efficiently manage diverse transportation tasks in both long-term and short-term predictions. Extensive experiments demonstrate that STLLM-DF consistently outperforms existing models, achieving an average reduction of 2.40% in MAE, 4.50% in RMSE, and 1.51% in MAPE. This model significantly advances centralized ITS by enhancing predictive accuracy, robustness, and overall system performance across multiple tasks, thus paving the way for more effective spatio-temporal traffic forecasting through the integration of frozen transformer language models and diffusion techniques.
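
The reductions quoted in the abstract above are stated in terms of MAE, RMSE, and MAPE. For readers less familiar with these metrics, the following sketch gives their standard definitions. The sample values are invented for illustration, and the MAPE helper assumes no true value is zero.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes nonzero true values)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical traffic counts and forecasts.
observed = [100.0, 200.0]
forecast = [110.0, 180.0]
print(mae(observed, forecast), rmse(observed, forecast), mape(observed, forecast))
```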

[AI-73] KModels: Unlocking AI for Business Applications

Link: https://arxiv.org/abs/2409.05919
Authors: Roy Abitbol (1), Eyal Cohen (1), Muhammad Kanaan (1), Bhavna Agrawal (2), Yingjie Li (2), Anuradha Bhamidipaty (2), Erez Bilgory (1) ((1) IBM Research Israel, (2) IBM Research USA)
Keywords: continues to rapidly, rapidly advance, growing demand, demand to integrate, integrate AI capabilities
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:As artificial intelligence (AI) continues to rapidly advance, there is a growing demand to integrate AI capabilities into existing business applications. However, a significant gap exists between the rapid progress in AI and how slowly AI is being embedded into business environments. Deploying well-performing lab models into production settings, especially in on-premise environments, often entails specialized expertise and imposes a heavy burden of model management, creating significant barriers to implementing AI models in real-world applications. KModels leverages proven libraries and platforms (Kubeflow Pipelines, KServe) to streamline AI adoption by supporting both AI developers and consumers. It allows model developers to focus solely on model development and share models as transportable units (Templates), abstracting away complex production deployment concerns. KModels enables AI consumers to eliminate the need for a dedicated data scientist, as the templates encapsulate most data science considerations while providing business-oriented control. This paper presents the architecture of KModels and the key decisions that shape it. We outline KModels’ main components as well as its interfaces. Furthermore, we explain how KModels is highly suited for on-premise deployment but can also be used in cloud environments. The efficacy of KModels is demonstrated through the successful deployment of three AI models within an existing Work Order Management system. These models operate in a client’s data center and are trained on local data, without data scientist intervention. One model improved the accuracy of Failure Code specification for work orders from 46% to 83%, showcasing the substantial benefit of accessible and localized AI solutions. 
Cite as: arXiv:2409.05919 [cs.SE]. Submitted (v1): Sun, 8 Sep 2024 13:19:12 UTC.

[AI-74] Programming Refusal with Conditional Activation Steering

Link: https://arxiv.org/abs/2409.05907
Authors: Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar
Keywords: shown remarkable capabilities, behavior remains challenging, remarkable capabilities, remains challenging, activation steering
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[AI-75] Simplex-enabled Safe Continual Learning Machine

Link: https://arxiv.org/abs/2409.05898
Authors: Yihao Cai, Hongpeng Cao, Yanbing Mao, Lui Sha, Marco Caccamo
Keywords: SeC-Learning Machine, Simplex-enabled safe continual, safety-critical autonomous systems, Simplex-enabled safe, paper proposes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:This paper proposes the SeC-Learning Machine: Simplex-enabled safe continual learning for safety-critical autonomous systems. The SeC-learning machine is built on Simplex logic (that is, "using simplicity to control complexity") and physics-regulated deep reinforcement learning (Phy-DRL). The SeC-learning machine thus comprises three components: the HP (high performance)-Student, the HA (high assurance)-Teacher, and the Coordinator. Specifically, the HP-Student is a pre-trained high-performance but not fully verified Phy-DRL, continuing to learn in a real plant to tune the action policy to be safe. In contrast, the HA-Teacher is a mission-reduced, physics-model-based, and verified design. As a complement, the HA-Teacher has two missions: backing up safety and correcting unsafe learning. The Coordinator triggers the interaction and the switch between HP-Student and HA-Teacher. Powered by the three interactive components, the SeC-learning machine can i) assure lifetime safety (i.e., safety guarantee in any continual-learning stage, regardless of HP-Student’s success or convergence), ii) address the Sim2Real gap, and iii) learn to tolerate unknown unknowns in real plants. The experiments on a cart-pole system and a real quadruped robot demonstrate the distinguished features of the SeC-learning machine, compared with continual learning built on state-of-the-art safe DRL frameworks with approaches to addressing the Sim2Real gap.
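
The abstract above describes a Coordinator that switches control between a high-performance learner and a verified fallback. The sketch below illustrates the generic Simplex-style switching idea only. The scalar state, the `safe_bound` threshold, and the function name are hypothetical; the abstract does not specify the paper's actual triggering condition or safety envelope.

```python
def coordinator(state, student_action, teacher_action, safe_bound):
    """Pick which controller's action to apply.

    Generic Simplex-style rule (an assumption, not the paper's condition):
    trust the high-performance student while the state stays inside a
    safety envelope, otherwise hand control to the verified teacher.
    """
    if abs(state) <= safe_bound:
        return "student", student_action
    return "teacher", teacher_action

# Hypothetical pole angle in radians with a 0.5 rad safety envelope.
print(coordinator(0.1, student_action=1.0, teacher_action=-1.0, safe_bound=0.5))
print(coordinator(0.9, student_action=1.0, teacher_action=-1.0, safe_bound=0.5))
```

The point of the pattern is that safety depends only on the simple, verified controller and the switching rule, so the learner can keep improving without having to be verified itself.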

[AI-76] MA-CDMR: An Intelligent Cross-domain Multicast Routing Method based on Multiagent Deep Reinforcement Learning in Multi-domain SDWN

Link: https://arxiv.org/abs/2409.05888
Authors: Miao Ye, Hongwen Hu, Xiaoli Wang, Yuping Wang, Yong Wang, Wen Peng, Jihao Zheng
Keywords: cross-domain multicast routing, NP-hard optimization problem, cross-domain multicast, classic NP-hard optimization, multicast routing problem
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:The cross-domain multicast routing problem in a software-defined wireless network with multiple controllers is a classic NP-hard optimization problem. As the network size increases, designing and implementing cross-domain multicast routing paths in the network requires not only designing efficient solution algorithms to obtain the optimal cross-domain multicast tree but also ensuring the timely and flexible acquisition and maintenance of global network state information. However, existing solutions have a limited ability to sense the network traffic state, affecting the quality of service of multicast services. In addition, these methods have difficulty adapting to the highly dynamically changing network states and have slow convergence speeds. To this end, this paper aims to design and implement a multiagent deep reinforcement learning based cross-domain multicast routing method for SDWN with multicontroller domains. First, a multicontroller communication mechanism and a multicast group management module are designed to transfer and synchronize network information between different control domains of the SDWN, thus effectively managing the joining and classification of members in the cross-domain multicast group. Second, a theoretical analysis and proof show that the optimal cross-domain multicast tree includes an interdomain multicast tree and an intradomain multicast tree. An agent is established for each controller, and a cooperation mechanism between multiple agents is designed to effectively optimize cross-domain multicast routing and ensure consistency and validity in the representation of network state information for cross-domain multicast routing decisions. Third, a multiagent reinforcement learning-based method that combines online and offline training is designed to reduce the dependence on the real-time environment and increase the convergence speed of multiple agents.

[AI-77] COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes AAAI-25

Link: https://arxiv.org/abs/2409.04053
Authors: Koen Kraaijveld, Yifan Jiang, Kaixin Ma, Filip Ilievski
Keywords: reasoning techniques, catalyzed the development, development of reasoning, focused on vertical, VQA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures, submitted to AAAI-25

Abstract:While visual question-answering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem-solving also necessitates lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formulate visual lateral thinking as a multiple-choice question-answering task and describe a three-step taxonomy-driven methodology for instantiating task examples. Then, we develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compounds and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four answer candidates. While the SotA vision-language models (VLMs) achieve decent performance, our evaluation demonstrates a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to self-generate such representations at the right level of abstraction.

[AI-78] FairEvalLLM. A Comprehensive Framework for Benchmarking Fairness in Large Language Model Recommender Systems

Link: https://arxiv.org/abs/2405.02219
Authors: Yashar Deldjoo
Keywords: Large Language Models, Language Models, Large Language, powered by Large, fairness dimensions including
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:This paper presents a framework for evaluating fairness in recommender systems powered by Large Language Models (RecLLMs), addressing the need for a unified approach that spans various fairness dimensions including sensitivity to user attributes, intrinsic fairness, and discussions of fairness based on underlying benefits. In addition, our framework introduces counterfactual evaluations and integrates diverse user group considerations to enhance the discourse on fairness evaluation for RecLLMs. Our key contributions include the development of a robust framework for fairness evaluation in LLM-based recommendations and a structured method to create informative user profiles from demographic data, historical user preferences, and recent interactions. We argue that the latter is essential for enhancing personalization in such systems, especially in temporal-driven scenarios. We demonstrate the utility of our framework through practical applications on two datasets, LastFM-1K and ML-1M. We conduct experiments on a subsample of 80 users from each dataset, testing and assessing the effectiveness of various prompt construction scenarios and in-context learning, comprising more than 50 scenarios. This results in more than 4000 recommendations (80 * 50 = 4000). Our study reveals that while there are no significant unfairness issues in scenarios involving sensitive attributes, some concerns remain. However, in terms of intrinsic fairness, which does not involve direct sensitivity, unfairness across demographic groups remains significant. The code and data used for this paper are available at this https URL.

[AI-79] Generative User-Experience Research for Developing Domain-specific Natural Language Processing Applications

Link: https://arxiv.org/abs/2306.16143
Authors: Anastasia Zhukova,Lukas von Sperl,Christian E. Matt,Bela Gipp
Keywords: human-computer interaction, increasing intuitiveness, part of human-computer, NLP, focuses on increasing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

[AI-80] Towards Agentic AI on Particle Accelerators NEURIPS

Link: https://arxiv.org/abs/2409.06336
Authors: Antonin Sulc,Thorsten Hellert,Raimund Kammering,Hayden Houscher,Jason St. John
Keywords: achieving optimal performance, methods face increasing, face increasing challenges, Large Language Models, traditional control methods
Categories: Accelerator Physics (physics.acc-ph); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures, Machine Learning and the Physical Sciences at Workshop at the 38th conference on Neural Information Processing Systems (NeurIPS)

Abstract:As particle accelerators grow in complexity, traditional control methods face increasing challenges in achieving optimal performance. This paper envisions a paradigm shift: a decentralized multi-agent framework for accelerator control, powered by Large Language Models (LLMs) and distributed among autonomous agents. We present a proposition of a self-improving decentralized system where intelligent agents handle high-level tasks and communication and each agent is specialized to control individual accelerator components. This approach raises some questions: What are the future applications of AI in particle accelerators? How can we implement an autonomous complex system such as a particle accelerator where agents gradually improve through experience and human feedback? What are the implications of integrating a human-in-the-loop component for labeling operational data and providing expert guidance? We show two examples that demonstrate the viability of such an architecture.

[AI-81] Multiclass Arrhythmia Classification using Smartwatch Photoplethysmography Signals Collected in Real-life Settings

Link: https://arxiv.org/abs/2409.06147
Authors: Dong Han,Jihye Moon,Luís Roberto Mercado Díaz,Darren Chen,Devan Williams,Eric Y. Ding,Khanh-Van Tran,David D. McManus,Ki H. Chon
Keywords: Gated Recurrent Unit, multiclass arrhythmia classification, bi-directional Gated Recurrent, ventricular contraction, deep learning models
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)

Abstract:Most deep learning models of multiclass arrhythmia classification are tested on fingertip photoplethysmographic (PPG) data, which has higher signal-to-noise ratios compared to smartwatch-derived PPG, and the best reported sensitivity value for premature atrial/ventricular contraction (PAC/PVC) detection is only 75%. To improve upon PAC/PVC detection sensitivity while maintaining high AF detection, we use multi-modal data which incorporates 1D PPG, accelerometers, and heart rate data as the inputs to a computationally efficient 1D bi-directional Gated Recurrent Unit (1D-Bi-GRU) model to detect three arrhythmia classes. We used motion-artifact prone smartwatch PPG data from the NIH-funded Pulsewatch clinical trial. Our multimodal model tested on 72 subjects achieved an unprecedented 83% sensitivity for PAC/PVC detection while maintaining a high accuracy of 97.31% for AF detection. These results outperformed the best state-of-the-art model by 20.81% for PAC/PVC and 2.55% for AF detection even while our model was computationally more efficient (14 times lighter and 2.7 times faster).

[AI-82] DeepFM-Crispr: Prediction of CRISPR On-Target Effects via Deep Learning ICMLA

Link: https://arxiv.org/abs/2409.05938
Authors: Condy Bao,Fuxiao Liu
Keywords: groundbreaking gene-editing technology, enables precise genomic, precise genomic modifications, RNA guide sequence, short RNA guide
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, accepted to ICMLA 2024

Abstract:Since the advent of CRISPR-Cas9, a groundbreaking gene-editing technology that enables precise genomic modifications via a short RNA guide sequence, there has been a marked increase in the accessibility and application of this technology across various fields. The success of CRISPR-Cas9 has spurred further investment and led to the discovery of additional CRISPR systems, including CRISPR-Cas13. Distinct from Cas9, which targets DNA, Cas13 targets RNA, offering unique advantages for gene modulation. We focus on Cas13d, a variant known for its collateral activity where it non-specifically cleaves adjacent RNA molecules upon activation, a feature critical to its function. We introduce DeepFM-Crispr, a novel deep learning model developed to predict the on-target efficiency and evaluate the off-target effects of Cas13d. This model harnesses a large language model to generate comprehensive representations rich in evolutionary and structural data, thereby enhancing predictions of RNA secondary structures and overall sgRNA efficacy. A transformer-based architecture processes these inputs to produce a predictive efficacy score. Comparative experiments show that DeepFM-Crispr not only surpasses traditional models but also outperforms recent state-of-the-art deep learning methods in terms of prediction accuracy and reliability.

[AI-83] Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries

Link: https://arxiv.org/abs/2409.05916
Authors: Chunbin Gu,Mutian He,Hanqun Cao,Guangyong Chen,Chang-yu Hsieh,Pheng Ann Heng
Keywords: DNA-encoded library, DEL, DEL screening faces, technology has emerged, compound
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:In the realm of drug discovery, DNA-encoded library (DEL) screening technology has emerged as an efficient method for identifying high-affinity compounds. However, DEL screening faces a significant challenge: noise arising from nonspecific interactions within complex biological systems. Neural networks trained on DEL libraries have been employed to extract compound features, aiming to denoise the data and uncover potential binders to the desired therapeutic target. Nevertheless, the inherent structure of DEL, constrained by the limited diversity of building blocks, impacts the performance of compound encoders. Moreover, existing methods only capture compound features at a single level, further limiting the effectiveness of the denoising strategy. To mitigate these issues, we propose a Multimodal Pretraining DEL-Fusion model (MPDF) that enhances encoder capabilities through pretraining and integrates compound features across various scales. We develop pretraining tasks applying contrastive objectives between different compound representations and their text descriptions, enhancing the compound encoders’ ability to acquire generic features. Furthermore, we propose a novel DEL-fusion framework that amalgamates compound information at the atomic, submolecular, and molecular levels, as captured by various compound encoders. The synergy of these innovations equips MPDF with enriched, multi-scale features, enabling comprehensive downstream denoising. Evaluated on three DEL datasets, MPDF demonstrates superior performance in data processing and analysis for validation tasks. Notably, MPDF offers novel insights into identifying high-affinity molecules, paving the way for improved DEL utility in drug discovery.

[AI-84] Property Neurons in Self-Supervised Speech Transformers

Link: https://arxiv.org/abs/2409.05910
Authors: Tzu-Quan Lin,Guan-Ting Lin,Hung-yi Lee,Hao Tang
Keywords: analyzing self-supervised speech, self-supervised speech Transformers, layer-wise analysis, studies on analyzing, analyzing self-supervised
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by SLT 2024

Abstract:There have been many studies on analyzing self-supervised speech Transformers, in particular, with layer-wise analysis. It is, however, desirable to have an approach that can pinpoint exactly a subset of neurons that is responsible for a particular property of speech, being amenable to model pruning and model editing. In this work, we identify a set of property neurons in the feedforward layers of Transformers to study how speech-related properties, such as phones, gender, and pitch, are stored. When removing neurons of a particular property (a simple form of model editing), the respective downstream performance significantly degrades, showing the importance of the property neurons. We apply this approach to pruning the feedforward layers in Transformers, where most of the model parameters are. We show that protecting property neurons during pruning is significantly more effective than norm-based pruning.

Computer Vision

[CV-0] GeoCalib: Learning Single-image Calibration with Geometric Optimization ECCV2024

Link: https://arxiv.org/abs/2409.06704
Authors: Alexander Veicht,Paul-Edouard Sarlin,Philipp Lindenberger,Marc Pollefeys
Keywords: gravity direction, deduce intrinsic, intrinsic and extrinsic, focal length, single image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at ECCV 2024

Abstract:From a single image, visual cues can help deduce intrinsic and extrinsic camera parameters like the focal length and the gravity direction. This single-image calibration can benefit various downstream applications like image editing and 3D mapping. Current approaches to this problem are based on either classical geometry with lines and vanishing points or on deep neural networks trained end-to-end. The learned approaches are more robust but struggle to generalize to new environments and are less accurate than their classical counterparts. We hypothesize that they lack the constraints that 3D geometry provides. In this work, we introduce GeoCalib, a deep neural network that leverages universal rules of 3D geometry through an optimization process. GeoCalib is trained end-to-end to estimate camera parameters and learns to find useful visual cues from the data. Experiments on various benchmarks show that GeoCalib is more robust and more accurate than existing classical and learned approaches. Its internal optimization estimates uncertainties, which help flag failure cases and benefit downstream applications like visual localization. The code and trained models are publicly available at this https URL.

[CV-1] LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation ECCV2024

Link: https://arxiv.org/abs/2409.06703
Authors: Archana Swaminathan,Anubhav Gupta,Kamal Gupta,Shishira R. Maiya,Vatsal Agarwal,Abhinav Shrivastava
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, offering unprecedented quality, offering unprecedented
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024. Project Website at this https URL

Abstract:Neural Radiance Fields (NeRFs) have revolutionized the reconstruction of static scenes and objects in 3D, offering unprecedented quality. However, extending NeRFs to model dynamic objects or object articulations remains a challenging problem. Previous works have tackled this issue by focusing on part-level reconstruction and motion estimation for objects, but they often rely on heuristics regarding the number of moving parts or object categories, which can limit their practical use. In this work, we introduce LEIA, a novel approach for representing dynamic 3D objects. Our method involves observing the object at distinct time steps or “states” and conditioning a hypernetwork on the current state, using this to parameterize our NeRF. This approach allows us to learn a view-invariant latent representation for each state. We further demonstrate that by interpolating between these states, we can generate novel articulation configurations in 3D space that were previously unseen. Our experimental results highlight the effectiveness of our method in articulating objects in a manner that is independent of the viewing angle and joint configuration. Notably, our approach outperforms previous methods that rely on motion information for articulation registration.
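
The LEIA abstract says that novel articulation configurations come from interpolating between per-state latent representations. A minimal sketch of that single step, assuming plain linear interpolation between two latent vectors (the abstract does not specify the interpolation scheme, and the name `interpolate_states` and the example vectors are hypothetical):

```python
def interpolate_states(z_a, z_b, t):
    """Linearly blend two latent state vectors; t=0 gives z_a, t=1 gives z_b."""
    return [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]


# Hypothetical latents for a "closed" and an "open" state of an articulated object.
z_closed = [0.0, 2.0]
z_open = [1.0, 4.0]

# A latent for an unseen intermediate state, halfway between the two.
z_half = interpolate_states(z_closed, z_open, 0.5)
```

In the described method, such an intermediate latent would condition the hypernetwork that parameterizes the NeRF; that part is outside the scope of this sketch.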

[CV-2] Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving

[CV-3] GigaGS: Scaling up Planar-Based 3D Gaussians for Large Scene Surface Reconstruction

Link: https://arxiv.org/abs/2409.06685
Authors: Junyi Chen,Weicai Ye,Yifan Wang,Danpeng Chen,Di Huang,Wanli Ouyang,Guofeng Zhang,Yu Qiao,Tong He
Keywords: Gaussian Splatting, shown promising performance, view synthesis, shown promising, promising performance
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D Gaussian Splatting (3DGS) has shown promising performance in novel view synthesis. Previous methods adapt it to obtaining surfaces of either individual 3D objects or within limited scenes. In this paper, we make the first attempt to tackle the challenging task of large-scale scene surface reconstruction. This task is particularly difficult due to the high GPU memory consumption, different levels of details for geometric representation, and noticeable inconsistencies in appearance. To this end, we propose GigaGS, the first work for high-quality surface reconstruction for large-scale scenes using 3DGS. GigaGS first applies a partitioning strategy based on the mutual visibility of spatial regions, which effectively groups cameras for parallel processing. To enhance the quality of the surface, we also propose novel multi-view photometric and geometric consistency constraints based on Level-of-Detail representation. In doing so, our method can reconstruct detailed surface structures. Comprehensive experiments are conducted on various datasets. The consistent improvement demonstrates the superiority of GigaGS.

[CV-4] Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences ECCV2024

Link: https://arxiv.org/abs/2409.06683
Authors: Shishir Reddy Vutukur,Rasmus Laurvig Haugaard,Junwen Huang,Benjamin Busam,Tolga Birdal
Keywords: Object pose distribution, CAD model, symmetric objects, pose distribution estimation, Object pose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024

Abstract:Object pose distribution estimation is crucial in robotics for better path planning and handling of symmetric objects. Recent distribution estimation approaches employ contrastive learning-based approaches by maximizing the likelihood of a single pose estimate in the absence of a CAD model. We propose a pose distribution estimation method leveraging symmetry respecting correspondence distributions and shape information obtained using a CAD model. Contrastive learning-based approaches require an exhaustive amount of training images from different viewpoints to learn the distribution properly, which is not possible in realistic scenarios. Instead, we propose a pipeline that can leverage correspondence distributions and shape information from the CAD model, which are later used to learn pose distributions. Besides, having access to pose distribution based on correspondences before learning pose distributions conditioned on images, can help formulate the loss between distributions. The prior knowledge of distribution also helps the network to focus on getting sharper modes instead. With the CAD prior, our approach converges much faster and learns distribution better by focusing on learning sharper distribution near all the valid modes, unlike contrastive approaches, which focus on a single mode at a time. We achieve benchmark results on SYMSOL-I and T-Less datasets.

[CV-5] A Semantic Segmentation Approach on Sweet Orange Leaf Diseases Detection Utilizing YOLO

Link: https://arxiv.org/abs/2409.06671
Authors: Sabit Ahamed Preanto(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Md. Taimur Ahad(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Yousuf Rayhan Emon(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Sumaya Mustofa(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Md Alamin(4IR Research Cell Daffodil International University, Dhaka, Bangladesh)
Keywords: sweet orange leaves, sweet oranges encounter, oranges encounter significant, artificial intelligence models, encounter significant threats
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This research introduces an advanced method for diagnosing diseases in sweet orange leaves by utilising advanced artificial intelligence models like YOLOv8. Due to their significance as a vital agricultural product, sweet oranges encounter significant threats from a variety of diseases that harmfully affect both their yield and quality. Conventional methods for disease detection primarily depend on manual inspection which is ineffective and frequently leads to errors, resulting in delayed treatment and increased financial losses. In response to this challenge, the research utilized YOLOv8, harnessing their proficiencies in detecting objects and analyzing images. YOLOv8 is recognized for its rapid and precise performance, while VIT is acknowledged for its detailed feature extraction abilities. Impressively, during both the training and validation stages, YOLOv8 exhibited a perfect accuracy of 80.4%, while VIT achieved an accuracy of 99.12%, showcasing their potential to transform disease detection in agriculture. The study comprehensively examined the practical challenges related to the implementation of AI technologies in agriculture, encompassing the computational demands and user accessibility, and offering viable solutions for broader usage. Moreover, it underscores the environmental considerations, particularly the potential for reduced pesticide usage, thereby promoting sustainable farming and environmental conservation. These findings provide encouraging insights into the application of AI in agriculture, suggesting a transition towards more effective, sustainable, and technologically advanced farming methods. This research not only highlights the efficacy of YOLOv8 within a specific agricultural domain but also lays the foundation for further studies that encompass a broader application in crop management and sustainable agricultural practices.

[CV-6] Data Collection-free Masked Video Modeling ECCV2024

Link: https://arxiv.org/abs/2409.06665
Authors: Yuchi Ishikawa,Masayoshi Kondo,Yoshimitsu Aoki
Keywords: presenting significant challenges, transformers generally requires, presenting significant, related to privacy, inherent biases
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract:Pre-training video transformers generally requires a large amount of data, presenting significant challenges in terms of data collection costs and concerns related to privacy, licensing, and inherent biases. Synthesizing data is one of the promising ways to solve these issues, yet pre-training solely on synthetic data has its own challenges. In this paper, we introduce an effective self-supervised learning framework for videos that leverages readily available and less costly static images. Specifically, we define the Pseudo Motion Generator (PMG) module that recursively applies image transformations to generate pseudo-motion videos from images. These pseudo-motion videos are then leveraged in masked video modeling. Our approach is applicable to synthetic images as well, thus entirely freeing video pre-training from data collection costs and other concerns in real data. Through experiments in action recognition tasks, we demonstrate that this framework allows effective learning of spatio-temporal features through pseudo-motion videos, significantly improving over existing methods which also use static images and partially outperforming those using both real and synthetic videos. These results uncover fragments of what video transformers learn through masked video modeling.
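
The abstract describes a Pseudo Motion Generator that recursively applies image transformations to turn a static image into a pseudo-motion video. A minimal sketch of that recursion, assuming a toy transformation (a cyclic one-pixel shift) chosen only for illustration; the function names and the transformation are hypothetical and not taken from the paper:

```python
def shift_right(image):
    """Hypothetical transformation: cyclically shift every row one pixel to the right."""
    return [row[-1:] + row[:-1] for row in image]


def pseudo_motion_video(image, transform, num_frames):
    """Build a frame list where each frame is the transform of the previous one."""
    frames = [image]
    for _ in range(num_frames - 1):
        frames.append(transform(frames[-1]))
    return frames


# A tiny 2x3 "image" with a single bright pixel in the top-left corner.
image = [[1, 0, 0],
         [0, 0, 0]]
video = pseudo_motion_video(image, shift_right, num_frames=3)
```

Because each frame is derived from the previous one, the bright pixel appears to move across the frames, which is the kind of synthetic motion the abstract says is then used for masked video modeling.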

[CV-7] World-Grounded Human Motion Recovery via Gravity-View Coordinates SIGGRAPH

[CV-8] Image Vectorization with Depth: convexified shape layers with depth ordering

Link: https://arxiv.org/abs/2409.06648
Authors: Ho Law,Sung Ha Kang
Keywords: vector graphic format, depth ordering, ordering, scalable vector graphic, depth
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Image vectorization is a process to convert a raster image into a scalable vector graphic format. The objective is to effectively remove the pixelization effect while representing boundaries of image by scalable parameterized curves. We propose new image vectorization with depth which considers depth ordering among shapes and use curvature-based inpainting for convexifying shapes in vectorization process. From a given color quantized raster image, we first define each connected component of the same color as a shape layer, and construct depth ordering among them using a newly proposed depth ordering energy. Global depth ordering among all shapes is described by a directed graph, and we propose an energy to remove cycle within the graph. After constructing depth ordering of shapes, we convexify occluded regions by Euler’s elastica curvature-based variational inpainting, and leverage on the stability of Modica-Mortola double-well potential energy to inpaint large regions. This is following human vision perception that boundaries of shapes extend smoothly, and we assume shapes are likely to be convex. Finally, we fit Bézier curves to the boundaries and save vectorization as an SVG file which allows superposition of curvature-based inpainted shapes following the depth ordering. This is a new way to vectorize images, by decomposing an image into scalable shape layers with computed depth ordering. This approach makes editing shapes and images more natural and intuitive. We also consider grouping shape layers for semantic vectorization. We present various numerical results and comparisons against recent layer-based vectorization methods to validate the proposed model.
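
The abstract describes the global depth ordering as a directed graph whose cycles must be removed before shapes can be layered. As a rough illustration of why acyclicity matters, the sketch below derives a back-to-front drawing order with a topological sort from Python's standard `graphlib` module. It assumes the graph is already acyclic; the paper's cycle-removing energy is not reproduced here, and the `occludes` mapping and shape names are hypothetical:

```python
from graphlib import CycleError, TopologicalSorter


def draw_order(occludes):
    """Return shapes from back to front.

    `occludes` maps each shape to the set of shapes lying directly beneath it,
    so every shape is emitted only after everything it covers.
    """
    return list(TopologicalSorter(occludes).static_order())


def has_cycle(occludes):
    """Report whether the depth graph is inconsistent (contains a cycle)."""
    try:
        draw_order(occludes)
    except CycleError:
        return True
    return False


# Hypothetical scene: a circle on top of a square on top of the background.
layers = {"circle": {"square"}, "square": {"background"}, "background": set()}
order = draw_order(layers)
```

A cyclic graph (shape A above B and B above A) has no valid drawing order, which is the situation the abstract says the proposed energy is designed to resolve.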

[CV-9] EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

[CV-10] SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

Link: https://arxiv.org/abs/2409.06633
Authors: Teng Hu,Jiangning Zhang,Ran Yi,Hongrui Huang,Yabiao Wang,Lizhuang Ma
Keywords: Stable Diffusion series, Diffusion series playing, Stable Diffusion, video generation tasks, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Parameter efficient finetuning method

Abstract:In recent years, the development of diffusion models has led to significant progress in image and video generation tasks, with pre-trained models like the Stable Diffusion series playing a crucial role. Inspired by model pruning which lightens large pre-trained models by removing unimportant parameters, we propose a novel model fine-tuning method to make full use of these ineffective parameters and enable the pre-trained model with new task-specified capabilities. In this work, we first investigate the importance of parameters in pre-trained diffusion models, and discover that the smallest 10% to 20% of parameters by absolute values do not contribute to the generation process. Based on this observation, we propose a method termed SaRA that re-utilizes these temporarily ineffective parameters, equating to optimizing a sparse weight matrix to learn the task-specific knowledge. To mitigate overfitting, we propose a nuclear-norm-based low-rank sparse training scheme for efficient fine-tuning. Furthermore, we design a new progressive parameter adjustment strategy to make full use of the re-trained/finetuned parameters. Finally, we propose a novel unstructural backpropagation strategy, which significantly reduces memory costs during fine-tuning. Our method enhances the generative capabilities of pre-trained models in downstream applications and outperforms traditional fine-tuning methods like LoRA in maintaining model’s generalization ability. We validate our approach through fine-tuning experiments on SD models, demonstrating significant improvements. SaRA also offers a practical advantage that requires only a single line of code modification for efficient implementation and is seamlessly compatible with existing methods.
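
The core observation in the SaRA abstract is that the smallest 10% to 20% of parameters by absolute value can be reused for fine-tuning, which amounts to optimizing a sparse weight matrix. A minimal sketch of that idea on a flat list of weights, assuming a simple magnitude threshold and a plain gradient step; the function names, the example values, and the update rule are hypothetical, and the low-rank and progressive components described in the abstract are omitted:

```python
def smallest_magnitude_mask(weights, fraction):
    """Return a 0/1 mask marking the `fraction` of entries with the smallest |w|."""
    k = int(len(weights) * fraction)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(weights))]


def sparse_update(weights, grads, mask, lr):
    """Apply a gradient step only where mask == 1; all other weights stay frozen."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


weights = [0.9, -0.01, 0.5, 0.02, -0.7, 0.003, 0.3, -0.4, 0.05, 0.6]
mask = smallest_magnitude_mask(weights, fraction=0.2)
grads = [1.0] * len(weights)  # placeholder gradients for illustration
updated = sparse_update(weights, grads, mask, lr=0.1)
```

Only the two smallest-magnitude entries (indices 1 and 5) change here; the large weights that carry the pre-trained behavior are left untouched.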

[CV-11] Towards Localizing Structural Elements: Merging Geometrical Detection with Semantic Verification in RGB-D Data

Link: https://arxiv.org/abs/2409.06625
Authors: Ali Tourani,Saad Ejaz,Hriday Bavle,Jose Luis Sanchez-Lopez,Holger Voos
Keywords: Visual Simultaneous Localization, cameras supply rich, RGB-D cameras supply, dense visual, Simultaneous Localization
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 6 pages, 5 figures, 3 tables

Abstract:RGB-D cameras supply rich and dense visual and spatial information for various robotics tasks such as scene understanding, map reconstruction, and localization. Integrating depth and visual information can aid robots in localization and element mapping, advancing applications like 3D scene graph generation and Visual Simultaneous Localization and Mapping (VSLAM). While point cloud data containing such information is primarily used for enhanced scene understanding, exploiting their potential to capture and represent rich semantic information has yet to be adequately targeted. This paper presents a real-time pipeline for localizing building components, including wall and ground surfaces, by integrating geometric calculations for pure 3D plane detection followed by validating their semantic category using point cloud data from RGB-D cameras. It has a parallel multi-thread architecture to precisely estimate poses and equations of all the planes detected in the environment, filters the ones forming the map structure using a panoptic segmentation validation, and keeps only the validated building components. Incorporating the proposed method into a VSLAM framework confirmed that constraining the map with the detected environment-driven semantic elements can improve scene understanding and map reconstruction accuracy. It can also ensure (re-)association of these detected components into a unified 3D scene graph, bridging the gap between geometric accuracy and semantic understanding. Additionally, the pipeline allows for the detection of potential higher-level structural entities, such as rooms, by identifying the relationships between building components based on their layout.

[CV-12] MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification

Link: https://arxiv.org/abs/2409.06620
Authors: Phu Pham,Aradhya N. Mathur,Ojaswa Sharma,Aniket Bera
Keywords: Score Distillation Sampling, Distillation Sampling, Score Distillation, made significant progress, methodologies like Score
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 13 pages, 10 figures

Abstract:The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methodologies like Score Distillation Sampling (SDS) offering promising guidance. However, these methods often encounter the “Janus” problem (multi-face ambiguities) due to imprecise guidance. Additionally, while recent advancements in 3D Gaussian splatting have shown its efficacy in representing 3D volumes, optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns gaussians close to the surface, optimizing the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, demonstrating that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, offering a substantial efficiency gain over most existing methods, which require hours of training time to achieve comparable results.

[CV-13] Hierarchical Multi-Label Classification with Missing Information for Benthic Habitat Imagery

Link: https://arxiv.org/abs/2409.06618
Authors: Isaac Xu,Benjamin Misiuk,Scott C. Lowe,Martin Gillis,Craig J. Brown,Thomas Trappenberg
Keywords: self-supervised learning techniques, complex hierarchical multi-label, self-supervised learning, seafloor imagery, learning techniques
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In this work, we apply state-of-the-art self-supervised learning techniques on a large dataset of seafloor imagery, BenthicNet, and study their performance for a complex hierarchical multi-label (HML) classification downstream task. In particular, we demonstrate the capacity to conduct HML training in scenarios where there exist multiple levels of missing annotation information, an important scenario for handling heterogeneous real-world data collected by multiple research groups with differing data collection protocols. We find that, when using smaller one-hot image label datasets typical of local or regional scale benthic science projects, models pre-trained with self-supervision on a larger collection of in-domain benthic data outperform models pre-trained on ImageNet. In the HML setting, we find the model can attain a deeper and more precise classification if it is pre-trained with self-supervision on in-domain data. We hope this work can establish a benchmark for future models in the field of automated underwater image annotation tasks and can guide work in other domains with hierarchical annotations of mixed resolution.

[CV-14] When to Extract ReID Features: A Selective Approach for Improved Multiple Object Tracking FAST

Link: https://arxiv.org/abs/2409.06617
Authors: Emirhan Bayar, Cemal Aker
Keywords: Multiple Object Tracking, Multiple Object, Extracting and matching, matching Re-Identification, Object Tracking
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures. Presents a selective approach for ReID feature extraction in Multiple Object Tracking, reducing computational overhead while maintaining accuracy. Tested on StrongSORT and Deep OC-SORT using MOT17, MOT20, and DanceTrack datasets. Code: this https URL, this https URL

Abstract:Extracting and matching Re-Identification (ReID) features is used by many state-of-the-art (SOTA) Multiple Object Tracking (MOT) methods and is particularly effective against frequent and long-term occlusions. While end-to-end object detection and tracking have been the main focus of recent research, they have yet to outperform traditional methods in benchmarks like MOT17 and MOT20. Thus, from an application standpoint, methods with separate detection and embedding remain the best option for accuracy, modularity, and ease of implementation, though they are impractical for edge devices due to the overhead involved. In this paper, we investigate a selective approach to minimize the overhead of feature extraction while preserving accuracy, modularity, and ease of implementation. This approach can be integrated into various SOTA methods. We demonstrate its effectiveness by applying it to StrongSORT and Deep OC-SORT. Experiments on MOT17, MOT20, and DanceTrack datasets show that our mechanism retains the advantages of feature extraction during occlusions while significantly reducing runtime. Additionally, it improves accuracy by preventing confusion in the feature-matching stage, particularly in cases of deformation and appearance similarity, which are common in DanceTrack.

[CV-15] Improving the Precision of CNNs for Magnetic Resonance Spectral Modeling

Link: https://arxiv.org/abs/2409.06609
Authors: John LaMaster, Dhritiman Das, Florian Kofler, Jason Crane, Yan Li, Tobias Lasser, Bjoern H Menze
Keywords: Magnetic resonance spectroscopic, resonance spectroscopic imaging, Magnetic resonance, tissue of interest, integrate clinically
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 1 figure, 2 tables

Abstract:Magnetic resonance spectroscopic imaging is a widely available imaging modality that can non-invasively provide a metabolic profile of the tissue of interest, yet is challenging to integrate clinically. One major reason is the expensive, expert data processing and analysis that is required. Using machine learning to predict MRS-related quantities offers avenues around this problem, but deep learning models bring their own challenges, especially model trust. Current research trends focus primarily on mean error metrics, but comprehensive precision metrics are also needed, e.g. standard deviations and confidence intervals. This work highlights why more comprehensive error characterization is important and how to improve the precision of CNNs for spectral modeling, a quantitative task. The results highlight advantages and trade-offs of these techniques that should be considered when addressing such regression tasks with CNNs. Detailed insights into the underlying mechanisms of each technique, and how they interact with other techniques, are discussed in depth.

[CV-16] A Practical Gated Recurrent Transformer Network Incorporating Multiple Fusions for Video Denoising

Link: https://arxiv.org/abs/2409.06603
Authors: Kai Guo, Seungwon Choi, Jongseong Choi, Lae-Hoon Kim
Keywords: video denoising methods, simultaneous denoising mechanisms, denoising methods employ, methods employ multi-frame, employ multi-frame simultaneous
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 5 figures

Abstract:State-of-the-art (SOTA) video denoising methods employ multi-frame simultaneous denoising mechanisms, resulting in significant delays (e.g., 16 frames), making them impractical for real-time cameras. To overcome this limitation, we propose a multi-fusion gated recurrent Transformer network (GRTN) that achieves SOTA denoising performance with only a single-frame delay. Specifically, the spatial denoising module extracts features from the current frame, while the reset gate selects relevant information from the previous frame and fuses it with current frame features via the temporal denoising module. The update gate then further blends this result with the previous frame features, and the reconstruction module integrates it with the current frame. To robustly compute attention for noisy features, we propose a residual simplified Swin Transformer with Euclidean distance (RSSTE) in the spatial and temporal denoising modules. Comparative objective and subjective results show that our GRTN achieves denoising performance comparable to SOTA multi-frame delay networks, with only a single-frame delay.

[CV-17] Lightweight Multiscale Feature Fusion Super-Resolution Network Based on Two-branch Convolution and Transformer

Link: https://arxiv.org/abs/2409.06590
Authors: Li Ke, Liu Yukai
Keywords: single image super-resolution, convolutional neural networks, network model based, model, image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 12 figures

Abstract:Deep-learning single image super-resolution (SISR) algorithms currently fall into two main families, one based on convolutional neural networks and the other based on Transformers. The former stacks convolutional layers with different kernel sizes, which enables the model to better extract local image features; the latter uses the self-attention mechanism to establish long-distance dependencies between image pixels and thus better extract global image features. However, both approaches have their own limitations. This paper therefore proposes a new lightweight multi-scale feature fusion network based on two-way complementary convolution and Transformer branches, which integrates the respective strengths of Transformers and convolutional neural networks through a two-branch architecture to fuse global and local information. In addition, considering the partial loss of information caused by training deep neural networks on low-pixel images, this paper designs a modular connection method of multi-stage feature supplementation that fuses the feature maps extracted in the shallow stages of the model with those extracted in the deep stages, minimizing the loss of feature information that is beneficial to image restoration and yielding a higher-quality restored image. Experimental results show that the proposed model achieves the best image recovery performance when compared with other lightweight models with the same number of parameters.

[CV-18] Seg-HGNN: Unsupervised and Light-Weight Image Segmentation with Hyperbolic Graph Neural Networks BMVC2024

Link: https://arxiv.org/abs/2409.06589
Authors: Debjyoti Mondal, Rahul Mishra, Chandan Pandey
Keywords: euclidean space, space through linear, linear hyperspaces, Image analysis, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024

Abstract:Image analysis in Euclidean space through linear hyperspaces is well studied. However, in the quest for more effective image representations, we turn to hyperbolic manifolds. They provide a compelling alternative to capture complex hierarchical relationships in images with remarkably small dimensionality. To demonstrate hyperbolic embeddings’ competence, we introduce a light-weight hyperbolic graph neural network for image segmentation, encompassing patch-level features in a very small embedding size. Our solution, Seg-HGNN, surpasses the current best unsupervised method by 2.5% and 4% on VOC-07 and VOC-12 for localization, and by 0.8% and 1.3% on CUB-200 and ECSSD for segmentation, respectively. With fewer than 7.5k trainable parameters, Seg-HGNN delivers effective and fast (≈2 images/second) results on very standard GPUs like the GTX1650. This empirical evaluation presents compelling evidence of the efficacy and potential of hyperbolic representations for vision tasks.

[CV-19] Transtreaming: Adaptive Delay-aware Transformer for Real-time Streaming Perception AAAI2025

Link: https://arxiv.org/abs/2409.06584
Authors: Xiang Zhang, Yufei Cui, Chenchen Fu, Weiwei Wu, Zihao Wang, Yuyang Sun, Xue Liu
Keywords: Real-time object detection, Real-time object, object detection, decision-making process, collision avoidance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to AAAI 2025

Abstract:Real-time object detection is critical for the decision-making process for many real-world applications, such as collision avoidance and path planning in autonomous driving. This work presents an innovative real-time streaming perception method, Transtreaming, which addresses the challenge of real-time object detection with dynamic computational delay. The core innovation of Transtreaming lies in its adaptive delay-aware transformer, which can concurrently predict multiple future frames and select the output that best matches the real-world present time, compensating for any system-induced computation delays. The proposed model outperforms the existing state-of-the-art methods, even in single-frame detection scenarios, by leveraging a transformer-based methodology. It demonstrates robust performance across a range of devices, from powerful V100 to modest 2080Ti, achieving the highest level of perceptual accuracy on all platforms. Unlike most state-of-the-art methods that struggle to complete computation within a single frame on less powerful devices, Transtreaming meets the stringent real-time processing requirements on all kinds of devices. The experimental results emphasize the system’s adaptability and its potential to significantly improve the safety and reliability for many real-world systems, such as autonomous driving.

[CV-20] Semi-Supervised 3D Object Detection with Chanel Augmentation using Transformation Equivariance ICIP

Link: https://arxiv.org/abs/2409.06583
Authors: Minju Kang, Taehun Kong, Tae-Kyun Kim
Keywords: safely and effectively, crucial for autonomous, autonomous vehicles, vehicles and robots, robots to navigate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 2024 IEEE International Conference on Image Processing (ICIP)

Abstract:Accurate 3D object detection is crucial for autonomous vehicles and robots to navigate and interact with the environment safely and effectively. Meanwhile, the performance of 3D detectors relies on data size and annotation, which is expensive. Consequently, the demand for training with limited labeled data is growing. We explore a novel teacher-student framework employing channel augmentation for 3D semi-supervised object detection. Teacher-student SSL typically applies a weak augmentation to the teacher and a strong augmentation to the student. In this work, we apply multiple channel augmentations to both networks using the transformation equivariance detector (TED). The TED allows us to explore different combinations of augmentation on point clouds and efficiently aggregates multi-channel transformation equivariance features. In principle, by adopting fixed channel augmentations for the teacher network, the student can train stably on reliable pseudo-labels. Adopting strong channel augmentations can enrich the diversity of data, fostering robustness to transformations and enhancing generalization performance of the student network. We use SOTA hierarchical supervision as a baseline and adapt its dual-threshold to TED, which is called channel IoU consistency. We evaluate our method on the KITTI dataset and achieve a significant performance leap, surpassing SOTA 3D semi-supervised object detection models.

[CV-21] Quantifying and Enabling the Interpretability of CLIP-like Models

[CV-22] PoseEmbroider: Towards a 3D Visual Semantic-aware Human Pose Representation ECCV2024

Link: https://arxiv.org/abs/2409.06535
Authors: Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez
Keywords: Aligning multiple modalities, Aligning multiple, produce powerful semantic, powerful semantic visual, latent space
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in ECCV 2024

Abstract:Aligning multiple modalities in a latent space, such as images and texts, has been shown to produce powerful semantic visual representations, fueling tasks like image captioning, text-to-image generation, or image grounding. In the context of human-centric vision, although CLIP-like representations encode most standard human poses relatively well (such as standing or sitting), they lack sufficient acuity to discern detailed or uncommon ones. Moreover, while 3D human poses have often been associated with images (e.g. to perform pose estimation or pose-conditioned image generation), or more recently with text (e.g. for text-to-pose generation), they have seldom been paired with both. In this work, we combine 3D poses, pictures of people and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities. When composing modalities, it outperforms a standard multi-modal alignment retrieval model, making it possible to sort out partial information (e.g. image with the lower body occluded). We showcase the potential of such an embroidered pose representation for (1) SMPL regression from an image with an optional text cue; and (2) the task of fine-grained instruction generation, which consists of generating a text that describes how to move from one 3D pose to another (as a fitness coach would). Unlike prior works, our model can take any kind of input (image and/or pose) without retraining.

[CV-23] In Flight Boresight Rectification for Lightweight Airborne Pushbroom Imaging Spectrometry

Link: https://arxiv.org/abs/2409.06520
Authors: Julien Yuuki Burkhard, Jesse Ray Murray Lahaye, Laurent Valentin Jospin, Jan Skaloud
Keywords: UAV or small, small aircraft, RGB or Multispectral, recently been miniaturized, miniaturized for operation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:Hyperspectral cameras have recently been miniaturized for operation on lightweight airborne platforms such as UAVs or small aircraft. Unlike frame cameras (RGB or Multispectral), many hyperspectral sensors use a linear array or ‘push-broom’ scanning design. This design presents significant challenges for image rectification and the calibration of the intrinsic and extrinsic camera parameters. Typically, methods employed to address such tasks rely on a precise GPS/INS estimate of the airborne platform trajectory and a detailed terrain model. However, inaccuracies in the trajectory or surface model information can introduce systematic errors and complicate geometric modeling, which ultimately degrades the quality of the rectification. To overcome these challenges, we propose a method for tie point extraction and camera calibration for ‘push-broom’ hyperspectral sensors using only the raw spectral imagery and raw, possibly low quality, GPS/INS trajectory. We demonstrate that our approach allows for the automatic calibration of airborne systems with hyperspectral cameras, outperforms other state-of-the-art automatic rectification methods and reaches an accuracy on par with manual calibration methods.

[CV-24] Aligning Machine and Human Visual Representations across Abstraction Levels

[CV-25] Neural Laplacian Operator for 3D Point Clouds SIGGRAPH

Link: https://arxiv.org/abs/2409.06506
Authors: Bo Pang, Zhongtian Zheng, Yilong Li, Guoping Wang, Peng-Shuai Wang
Keywords: Laplacian operator, ground-truth Laplacian operator, discrete Laplacian operator, Laplacian operator holds, learned Laplacian operator
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: SIGGRAPH Asia 2024 (Journal Track)

Abstract:The discrete Laplacian operator holds a crucial role in 3D geometry processing, yet it is still challenging to define it on point clouds. Previous works mainly focused on constructing a local triangulation around each point to approximate the underlying manifold for defining the Laplacian operator, which may not be robust or accurate. In contrast, we simply use the K-nearest neighbors (KNN) graph constructed from the input point cloud and learn the Laplacian operator on the KNN graph with graph neural networks (GNNs). However, the ground-truth Laplacian operator is defined on a manifold mesh with a different connectivity from the KNN graph and thus cannot be directly used for training. To train the GNN, we propose a novel training scheme by imitating the behavior of the ground-truth Laplacian operator on a set of probe functions so that the learned Laplacian operator behaves similarly to the ground-truth Laplacian operator. We train our network on a subset of ShapeNet and evaluate it across a variety of point clouds. Compared with previous methods, our method reduces the error by an order of magnitude and excels in handling sparse point clouds with thin structures or sharp features. Our method also demonstrates a strong generalization ability to unseen shapes. With our learned Laplacian operator, we further apply a series of Laplacian-based geometry processing algorithms directly to point clouds and achieve accurate results, enabling many exciting possibilities for geometry processing on point clouds. The code and trained models are available at this https URL.
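
For orientation, the KNN graph that this abstract takes as its starting point can be sketched as follows. This is a minimal pure-Python illustration and not the authors' code: it builds a symmetrized k-nearest-neighbor graph and the plain combinatorial graph Laplacian L = D - A, a fixed, non-learned operator of the kind that the GNN-learned Laplacian is meant to improve on. The function names `knn_graph` and `graph_laplacian`, the unweighted edges, and the sample points are all assumptions made for this example.

```python
import math


def knn_graph(points, k):
    """Return symmetric neighbor sets for the k-nearest-neighbor graph."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        # Sort all other points by Euclidean distance to point i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in order[:k]:
            # Symmetrize: keep an edge if either endpoint selects the other.
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def graph_laplacian(neighbors):
    """Dense combinatorial Laplacian L = D - A (unweighted edges, an assumption)."""
    n = len(neighbors)
    lap = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbors):
        lap[i][i] = float(len(nbrs))  # degree on the diagonal
        for j in nbrs:
            lap[i][j] = -1.0  # minus adjacency off the diagonal
    return lap


# Hypothetical toy point cloud: three nearby points and one far away.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
laplacian = graph_laplacian(knn_graph(points, k=2))
```

Every row of `laplacian` sums to zero, so applying it to a constant function returns zero, a basic property that any discrete Laplacian should satisfy.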

[CV-26] Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models

[CV-27] UAVDB: Trajectory-Guided Adaptable Bounding Boxes for UAV Detection

Link: https://arxiv.org/abs/2409.06490
Authors: Yu-Hsi Chen
Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, Patch Intensity Convergence, drone technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: 7 pages, 5 figures, 3 tables

Abstract:With the rapid development of drone technology, accurate detection of Unmanned Aerial Vehicles (UAVs) has become essential for applications such as surveillance, security, and airspace management. In this paper, we propose a novel trajectory-guided method, the Patch Intensity Convergence (PIC) technique, which generates high-fidelity bounding boxes for UAV detection tasks without the effort required for manual labeling. The PIC technique forms the foundation for developing UAVDB, a database explicitly created for UAV detection. Unlike existing datasets, which often use low-resolution footage or focus on UAVs in simple backgrounds, UAVDB employs high-resolution video to capture UAVs at various scales, ranging from hundreds of pixels to nearly single-digit sizes. This broad-scale variation enables comprehensive evaluation of detection algorithms across different UAV sizes and distances. Applying the PIC technique, we can also efficiently generate detection datasets from trajectory or positional data, even without size information. We extensively benchmark UAVDB using YOLOv8 series detectors, offering a detailed performance analysis. Our findings highlight UAVDB’s potential as a vital database for advancing UAV detection, particularly in high-resolution and long-distance tracking scenarios.

[CV-28] Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

Link: https://arxiv.org/abs/2409.06485
Authors: Xiaoyu Liang, Jiayuan Yu, Lianrui Mu, Jiedong Zhuang, Jiaqi Hu, Yuchen Yang, Jiangnan Ye, Lu Lu, Jian Chen, Haoji Hu
Keywords: visual question answering, shown impressive capabilities, shown impressive, question answering, attention distribution
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: PRCV

Abstract:Although Visual-Language Models (VLMs) have shown impressive capabilities in tasks like visual question answering and image captioning, they still struggle with hallucinations. Analysis of attention distribution in these models shows that VLMs tend to process textual tokens rather than visual tokens. This imbalance of attention distribution causes VLMs to favor textual knowledge in the case of multimodal knowledge conflicts, resulting in outputs that differ from the image information. In this paper, we propose the Re-Balancing Contrastive Decoding (RBD) method, which employs textual and visual branches to recalibrate attention distribution in VLMs. Specifically, the textual branch injects image noise to stimulate the model’s dependency on text, thereby reducing textual bias. Concurrently, the visual branch focuses on the selection of significant tokens, refining the attention mechanism to highlight the primary subject. This dual-branch strategy enables the RBD method to diminish textual bias while enhancing visual information. Experimental results demonstrate that our method, RBD, outperforms existing methods on the CHAIR and POPE metrics, mitigating hallucinations without reducing the model’s general capabilities.

[CV-29] NeIn: Telling What You Don't Want

Link: https://arxiv.org/abs/2409.06481
Authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Quoc-Huy Trinh, Minh-Triet Tran, Truong Nguyen, Susan Gauch
Keywords: fundamental linguistic concept, fundamental linguistic, linguistic concept, humans to convey, convey information
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Negation is a fundamental linguistic concept used by humans to convey information that they do not desire. Despite this, there has been minimal research specifically focused on negation within vision-language tasks. This lack of research means that vision-language models (VLMs) may struggle to understand negation, implying that they struggle to provide accurate results. One barrier to achieving human-level intelligence is the lack of a standard collection by which research into negation can be evaluated. This paper presents the first large-scale dataset, Negative Instruction (NeIn), for studying negation within the vision-language domain. Our dataset comprises 530,694 quadruples, i.e., source image, original caption, negative sentence, and target image in total, including 495,694 queries for training and 35,000 queries for benchmarking across multiple vision-language tasks. Specifically, we automatically generate NeIn based on a large, existing vision-language dataset, MS-COCO, via two steps: generation and filtering. During the generation phase, we leverage two VLMs, BLIP and MagicBrush, to generate the target image and a negative clause that expresses the content of the source image. In the subsequent filtering phase, we apply BLIP to remove erroneous samples. Additionally, we introduce an evaluation protocol for negation understanding of image editing models. Extensive experiments using our dataset across multiple VLMs for instruction-based image editing tasks demonstrate that even recent state-of-the-art VLMs struggle to understand negative queries. The project page is: this https URL

[CV-30] Multi-scale Cycle Tracking in Dynamic Planar Graphs

Link: https://arxiv.org/abs/2409.06476
Authors: Farhan Rasheed, Abrar Naseer, Emma Nilsson, Talha Bin Masood, Ingrid Hotz
Keywords: paper presents, framework for analyzing, nested tracking framework, granular materials, force networks
Categories: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Comments: TopoInVis 2024, 11 pages

Abstract:This paper presents a nested tracking framework for analyzing cycles in 2D force networks within granular materials. These materials are composed of interacting particles, whose interactions are described by a force network. Understanding the cycles within these networks at various scales and their evolution under external loads is crucial, as they significantly contribute to the mechanical and kinematic properties of the system. Our approach involves computing a cycle hierarchy by partitioning the 2D domain into segments bounded by cycles in the force network. We can adapt concepts from nested tracking graphs originally developed for merge trees by leveraging the duality between this partitioning and the cycles. We demonstrate the effectiveness of our method on two force networks derived from experiments with photoelastic disks.

[CV-31] Weakly-supervised Camera Localization by Ground-to-satellite Image Registration ECCV2024

Link: https://arxiv.org/abs/2409.06471
Authors: Yujiao Shi, Hongdong Li, Akhil Perincherry, Ankit Vora
Keywords: accurate GPS labels, ground camera localization, Real Time Kinematics, GPS labels, initially proposed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024

Abstract:The ground-to-satellite image matching/retrieval was initially proposed for city-scale ground camera localization. This work addresses the problem of improving camera pose accuracy by ground-to-satellite image matching after a coarse location and orientation have been obtained, either from the city-scale retrieval or from consumer-level GPS and compass sensors. Existing learning-based methods for solving this task require accurate GPS labels of ground images for network training. However, obtaining such accurate GPS labels is difficult, often requiring an expensive Real Time Kinematics (RTK) setup and suffering from signal occlusion, multi-path signal disruptions, etc. To alleviate this issue, this paper proposes a weakly supervised learning strategy for ground-to-satellite image registration when only noisy pose labels for ground images are available for network training. It derives positive and negative satellite images for each ground image and leverages contrastive learning to learn feature representations for ground and satellite images useful for translation estimation. We also propose a self-supervision strategy for cross-view image relative rotation estimation, which trains the network by creating pseudo query and reference image pairs. Experimental results show that our weakly supervised learning strategy achieves the best performance on cross-area evaluation compared to recent state-of-the-art methods that are reliant on accurate pose labels for supervision.

[CV-32] Learning Generative Interactive Environments By Trained Agent Exploration

[CV-33] Knowledge Distillation via Query Selection for Detection Transformer

Link: https://arxiv.org/abs/2409.06443
Authors: Yi Liu, Luting Wang, Zongheng Tang, Yue Liao, Yifan Sun, Lijun Zhang, Si Liu
Keywords: Transformers have revolutionized, object detection landscape, simplicity and efficacy, detection landscape, landscape by introducing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Transformers have revolutionized the object detection landscape by introducing DETRs, acclaimed for their simplicity and efficacy. Despite their advantages, the substantial size of these models poses significant challenges for practical deployment, particularly in resource-constrained environments. This paper addresses the challenge of compressing DETR by leveraging knowledge distillation, a technique that holds promise for maintaining model performance while reducing size. A critical aspect of DETRs’ performance is their reliance on queries to interpret object representations accurately. Traditional distillation methods often focus exclusively on positive queries, identified through bipartite matching, neglecting the rich information present in hard-negative queries. Our visual analysis indicates that hard-negative queries, focusing on foreground elements, are crucial for enhancing distillation outcomes. To this end, we introduce a novel Group Query Selection strategy, which diverges from traditional query selection in DETR distillation by segmenting queries based on their Generalized Intersection over Union (GIoU) with ground truth objects, thereby uncovering valuable hard-negative queries for distillation. Furthermore, we present the Knowledge Distillation via Query Selection for DETR (QSKD) framework, which incorporates Attention-Guided Feature Distillation (AGFD) and Local Alignment Prediction Distillation (LAPD). These components optimize the distillation process by focusing on the most informative aspects of the teacher model’s intermediate features and output. Our comprehensive experimental evaluation on the MS-COCO dataset demonstrates the effectiveness of our approach, significantly improving average precision (AP) across various DETR architectures without incurring substantial computational costs. Specifically, the AP of Conditional DETR ResNet-18 increased from 35.8 to 39.9.
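
The idea of grouping queries by their GIoU with ground-truth boxes can be sketched roughly as below. This is an illustrative reading of the abstract only, not the QSKD implementation: the GIoU formula is the standard one, but the function `group_queries`, the three group names, and the threshold of 0.0 are hypothetical choices made for this example. Boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples with positive area.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def group_queries(pred_boxes, gt_boxes, positive_ids, hard_negative_threshold=0.0):
    """Split query indices into positive, hard-negative and easy-negative groups.

    positive_ids holds the queries matched by bipartite matching; the
    threshold is a hypothetical value, not one taken from the paper.
    """
    groups = {"positive": [], "hard_negative": [], "easy_negative": []}
    for idx, pred in enumerate(pred_boxes):
        if idx in positive_ids:
            groups["positive"].append(idx)
            continue
        best = max(giou(pred, gt) for gt in gt_boxes)
        key = "hard_negative" if best > hard_negative_threshold else "easy_negative"
        groups[key].append(idx)
    return groups


# Hypothetical example: one ground-truth box and three predicted query boxes.
gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
pred_boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)]
groups = group_queries(pred_boxes, gt_boxes, positive_ids={0})
```

In this toy case the unmatched query that overlaps the ground-truth box lands in the hard-negative group, while the distant one is treated as an easy negative.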

[CV-34] Prompt2Fashion: An automatically generated fashion dataset

Link: https://arxiv.org/abs/2409.06442
Authors: Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
Keywords: customized fashion solutions, vision generative models, Large Language Models, leverage generative models, AI-driven design
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Despite the rapid evolution and increasing efficacy of language and vision generative models, there remains a lack of comprehensive datasets that bridge the gap between personalized fashion needs and AI-driven design, limiting the potential for truly inclusive and customized fashion solutions. In this work, we leverage generative models to automatically construct a fashion image dataset tailored to various occasions, styles, and body types as instructed by users. We use different Large Language Models (LLMs) and prompting strategies to offer personalized outfits of high aesthetic quality, detail, and relevance to both expert and non-expert users’ requirements, as demonstrated by qualitative analysis. Up until now the evaluation of the generated outfits has been conducted by non-expert human subjects. Despite the provided fine-grained insights on the quality and relevance of generation, we extend the discussion on the importance of expert knowledge for the evaluation of artistic AI-generated datasets such as this one. Our dataset is publicly available on GitHub at this https URL.

[CV-35] A Likelihood Ratio-Based Approach to Segmenting Unknown Objects

Link: https://arxiv.org/abs/2409.06424
Authors: Nazir Nayal, Youssef Shoeb, Fatma Güney
Keywords: perception systems operating, Large foundational models, open-world environment, Large foundational, prerequisite for perception
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 2 figures, and 4 tables

Abstract:Addressing the Out-of-Distribution (OoD) segmentation task is a prerequisite for perception systems operating in an open-world environment. Large foundational models are frequently used in downstream tasks; however, their potential for OoD remains mostly unexplored. We seek to leverage a large foundational model to achieve robust representation. Outlier supervision is a widely used strategy for improving OoD detection of the existing segmentation networks. However, current approaches for outlier supervision involve retraining parts of the original network, which is typically disruptive to the model’s learned feature representation. Furthermore, retraining becomes infeasible in the case of large foundational models. Our goal is to retrain for outlier segmentation without compromising the strong representation space of the foundational model. To this end, we propose an adaptive, lightweight unknown estimation module (UEM) for outlier supervision that significantly enhances the OoD segmentation performance without affecting the learned feature representation of the original network. UEM learns a distribution for outliers and a generic distribution for known classes. Using the learned distributions, we propose a likelihood-ratio-based outlier scoring function that fuses the confidence of UEM with that of the pixel-wise segmentation inlier network to detect unknown objects. We also propose an objective to optimize this score directly. Our approach achieves a new state-of-the-art across multiple datasets, outperforming the previous best method by 5.74% average precision points while having a lower false-positive rate. Importantly, strong inlier performance remains unaffected.
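
As a rough illustration of what a likelihood-ratio outlier score looks like in general, the sketch below compares the log density of a feature under an "outlier" distribution with its log density under an "inlier" distribution. It is not the UEM module described in the abstract: the 1-D Gaussian densities, their hypothetical parameters, and the function names are assumptions for this example, and the fusion with the segmentation network's confidence is omitted.

```python
import math


def gaussian_log_density(x, mean, std):
    """Log density of a 1-D Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_likelihood_ratio(x, outlier=(3.0, 1.0), inlier=(0.0, 1.0)):
    """Return log p_out(x) - log p_in(x); positive values favor the outlier hypothesis.

    The (mean, std) pairs are hypothetical stand-ins for learned distributions.
    """
    return gaussian_log_density(x, *outlier) - gaussian_log_density(x, *inlier)


# Hypothetical scalar features standing in for per-pixel representations.
features = [-0.5, 1.0, 4.0]
scores = [log_likelihood_ratio(x) for x in features]
flags = [score > 0 for score in scores]  # True marks a likely unknown object
```

With these parameters the score reduces to 3x - 4.5, so only the feature closest to the outlier mean is flagged.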

[CV-36] Sources of Uncertainty in 3D Scene Reconstruction ALT ECCV2024

Link: https://arxiv.org/abs/2409.06407
Authors: Marcus Klasson,Riccardo Mereu,Juho Kannala,Arno Solin
Keywords: Neural Radiance Fields, Gaussian Splatting, real-world scenes, affected by numerous, Radiance Fields
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To appear in ECCV 2024 Workshop Proceedings. Project page at this https URL

Abstract:The process of 3D scene reconstruction can be affected by numerous uncertainty sources in real-world scenes. While Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (GS) achieve high-fidelity rendering, they lack built-in mechanisms to directly address or quantify uncertainties arising from the presence of noise, occlusions, confounding outliers, and imprecise camera pose inputs. In this paper, we introduce a taxonomy that categorizes different sources of uncertainty inherent in these methods. Moreover, we extend NeRF- and GS-based methods with uncertainty estimation techniques, including learning uncertainty outputs and ensembles, and perform an empirical study to assess their ability to capture the sensitivity of the reconstruction. Our study highlights the need for addressing various uncertainty aspects when designing NeRF/GS-based methods for uncertainty-aware 3D reconstruction.

[CV-37] AMNS: Attention-Weighted Selective Mask and Noise Label Suppression for Text-to-Image Person Retrieval

Link: https://arxiv.org/abs/2409.06385
Authors: Runqing Zhang,Xue Zhou
Keywords: training image-text pairs, image-text pairs due, person retrieval aims, image-text pairs, poor image quality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image person retrieval aims to retrieve images of a person given textual descriptions. Most methods implicitly assume that the training image-text pairs are correctly aligned, but in practice, under-correlated and false-correlated problems arise for image-text pairs due to poor image quality and mislabeling. Meanwhile, the random masking augmentation strategy may incorrectly discard semantic content, resulting in noisy pairings between image lexical elements and text descriptions. To solve these two problems, we propose a new noise label suppression method and alleviate the problem generated by random masking through an attention-weighted selective mask strategy. In the proposed noise label suppression method, the effect of noise labels is suppressed by preventing the model from being overconfident through an inverse KL divergence loss, which is combined with a weight-adjusted focal loss to further improve the model’s recognition ability on difficult samples. On the other hand, Attention-Weighted Selective Mask processes the raw image through the EMA version of the image encoder, retaining some of the tokens with strong semantic associations with the corresponding text descriptions in order to extract better features. Numerous experiments validate the effectiveness of our approach in terms of dealing with noisy problems. The code will be available soon at this https URL.

[CV-38] A Cross-Font Image Retrieval Network for Recognizing Undeciphered Oracle Bone Inscriptions

Link: https://arxiv.org/abs/2409.06381
Authors: Zhicong Wu,Qifeng Su,Ke Gu,Xiaodong Shi
Keywords: Oracle Bone Inscription, Oracle Bone, Bone Inscription, earliest mature writing, mature writing system
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Oracle Bone Inscription (OBI) is the earliest mature writing system known in China to date, which represents a crucial stage in the development of hieroglyphs. Nevertheless, the substantial quantity of undeciphered OBI characters continues to pose a persistent challenge for scholars, while conventional methods of ancient script research are both time-consuming and labor-intensive. In this paper, we propose a cross-font image retrieval network (CFIRN) to decipher OBI characters by establishing associations between OBI characters and other script forms, simulating the interpretive behavior of paleography scholars. Concretely, our network employs a siamese framework to extract deep features from character images of various fonts, fully exploring structure clues with different resolution by designed multiscale feature integration (MFI) module and multiscale refinement classifier (MRC). Extensive experiments on three challenging cross-font image retrieval datasets demonstrate that, given undeciphered OBI characters, our CFIRN can effectively achieve accurate matches with characters from other gallery fonts.

[CV-39] Distilling Generative-Discriminative Representations for Very Low-Resolution Face Recognition

[CV-40] Texture-AD: An Anomaly Detection Dataset and Benchmark for Real Algorithm Development

[CV-41] DiffQRCoder: Diffusion-based Aesthetic QR Code Generation with Scanning Robustness Guided Iterative Refinement

Link: https://arxiv.org/abs/2409.06355
Authors: Jia-Wei Liao,Winston Wang,Tzu-Sian Wang,Li-Xuan Peng,Ju-Hsuan Weng,Cheng-Fu Chou,Jun-Cheng Chen
Keywords: aesthetic Quick Response, Quick Response, image generation, Diffusion Models, code generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the success of Diffusion Models for image generation, these technologies have also revolutionized aesthetic Quick Response (QR) code generation. Despite significant improvements in the visual attractiveness of the beautified codes, their scannability is usually sacrificed, which hinders their practical use in real-world scenarios. To address this issue, we propose a novel Diffusion-based QR Code generator (DiffQRCoder) to effectively craft both scannable and visually pleasing QR codes. The proposed approach introduces Scanning-Robust Perceptual Guidance (SRPG), a new diffusion guidance for Diffusion Models that keeps the generated aesthetic codes faithful to the ground-truth QR codes while maintaining their attractiveness during the denoising process. Additionally, we present a post-processing technique, Scanning Robust Manifold Projected Gradient Descent (SR-MPGD), to further enhance their scanning robustness through iterative latent space optimization. With extensive experiments, the results demonstrate that our approach not only outperforms other compared methods in Scanning Success Rate (SSR) with better or comparable CLIP aesthetic score (CLIP-aes.) but also significantly improves the SSR of the ControlNet-only approach from 60% to 99%. The subjective evaluation indicates that our approach achieves promising visual attractiveness to users as well. Finally, even with different scanning angles and the most rigorous error tolerance settings, our approach robustly achieves over 95% SSR, demonstrating its capability for real-world applications.

[CV-42] Multi-Weather Image Restoration via Histogram-Based Transformer Feature Enhancement

Link: https://arxiv.org/abs/2409.06334
Authors: Yang Wen,Anyu Lai,Bo Qian,Hao Wang,Wuzhen Shi,Wenming Cao
Keywords: Task Intra-patch Block, adverse weather conditions, weather conditions, mainstream restoration tasks, predominantly focused
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2409.03249

Abstract:Currently, the mainstream restoration tasks under adverse weather conditions have predominantly focused on single-weather scenarios. However, in reality, multiple weather conditions always coexist and their degree of mixing is usually unknown. Under such complex and diverse weather conditions, single-weather restoration models struggle to meet practical demands. This is particularly critical in fields such as autonomous driving, where there is an urgent need for a model capable of effectively handling mixed weather conditions and enhancing image quality in an automated manner. In this paper, we propose a Task Sequence Generator module that, in conjunction with the Task Intra-patch Block, effectively extracts task-specific features embedded in degraded images. The Task Intra-patch Block introduces an external learnable sequence that aids the network in capturing task-specific information. Additionally, we employ a histogram-based transformer module as the backbone of our network, enabling the capture of both global and local dynamic range features. Our proposed model achieves state-of-the-art performance on public datasets.

[CV-43] SDF-Net: A Hybrid Detection Network for Mediastinal Lymph Node Detection on Contrast CT Images

Link: https://arxiv.org/abs/2409.06324
Authors: Jiuli Xiong,Lanzhuju Mei,Jiameng Liu,Dinggang Shen,Zhong Xue,Xiaohuan Cao
Keywords: Accurate lymph node, impact treatment planning, Accurate lymph, lymph node detection, contrast-enhanced CT images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures

Abstract:Accurate lymph node detection and quantification are crucial for cancer diagnosis and staging on contrast-enhanced CT images, as they impact treatment planning and prognosis. However, detecting lymph nodes in the mediastinal area poses challenges due to their low contrast, irregular shapes and dispersed distribution. In this paper, we propose a Swin-Det Fusion Network (SDF-Net) to effectively detect lymph nodes. SDF-Net integrates features from both segmentation and detection to enhance the detection capability of lymph nodes with various shapes and sizes. Specifically, an auto-fusion module is designed to merge the feature maps of segmentation and detection networks at different levels. To facilitate effective learning without mask annotations, we introduce a shape-adaptive Gaussian kernel to represent lymph nodes in the training stage and provide more anatomical information for effective learning. Comparative results demonstrate promising performance in addressing the complex lymph node detection problem.

[CV-44] G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer

Link: https://arxiv.org/abs/2409.06322
Authors: Jinzhi Zhang,Feng Xiong,Mu Xu
Keywords: shown substantial promise, revolutionized generative models, language processing, processing and shown, shown substantial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures

Abstract:Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship between different levels suitable for autoregressive modeling. Additionally, the cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports diverse conditional structures, enabling the generation of 3D shapes from various types of conditions. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.

[CV-45] Seam Carving as Feature Pooling in CNN

Link: https://arxiv.org/abs/2409.06311
Authors: Mohammad Imrul Jubair
Keywords: Convolutional Neural Networks, Neural Networks, image classification tasks, Convolutional Neural, technique within Convolutional
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work investigates the potential of seam carving as a feature pooling technique within Convolutional Neural Networks (CNNs) for image classification tasks. We propose replacing the traditional max pooling layer with a seam carving operation. Our experiments on the Caltech-UCSD Birds 200-2011 dataset demonstrate that the seam carving-based CNN achieves better performance compared to the model utilizing max pooling, based on metrics such as accuracy, precision, recall, and F1-score. We further analyze the behavior of both approaches through feature map visualizations, suggesting that seam carving might preserve more structural information during the pooling process. Additionally, we discuss the limitations of our approach and propose potential future directions for research.
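
For readers unfamiliar with seam carving, the following is a minimal sketch of the classic dynamic-programming step applied to a single 2D map. It is not the authors' layer: the abstract does not say how energy is defined or how many seams are removed, so the `energy` input and the function names `find_vertical_seam` and `remove_vertical_seam` are illustrative assumptions. Removing one vertical seam narrows the map by one column, so repeating the step is one way the operation could act as a pooling-style downsampler.

```python
def find_vertical_seam(energy):
    """Return one column index per row tracing the minimum-energy vertical seam."""
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = cheapest cumulative energy of any seam ending at (r, c)
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            row.append(energy[r][c] + min(prev[lo:hi + 1]))
        cost.append(row)
    # Backtrack from the cheapest cell in the last row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam


def remove_vertical_seam(feature_map, seam):
    """Drop the seam's cell from every row, narrowing the map by one column."""
    return [row[:c] + row[c + 1:] for row, c in zip(feature_map, seam)]


# Hypothetical 3x3 energy map with a zig-zag path of low-energy cells.
energy = [[5, 1, 5],
          [5, 5, 1],
          [5, 1, 5]]
seam = find_vertical_seam(energy)
pooled = remove_vertical_seam(energy, seam)
```

Here the energy map doubles as the feature map for brevity; in a CNN setting the energy would presumably be derived from the activations themselves.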

[CV-46] PPMamba: A Pyramid Pooling Local Auxiliary SSM-Based Model for Remote Sensing Image Semantic Segmentation

Link: https://arxiv.org/abs/2409.06309
Authors: Yin Hu,Xianping Ma,Jialu Sui,Man-On Pun
Keywords: remote sensing, field of remote, space model, Pyramid Pooling Mamba, state space model
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Semantic segmentation is a vital task in the field of remote sensing (RS). However, conventional convolutional neural network (CNN) and transformer-based models face limitations in capturing long-range dependencies or are often computationally intensive. Recently, an advanced state space model (SSM), namely Mamba, was introduced, offering linear computational complexity while effectively establishing long-distance dependencies. Despite their advantages, Mamba-based methods encounter challenges in preserving local semantic information. To cope with these challenges, this paper proposes a novel network called Pyramid Pooling Mamba (PPMamba), which integrates CNN and Mamba for RS semantic segmentation tasks. The core structure of PPMamba, the Pyramid Pooling-State Space Model (PP-SSM) block, combines a local auxiliary mechanism with an omnidirectional state space model (OSS) that selectively scans feature maps from eight directions, capturing comprehensive feature information. Additionally, the auxiliary mechanism includes pyramid-shaped convolutional branches designed to extract features at multiple scales. Extensive experiments on two widely-used datasets, ISPRS Vaihingen and LoveDA Urban, demonstrate that PPMamba achieves competitive performance compared to state-of-the-art models.

[CV-47] High-Performance Few-Shot Segmentation with Foundation Models: An Empirical Study

Link: https://arxiv.org/abs/2409.06305
Authors: Shijie Chang,Lihe Zhang,Huchuan Lu
Keywords: Existing few-shot segmentation, Existing few-shot, implicit knowledge, FSS, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review

Abstract:Existing few-shot segmentation (FSS) methods mainly focus on designing novel support-query matching and self-matching mechanisms to exploit implicit knowledge in pre-trained backbones. However, the performance of these methods is often constrained by models pre-trained on classification tasks. The exploration of what types of pre-trained models can provide more beneficial implicit knowledge for FSS remains limited. In this paper, inspired by the representation consistency of foundational computer vision models, we develop an FSS framework based on foundation models. To be specific, we propose a simple approach to extract implicit knowledge from foundation models to construct coarse correspondence and introduce a lightweight decoder to refine coarse correspondence for fine-grained segmentation. We systematically summarize the performance of various foundation models on FSS and discover that the implicit knowledge within some of these models is more beneficial for FSS than models pre-trained on classification tasks. Extensive experiments on two widely used datasets demonstrate the effectiveness of our approach in leveraging the implicit knowledge of foundation models. Notably, the combination of DINOv2 and DFN exceeds previous state-of-the-art methods by 17.5% on COCO-20i. Code is available at this https URL.

[CV-48] An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection

Link: https://arxiv.org/abs/2409.06300
Authors: Pengfei Qi,Yifei Zhang,Wenqiang Li,Youwen Hu,Kunlong Bai
Keywords: Detecting objects, complex to describe, due to perceptual, human annotators, interest through language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting objects of interest through language often presents challenges, particularly with objects that are uncommon or complex to describe, due to perceptual discrepancies between automated models and human annotators. These challenges highlight the need for comprehensive datasets that go beyond standard object labels by incorporating detailed attribute descriptions. To address this need, we introduce the Objects365-Attr dataset, an extension of the existing Objects365 dataset, distinguished by its attribute annotations. This dataset reduces inconsistencies in object detection by integrating a broad spectrum of attributes, including color, material, state, texture and tone. It contains an extensive collection of 5.6M object-level attribute descriptions, meticulously annotated across 1.4M bounding boxes. Additionally, to validate the dataset’s effectiveness, we conduct a rigorous evaluation of YOLO-World at different scales, measuring their detection performance and demonstrating the dataset’s contribution to advancing object detection.

[CV-49] Enhancing Long Video Understanding via Hierarchical Event-Based Memory

[CV-50] EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification ECCV2024

Link: https://arxiv.org/abs/2409.06290
Authors: Suorong Yang,Furao Shen,Jian Zhao
Keywords: Data augmentation, improve the generalization, deep neural networks, Data, EntAugment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024

Abstract:Data augmentation (DA) has been widely used to improve the generalization of deep neural networks. While existing DA methods have proven effective, they often rely on augmentation operations with random magnitudes to each sample. However, this approach can inadvertently introduce noise, induce distribution shifts, and increase the risk of overfitting. In this paper, we propose EntAugment, a tuning-free and adaptive DA framework. Unlike previous work, EntAugment dynamically assesses and adjusts the augmentation magnitudes for each sample during training, leveraging insights into both the inherent complexities of training samples and the evolving status of deep models. Specifically, in EntAugment, the magnitudes are determined by the information entropy derived from the probability distribution obtained by applying the softmax function to the model’s output. In addition, to further enhance the efficacy of EntAugment, we introduce a novel entropy regularization term, EntLoss, which complements the EntAugment approach. Theoretical analysis further demonstrates that EntLoss, compared to traditional cross-entropy loss, achieves closer alignment between the model distributions and underlying dataset distributions. Moreover, EntAugment and EntLoss can be utilized separately or jointly. We conduct extensive experiments across multiple image classification tasks and network architectures with thorough comparisons of existing DA methods. Importantly, the proposed methods outperform others without introducing any auxiliary models or noticeable extra computational costs, highlighting both effectiveness and efficiency. Code is available at this https URL.
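
The entropy computation the abstract describes can be sketched as follows. This is an illustration only: the abstract states that the magnitude is derived from the entropy of the softmax distribution, but not the exact mapping, so the `1 - normalized entropy` rule in `augmentation_magnitude` is an assumption, as are all function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw model outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(logits):
    """Shannon entropy of softmax(logits), scaled to [0, 1] (needs >= 2 classes)."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def augmentation_magnitude(logits):
    """Assumed mapping: confident predictions (low entropy) get stronger augmentation."""
    return 1.0 - normalized_entropy(logits)


uncertain = augmentation_magnitude([0.0, 0.0, 0.0, 0.0])   # uniform prediction
confident = augmentation_magnitude([20.0, 0.0, 0.0, 0.0])  # sharply peaked prediction
```

Under this assumed rule, a sample the model is still unsure about receives almost no augmentation, while one it classifies confidently receives close to the maximum, with no magnitude hyperparameter to tune.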

[CV-51] Context Enhancement with Reconstruction as Sequence for Unified Unsupervised Anomaly Detection

Link: https://arxiv.org/abs/2409.06285
Authors: Hui-Yue Yang,Hui Chen,Lihao Liu,Zijia Lin,Kai Chen,Liejun Wang,Jungong Han,Guiguang Ding
Keywords: Unsupervised anomaly detection, train robust detection, robust detection models, anomaly detection, robust detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised anomaly detection (AD) aims to train robust detection models using only normal samples that nonetheless generalize well to unseen anomalies. Recent research focuses on a unified unsupervised AD setting in which only one model is trained for all classes, i.e., the n-class-one-model paradigm. Feature-reconstruction-based methods achieve state-of-the-art performance in this scenario. However, existing methods often suffer from a lack of sufficient contextual awareness, thereby compromising the quality of the reconstruction. To address this issue, we introduce a novel Reconstruction as Sequence (RAS) method, which enhances the contextual correspondence during feature reconstruction from a sequence modeling perspective. In particular, based on the transformer technique, we integrate a specialized RASFormer block into RAS. This block enables the capture of spatial relationships among different image regions and enhances sequential dependencies throughout the reconstruction process. By incorporating the RASFormer block, our RAS method achieves superior contextual awareness capabilities, leading to remarkable performance. Experimental results show that our RAS significantly outperforms competing methods, demonstrating the effectiveness and superiority of our method. Our code is available at this https URL.

[CV-52] Towards Robust Uncertainty-Aware Incomplete Multi-View Classification

[CV-53] Mahalanobis k-NN: A Statistical Lens for Robust Point-Cloud Registrations

Link: https://arxiv.org/abs/2409.06267
Authors: Tejas Anvekar,Shivanand Venkanna Sheshappanavar
Keywords: discuss Mahalanobis k-NN, statistical lens designed, point cloud, Mahalanobis k-NN, point cloud registration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we discuss Mahalanobis k-NN: a statistical lens designed to address the challenges of feature matching in learning-based point cloud registration when confronted with an arbitrary density of point clouds, either in the source or target point cloud. We tackle this by adopting Mahalanobis k-NN’s inherent property to capture the distribution of the local neighborhood and surficial geometry. Our method can be seamlessly integrated into any local-graph-based point cloud analysis method. In this paper, we focus on two distinct methodologies: Deep Closest Point (DCP) and Deep Universal Manifold Embedding (DeepUME). Our extensive benchmarking on the ModelNet40 and Faust datasets highlights the efficacy of the proposed method in point cloud registration tasks. Moreover, we establish for the first time that the features acquired through point cloud registration can inherently possess discriminative capabilities. This is evidenced by a substantial improvement of about 20% in the average accuracy observed in the point cloud few-shot classification task benchmarked on ModelNet40 and ScanObjectNN. The code is publicly available at this https URL.
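
As a rough illustration of how a Mahalanobis k-NN differs from a Euclidean one, the sketch below ranks neighbours by the squared Mahalanobis distance d²(a, b) = (a − b)ᵀ Σ⁻¹ (a − b). It makes a simplifying assumption that is not taken from the paper: Σ is the covariance of the whole point set, whereas the abstract speaks of the local neighbourhood distribution. All function names are hypothetical, and the code is plain Python so it runs without third-party libraries.

```python
def covariance(points):
    """Sample covariance matrix of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (p[i] - mean[i]) * (p[j] - mean[j])
    return [[c / (n - 1) for c in row] for row in cov]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for small d)."""
    d = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def mahalanobis_sq(a, b, inv_cov):
    """Squared Mahalanobis distance between points a and b."""
    diff = [x - y for x, y in zip(a, b)]
    d = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(d) for j in range(d))


def mahalanobis_knn(points, query_index, k):
    """Indices of the k nearest neighbours of points[query_index]."""
    inv_cov = invert(covariance(points))
    q = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    others.sort(key=lambda i: mahalanobis_sq(points[i], q, inv_cov))
    return others[:k]


# Hypothetical 2D cloud stretched along x: spread in x is "cheap", spread in y is not.
cloud = [(0.0, 0.0), (10.0, 0.0), (-10.0, 0.0), (0.0, 1.0), (0.0, -1.0), (3.0, 0.0)]
nearest = mahalanobis_knn(cloud, 0, 1)
```

In this toy cloud the Euclidean nearest neighbours of the origin are the two points at distance 1 along y, but the Mahalanobis ranking prefers the point at (3, 0) because displacement along the high-variance x axis is discounted.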

[CV-54] ALSS-YOLO: An Adaptive Lightweight Channel Split and Shuffling Network for TIR Wildlife Detection in UAV Imagery

Link: https://arxiv.org/abs/2409.06259
Authors: Ang He,Xiaobo Li,Ximei Wu,Chengyue Su,Jing Chen,Sheng Xu,Xiaobin Guo
Keywords: Unmanned aerial vehicles, nocturnal wildlife poaching, combating nocturnal wildlife, Unmanned aerial, TIR UAV wildlife
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unmanned aerial vehicles (UAVs) equipped with thermal infrared (TIR) cameras play a crucial role in combating nocturnal wildlife poaching. However, TIR images often face challenges such as jitter and wildlife overlap, requiring UAVs to identify blurred and overlapping small targets. Current traditional lightweight networks deployed on UAVs struggle to extract features from blurry small targets. To address this issue, we developed ALSS-YOLO, an efficient and lightweight detector optimized for TIR aerial images. Firstly, we propose a novel Adaptive Lightweight Channel Split and Shuffling (ALSS) module. This module employs an adaptive channel split strategy to optimize feature extraction and integrates a channel shuffling mechanism to enhance information exchange between channels. This improves the extraction of blurry features, crucial for handling jitter-induced blur and overlapping targets. Secondly, we developed a Lightweight Coordinate Attention (LCA) module that employs adaptive pooling and grouped convolution to integrate feature information across dimensions. This module ensures lightweight operation while maintaining high detection precision and robustness against jitter and target overlap. Additionally, we developed a single-channel focus module to aggregate the width and height information of each channel into four-dimensional channel fusion, which improves the feature representation efficiency of infrared images. Finally, we modify the localization loss function to emphasize the loss value associated with small objects to improve localization accuracy. Extensive experiments on the BIRDSAI and ISOD TIR UAV wildlife datasets show that ALSS-YOLO achieves state-of-the-art performance. Our code is openly available at this https URL.
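
The channel shuffling mentioned above is usually understood as the ShuffleNet-style interleaving of channel groups. The sketch below shows that generic operation on a flat list of channels; it is an assumption that ALSS uses this exact form, and the adaptive split itself is not reproduced here. `channel_shuffle` is a hypothetical name.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: view as (groups, n/groups), transpose, flatten."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Six hypothetical channels split into two groups: (a, b, c) and (d, e, f).
shuffled = channel_shuffle(["a", "b", "c", "d", "e", "f"], groups=2)
```

After the shuffle, each consecutive pair of channels comes from different groups, so a following grouped convolution sees information from both.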

[CV-55] Test-Time Certifiable Self-Supervision to Bridge the Sim2Real Gap in Event-Based Satellite Pose Estimation IROS2024

Link: https://arxiv.org/abs/2409.06240
Authors: Mohsi Jawaid,Rajat Talak,Yasir Latif,Luca Carlone,Tat-Jun Chin
Keywords: Deep learning plays, Deep learning, satellite pose estimation, learning plays, plays a critical
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This work has been accepted for publication at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024). Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Deep learning plays a critical role in vision-based satellite pose estimation. However, the scarcity of real data from the space environment means that deep models need to be trained using synthetic data, which raises the Sim2Real domain gap problem. A major cause of the Sim2Real gap is novel lighting conditions encountered during test time. Event sensors have been shown to provide some robustness against lighting variations in vision-based pose estimation. However, challenging lighting conditions due to strong directional light can still cause undesirable effects in the output of commercial off-the-shelf event sensors, such as noisy/spurious events and inhomogeneous event densities on the object. Such effects are non-trivial to simulate in software, thus leading to a Sim2Real gap in the event domain. To close the Sim2Real gap in event-based satellite pose estimation, the paper proposes a test-time self-supervision scheme with a certifier module. Self-supervision is enabled by an optimisation routine that aligns a dense point cloud of the predicted satellite pose with the event data to attempt to rectify the inaccurately estimated pose. The certifier attempts to verify the corrected pose, and only certified test-time inputs are backpropagated via implicit differentiation to refine the predicted landmarks, thus improving the pose estimates and closing the Sim2Real gap. Results show that our method outperforms established test-time adaptation schemes.

[CV-56] Recurrent Neural Networks for Still Images

Link: https://arxiv.org/abs/2409.06235
Authors: Dmitri (Dima) Lvov,Yair Smadar,Ran Bezen
Keywords: Recurrent Neural Network, Convolutional Neural Networks, Neural Networks, Convolutional Recurrent Neural, Recurrent Neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we explore the application of Recurrent Neural Network (RNN) for still images. Typically, Convolutional Neural Networks (CNNs) are the prevalent method applied for this type of data, and more recently, transformers have gained popularity, although they often require large models. Unlike these methods, RNNs are generally associated with processing sequences over time rather than single images. We argue that RNNs can effectively handle still images by interpreting the pixels as a sequence. This approach could be particularly advantageous for compact models designed for embedded systems, where resources are limited. Additionally, we introduce a novel RNN design tailored for two-dimensional inputs, such as images, and a custom version of BiDirectional RNN (BiRNN) that is more memory-efficient than traditional implementations. In our research, we have tested these layers in Convolutional Recurrent Neural Networks (CRNNs), predominantly composed of Conv2D layers, with RNN layers at or close to the end. Experiments on the COCO and CIFAR100 datasets show better results, particularly for small networks.

[CV-57] A Latent Implicit 3D Shape Model for Multiple Levels of Detail

Link: https://arxiv.org/abs/2409.06231
Authors: Benoit Guillard,Marc Habermann,Christian Theobalt,Pascal Fua
Keywords: neural representations map, shape-specific latent code, signed distance, levels of detail, representations map
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in GCPR 2024 proceedings

Abstract:Implicit neural representations map a shape-specific latent code and a 3D coordinate to its corresponding signed distance (SDF) value. However, this approach only offers a single level of detail. Emulating low levels of detail can be achieved with shallow networks, but the generated shapes are typically not smooth. Alternatively, some network designs offer multiple levels of detail, but are limited to overfitting a single object. To address this, we propose a new shape modeling approach, which enables multiple levels of detail and guarantees a smooth surface at each level. At the core, we introduce a novel latent conditioning for a multiscale and bandwidth-limited neural architecture. This results in a deep parameterization of multiple shapes, where early layers quickly output approximated SDF values. This makes it possible to balance speed and accuracy within a single network and enhance the efficiency of implicit scene rendering. We demonstrate that by limiting the bandwidth of the network, we can maintain smooth surfaces across all levels of detail. At finer levels, reconstruction quality is on par with state-of-the-art models, which are limited to a single level of detail.

[CV-58] MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding WACV2025

Link: https://arxiv.org/abs/2409.06224
Authors: Surbhi Madan,Shreya Ghosh,Lownish Rai Sookha,M.A. Ganaie,Ramanathan Subramanian,Abhinav Dhall,Tom Gedeon
Keywords: Important Person, social event setup, event setup, due to contextual, contextual complexity
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted for publication at WACV 2025

Abstract:Estimating the Most Important Person (MIP) in any social event setup is a challenging problem mainly due to contextual complexity and scarcity of labeled data. Moreover, the causality aspects of MIP estimation are quite subjective and diverse. To this end, we aim to address the problem by annotating a large-scale 'in-the-wild' dataset for identifying human perceptions about the 'Most Important Person (MIP)' in an image. The paper provides a thorough description of our proposed Multimodal Large Language Model (MLLM) based data annotation strategy, and a thorough data quality analysis. Further, we perform a comprehensive benchmarking of the proposed dataset utilizing state-of-the-art MIP localization methods, indicating a significant drop in performance compared to existing datasets. The performance drop shows that the existing MIP localization algorithms must be more robust with respect to 'in-the-wild' situations. We believe the proposed dataset will play a vital role in building the next-generation social situation understanding methods. The code and data are available at this https URL.

[CV-59] CerviXpert: A Multi-Structural Convolutional Neural Network for Predicting Cervix Type and Cervical Cell Abnormalities

[CV-60] Denoising: A Powerful Building-Block for Imaging Inverse Problems and Machine Learning

Link: https://arxiv.org/abs/2409.06219
Authors: Peyman Milanfar,Mauricio Delbracio
Keywords: reducing random fluctuations, modern scientific inquiry, process of reducing, reducing random, random fluctuations
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Denoising, the process of reducing random fluctuations in a signal to emphasize essential patterns, has been a fundamental problem of interest since the dawn of modern scientific inquiry. Recent denoising techniques, particularly in imaging, have achieved remarkable success, nearing theoretical limits by some measures. Yet, despite tens of thousands of research papers, the wide-ranging applications of denoising beyond noise removal have not been fully recognized. This is partly due to the vast and diverse literature, making a clear overview challenging. This paper aims to address this gap. We present a comprehensive perspective on denoisers, their structure, and desired properties. We emphasize the increasing importance of denoising and showcase its evolution into an essential building block for complex tasks in imaging, inverse problems, and machine learning. Despite its long history, the community continues to uncover unexpected and groundbreaking uses for denoising, further solidifying its place as a cornerstone of scientific and engineering practice.

[CV-61] DACAT: Dual-stream Adaptive Clip-aware Time Modeling for Robust Online Surgical Phase Recognition

Link: https://arxiv.org/abs/2409.06217
Authors: Kaixiang Yang,Qiang Li,Zhiwei Wang
Keywords: surgical risk forecasting, Surgical phase, Surgical phase recognition, laparoscopic surgery, enabling various clinical
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures

Abstract:Surgical phase recognition has become a crucial requirement in laparoscopic surgery, enabling various clinical applications like surgical risk forecasting. Current methods typically identify the surgical phase using individual frame-wise embeddings as the fundamental unit for time modeling. However, this approach is overly sensitive to current observations, often resulting in discontinuous and erroneous predictions within a complete surgical phase. In this paper, we propose DACAT, a novel dual-stream model that adaptively learns clip-aware context information to enhance the temporal relationship. In one stream, DACAT pretrains a frame encoder, caching all historical frame-wise features. In the other stream, DACAT fine-tunes a new frame encoder to extract the frame-wise feature at the current moment. Additionally, a max clip-response read-out (Max-R) module is introduced to bridge the two streams by using the current frame-wise feature to adaptively fetch the most relevant past clip from the feature cache. The clip-aware context feature is then encoded via cross-attention between the current frame and its fetched adaptive clip, and further utilized to enhance the time modeling for accurate online surgical phase recognition. The benchmark results on three public datasets, i.e., Cholec80, M2CAI16, and AutoLaparo, demonstrate the superiority of our proposed DACAT over existing state-of-the-art methods, with improvements in Jaccard scores of at least 4.5%, 4.6%, and 2.7%, respectively. Our code and models have been released at this https URL.
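
The read-out step described in this abstract (using the current frame feature to fetch the most relevant past clip from a feature cache) can be pictured with a small sketch. This is not the authors' Max-R module: the dot-product response, the fixed `clip_len`, and the names `fetch_clip` and `cache` are assumptions made purely for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def fetch_clip(cache, current, clip_len):
    """Return (start, end) of the cached window with the highest summed
    response to the current frame feature (hypothetical simplification)."""
    responses = [dot(f, current) for f in cache]
    best_start = max(
        range(len(cache) - clip_len + 1),
        key=lambda s: sum(responses[s:s + clip_len]),
    )
    return best_start, best_start + clip_len

# Toy cache of five 2-D frame features and a current-frame feature.
cache = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (1.0, 0.0), (0.0, 1.0)]
current = (1.0, 0.0)
clip = fetch_clip(cache, current, clip_len=2)
```

In this toy case the window covering frames 2 and 3 responds most strongly to the current feature. A real system would presumably let the clip length adapt rather than fixing it.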

[CV-62] Towards Generalizable Scene Change Detection

[CV-63] INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

Link: https://arxiv.org/abs/2409.06210
Authors: Ji Ha Jang,Hoigi Seo,Se Young Chun
Keywords: supervised affordance grounding, Weakly supervised affordance, affordance grounding, Affordance, potential interactions inherent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Affordance denotes the potential interactions inherent in objects. The perception of affordance can enable intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance without costly pixel-level annotations, but with exocentric images. Although recent advances in weakly supervised affordance grounding have yielded promising results, challenges remain, including the requirement for paired exocentric and egocentric image datasets and the complexity of grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior art, INTRA recasts this problem as representation learning to identify unique features of interactions through contrastive learning with exocentric images only, eliminating the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, designing text-conditioned affordance map generation to reflect interaction relationships for contrastive learning and enhancing robustness with our text synonym augmentation. Our method outperformed prior art on diverse datasets such as AGD20K, IIT-AFF, CAD and UMD. Additionally, experimental results demonstrate that our method has remarkable domain scalability for synthesized images and illustrations and is capable of performing affordance grounding for novel interactions and objects.

[CV-64] AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration

Link: https://arxiv.org/abs/2409.06206
Authors: Hongyi Cai,Mohammad Mahdinur Rahman,Mohammad Shahid Akhtar,Jie Li,Jingyu Wu,Zhili Fang
Keywords: Image Restoration tasks, Image Transformers show, Image Restoration, Restoration tasks, Image Transformers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image Transformers have shown remarkable success in image restoration tasks. Nevertheless, most transformer-based models are constrained by exorbitant memory usage. Our goal is to reduce the memory consumption of Swin Transformer while also speeding up the model during training. To this end, we introduce AgileIR, a group shifted attention mechanism combined with window attention, which sparsely simplifies the model architecture. We propose Group Shifted Window Attention (GSWA) to decompose Shift Window Multi-head Self Attention (SW-MSA) and Window Multi-head Self Attention (W-MSA) into groups across their attention heads, which shrinks memory usage in backpropagation. In addition, we keep shifted window masking and its shifted learnable biases during training so that the model interacts across windows within the channel. We also re-allocate projection parameters to accelerate attention matrix calculation, which we found causes a negligible decrease in performance. In our experiments, compared with our baseline SwinIR and other efficient quantization models, AgileIR maintains performance at 32.20 dB on the Set5 evaluation dataset, exceeding other methods with tailor-made efficient designs, and saves over 50% memory when a large batch size is employed.

[CV-65] RealisDance: Equip controllable character animation with realistic hands

Link: https://arxiv.org/abs/2409.06202
Authors: Jingkai Zhou,Benzhi Wang,Weihua Chen,Jingqi Bai,Dongyang Li,Aixi Zhang,Hao Xu,Mingyang Yang,Fan Wang
Keywords: Controllable character animation, Controllable character, character videos controlled, pose, emerging task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:Controllable character animation is an emerging task that generates character videos controlled by pose sequences from given character images. Although character consistency has made significant progress via reference UNet, another crucial factor, pose control, has not been well studied by existing methods yet, resulting in several issues: 1) The generation may fail when the input pose sequence is corrupted. 2) The hands generated using the DWPose sequence are blurry and unrealistic. 3) The generated video will be shaky if the pose sequence is not smooth enough. In this paper, we present RealisDance to handle all the above issues. RealisDance adaptively leverages three types of poses, avoiding failed generation caused by corrupted pose sequences. Among these pose types, HaMeR provides accurate 3D and depth information of hands, enabling RealisDance to generate realistic hands even for complex gestures. Besides using temporal attention in the main UNet, RealisDance also inserts temporal attention into the pose guidance network, smoothing the video from the pose condition aspect. Moreover, we introduce pose shuffle augmentation during training to further improve generation robustness and video smoothness. Qualitative experiments demonstrate the superiority of RealisDance over other existing methods, especially in hand quality.

[CV-66] Deep kernel representations of latent space features for low-dose PET-MR imaging robust to variable dose reduction

Link: https://arxiv.org/abs/2409.06198
Authors: Cameron Dennis Pain,Yasmeen George,Alex Fornito,Gary Egan,Zhaolin Chen
Keywords: positron emission tomography, Low-dose positron emission, emission tomography, imaging modality, significantly improve PET
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 15 figures, 4 tables, Submitted to IEEE Transactions on Medical Imaging

Abstract:Low-dose positron emission tomography (PET) image reconstruction methods have the potential to significantly improve PET as an imaging modality. Deep learning provides a promising means of incorporating prior information into the image reconstruction problem to produce quantitatively accurate images from compromised signal. Deep learning-based methods for low-dose PET are generally poorly conditioned and perform unreliably on images with features not present in the training distribution. We present a method which explicitly models deep latent space features using a robust kernel representation, providing robust performance on previously unseen dose reduction factors. Additional constraints on the information content of deep latent features allow for tuning in-distribution accuracy and generalisability. Tests with out-of-distribution dose reduction factors ranging from ×10 to ×1000, and with both paired and unpaired MR, demonstrate significantly improved performance relative to conventional deep-learning methods trained using the same data. Code: this https URL

[CV-67] UdeerLID: Integrating LiDAR Image and Relative Depth with Semi-Supervised

Link: https://arxiv.org/abs/2409.06197
Authors: Tao Ni,Xin Zhan,Tao Luo,Wenbin Liu,Zhan Shi,JunBo Chen
Keywords: autonomous driving systems, classify road surfaces, driving systems, requiring accurate, critical task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Road segmentation is a critical task for autonomous driving systems, requiring accurate and robust methods to classify road surfaces from various environmental data. Our work introduces an innovative approach that integrates LiDAR point cloud data, visual images, and relative depth maps derived from images. The integration of multiple data sources in road segmentation presents both opportunities and challenges. One of the primary challenges is the scarcity of large-scale, accurately labeled datasets that are necessary for training robust deep learning models. To address this, we have developed the UdeerLID+ framework under a semi-supervised learning paradigm. Experimental results on KITTI datasets validate its superior performance.

[CV-68] MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control

Link: https://arxiv.org/abs/2409.06189
Authors: Yining Yao,Xi Guo,Chenjing Ding,Wei Wu
Keywords: High-quality driving video, providing training data, High-quality driving, driving video generation, video generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality driving video generation is crucial for providing training data for autonomous driving models. However, current generative models rarely focus on enhancing camera motion control under multi-view tasks, which is essential for driving video generation. Therefore, we propose MyGo, an end-to-end framework for video generation, introducing motion of onboard cameras as conditions to make progress in camera controllability and multi-view consistency. MyGo employs additional plug-in modules to inject camera parameters into the pre-trained video diffusion model, which retains the extensive knowledge of the pre-trained model as much as possible. Furthermore, we use epipolar constraints and neighbor view information during the generation process of each view to enhance spatial-temporal consistency. Experimental results show that MyGo has achieved state-of-the-art results in both general camera-controlled video generation and multi-view driving video generation tasks, which lays the foundation for more accurate environment simulation in autonomous driving. Project page: this https URL

[CV-69] Bottleneck-based Encoder-decoder ARchitecture (BEAR) for Learning Unbiased Consumer-to-Consumer Image Representations ICML ALT

Link: https://arxiv.org/abs/2409.06187
Authors: Pablo Rivas,Gisela Bichler,Tomas Cerny,Laurie Giddens,Stacie Petter
Keywords: Unbiased representation learning, Unbiased representation, applications and contexts, representation learning, object of study
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 2022 LXAI Workshop at the 39th International Conference on Machine Learning (ICML), Baltimore, Maryland

Abstract:Unbiased representation learning is still an object of study under specific applications and contexts. Novel architectures are usually crafted to resolve particular problems using mixtures of fundamental pieces. This paper presents different image feature extraction mechanisms that work together with residual connections to encode perceptual image information in an autoencoder configuration. We use image data that aims to support a larger research agenda dealing with issues regarding criminal activity in consumer-to-consumer online platforms. Preliminary results suggest that the proposed architecture can learn rich spaces using our dataset and other image datasets, resolving important challenges that are identified.

[CV-70] EDADepth: Enhanced Data Augmentation for Monocular Depth Estimation

Link: https://arxiv.org/abs/2409.06183
Authors: Nischal Khanal,Shivanand Venkanna Sheshappanavar
Keywords: visual perception tasks, synthesis feature, diffusion models, perception tasks, rise in visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Due to their text-to-image synthesis feature, diffusion models have recently seen a rise in visual perception tasks, such as depth estimation. The lack of good-quality datasets makes the extraction of a fine-grain semantic context challenging for the diffusion models. The semantic context with fewer details further worsens the process of creating effective text embeddings that will be used as input for diffusion models. In this paper, we propose EDADepth, a novel enhanced data augmentation method to estimate monocular depth without using additional training data. We use Swin2SR, a super-resolution model, to enhance the quality of input images. We employ the BEiT pre-trained semantic segmentation model for better extraction of text embeddings. We introduce the BLIP-2 tokenizer to generate tokens from these text embeddings. The novelty of our approach is the introduction of Swin2SR, the BEiT model, and the BLIP-2 tokenizer in the diffusion-based pipeline for monocular depth estimation. Our model achieves state-of-the-art (SOTA) results on the δ3 metric on NYUv2 and KITTI datasets. It also achieves results comparable to those of the SOTA models in the RMSE and REL metrics. Finally, we also show improvements in the visualization of the estimated depth compared to the SOTA diffusion-based monocular depth estimation models. Code: this https URL.
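
The three metrics named in this abstract are not defined there. The sketch below uses the definitions commonly adopted in monocular depth evaluation (threshold accuracy with threshold 1.25^3, root mean squared error, and mean absolute relative error). That these match the paper's exact evaluation protocol is an assumption, and the function name `depth_metrics` is hypothetical.

```python
import math

def depth_metrics(pred, gt):
    """Return (delta3, rmse, rel) for paired lists of positive depth values."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    # Threshold accuracy: share of pixels with max(p/g, g/p) below 1.25**3.
    delta3 = sum(max(p / g, g / p) < 1.25 ** 3 for p, g in pairs) / n
    # Root mean squared error in depth units.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Mean absolute error relative to the ground-truth depth.
    rel = sum(abs(p - g) / g for p, g in pairs) / n
    return delta3, rmse, rel

# Toy example with three pixels (values are illustrative, not from the paper).
pred = [2.0, 4.0, 10.0]
gt = [2.0, 5.0, 4.0]
delta3, rmse, rel = depth_metrics(pred, gt)
```

Higher is better for δ3, while lower is better for RMSE and REL, which is why a method can lead on one metric and only match others on the rest.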

[CV-71] Loss Distillation via Gradient Matching for Point Cloud Completion with Weighted Chamfer Distance IROS

Link: https://arxiv.org/abs/2409.06171
Authors: Fangzhou Lin,Haotian Liu,Haoying Zhou,Songlin Hou,Kazunori D Yamada,Gregory S. Fischer,Yanhua Li,Haichong K. Zhang,Ziming Zhang
Keywords: grasp pose detection, point clouds enhanced, scene understanding, enhanced the robot, robot ability
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 10 pages, 7 figures, 7 tables, this paper was accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

Abstract:3D point clouds have enhanced the robot's ability to perceive the geometrical information of its environment, enabling many downstream tasks such as grasp pose detection and scene understanding. The performance of these tasks, though, heavily relies on the quality of data input, as incomplete data can lead to poor results and failure cases. Recent training loss functions designed for deep learning-based point cloud completion, such as Chamfer distance (CD) and its variants (e.g., HyperCD), imply that a good gradient weighting scheme can significantly boost performance. However, these CD-based loss functions usually require data-related parameter tuning, which can be time-consuming for data-extensive tasks. To address this issue, we aim to find a family of weighted training losses (weighted CD) that requires no parameter tuning. To this end, we propose a search scheme, Loss Distillation via Gradient Matching, to find good candidate loss functions by mimicking the learning behavior in backpropagation between HyperCD and weighted CD. Once this is done, we propose a novel bilevel optimization formula to train the backbone network based on the weighted CD loss. We observe that: (1) with proper weighted functions, the weighted CD can always achieve similar performance to HyperCD, and (2) the Landau weighted CD, namely Landau CD, can outperform HyperCD for point cloud completion and lead to new state-of-the-art results on several benchmark datasets. Our demo code is available at this https URL.
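
For readers unfamiliar with the loss family discussed in this abstract, here is a minimal sketch of a symmetric Chamfer distance with an optional per-term weight. It assumes the squared-distance variant of CD, and the `weight` argument with its `1 / (1 + d)` example is a hypothetical stand-in: it is neither HyperCD nor the Landau weighting, whose forms the abstract does not give.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chamfer(pred, gt, weight=lambda d: 1.0):
    """Symmetric Chamfer distance; `weight` rescales each nearest-neighbor term."""
    def one_side(src, dst):
        # For every source point, take the squared distance to its nearest target.
        nearest = [min(sq_dist(p, q) for q in dst) for p in src]
        return sum(weight(d) * d for d in nearest) / len(nearest)
    return one_side(pred, gt) + one_side(gt, pred)

# Toy point clouds (illustrative values only).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
plain_cd = chamfer(pred, gt)
# Hypothetical weight that down-weights large nearest-neighbor distances.
weighted_cd = chamfer(pred, gt, weight=lambda d: 1.0 / (1.0 + d))
```

The weighted call shows how a distance-dependent weight changes the relative contribution of each matched pair, which is the kind of gradient reweighting the abstract refers to.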

[CV-72] Revisiting Prompt Pretraining of Vision-Language Models

Link: https://arxiv.org/abs/2409.06166
Authors: Zhenyuan Chen,Lingfeng Yang,Shuo Chen,Zhaowei Chen,Jiajun Liang,Xiang Li
Keywords: customize Vision-Language Models, Prompt, input prompt tokens, prompt pretraining, Revisiting Prompt Pretraining
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Prompt learning is an effective method to customize Vision-Language Models (VLMs) for various downstream tasks, involving tuning very few parameters of input prompt tokens. Recently, prompt pretraining on large-scale datasets (e.g., ImageNet-21K) has played a crucial role in prompt learning for universal visual discrimination. However, we revisit this practice and observe that the limited learnable prompts could face underfitting risks given the extensive images during prompt pretraining, simultaneously leading to poor generalization. To address the above issues, in this paper, we propose a general framework termed Revisiting Prompt Pretraining (RPP), which aims to improve the fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the restriction in common practice where query, key, and value vectors are derived from the shared learnable prompt token. Instead, we introduce unshared individual query, key, and value learnable prompts, thereby enhancing the model's fitting capacity through increased parameter diversity. For prompt supervision, we additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model. These soft labels yield more nuanced and general insights into the inter-class relationships, thereby endowing the pretraining process with better generalization ability. RPP produces a more resilient prompt initialization, enhancing its robust transferability across diverse visual recognition tasks. Experiments across various benchmarks consistently confirm the state-of-the-art (SOTA) performance of our pretrained prompts. Codes and models will be made available soon.
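
The prompt supervision idea in this abstract (hard labels plus soft labels from a teacher's zero-shot probabilities) can be illustrated with a small loss sketch. The additive form, the `soft_weight` parameter, and the function names are assumptions for illustration. The abstract does not state how the two terms are combined.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)

def prompt_loss(student_logits, hard_label, teacher_probs, soft_weight=1.0):
    """Hard-label term plus a soft-label term from teacher probabilities
    (hypothetical combination; soft_weight is an assumed hyperparameter)."""
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_logits))]
    hard_term = cross_entropy(one_hot, student_logits)
    soft_term = cross_entropy(teacher_probs, student_logits)
    return hard_term + soft_weight * soft_term

# Toy example: a student with uniform predictions over three classes.
loss = prompt_loss([0.0, 0.0, 0.0], hard_label=0, teacher_probs=[0.7, 0.2, 0.1])
```

The soft term rewards the student for spreading probability across related classes in the way the teacher does, which is the inter-class information a one-hot label cannot carry.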

[CV-73] UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos

Link: https://arxiv.org/abs/2409.06154
Authors: Yin Chen,Jia Li,Yu Zhang,Zhenzhen Hu,Shiguang Shan,Meng Wang,Richang Hong
Keywords: understanding human emotions, facial expression recognition, dynamic facial data, Dynamic facial, Dynamic facial expression
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dynamic facial expression recognition (DFER) is essential for understanding human emotions and behavior. However, conventional DFER methods, which primarily use dynamic facial data, often underutilize static expression images and their labels, limiting their performance and robustness. To overcome this, we introduce UniLearn, a novel unified learning paradigm that integrates static facial expression recognition (SFER) data to enhance the DFER task. UniLearn employs a dual-modal self-supervised pre-training method, leveraging both facial expression images and videos to enhance a ViT model's spatiotemporal representation capability. Then, the pre-trained model is fine-tuned on both static and dynamic expression datasets using a joint fine-tuning strategy. To prevent negative transfer during joint fine-tuning, we introduce an innovative Mixture of Adapter Experts (MoAE) module that enables task-specific knowledge acquisition and effectively integrates information from both static and dynamic expression data. Extensive experiments demonstrate UniLearn's effectiveness in leveraging complementary information from static and dynamic facial data, leading to more accurate and robust DFER. UniLearn consistently achieves state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively. The source code and model weights will be publicly available at this https URL.

[CV-74] Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

[CV-75] DECOLLAGE: 3D Detailization by Controllable Localized and Learned Geometry Enhancement ECCV2024

Link: https://arxiv.org/abs/2409.06129
Authors: Qimin Chen,Zhiqin Chen,Vladimir G. Kim,Noam Aigerman,Hao Zhang,Siddhartha Chaudhuri
Keywords: content creation, refine or detailize, machine learning, expanding the capabilities, capabilities of AI-assisted
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: ECCV 2024 (poster). Code: this https URL

Abstract:We present a 3D modeling method which enables end-users to refine or detailize 3D shapes using machine learning, expanding the capabilities of AI-assisted 3D content creation. Given a coarse voxel shape (e.g., one produced with a simple box extrusion tool or via generative modeling), a user can directly “paint” desired target styles representing compelling geometric details, from input exemplar shapes, over different regions of the coarse shape. These regions are then up-sampled into high-resolution geometries which adhere with the painted styles. To achieve such controllable and localized 3D detailization, we build on top of a Pyramid GAN by making it masking-aware. We devise novel structural losses and priors to ensure that our method preserves both desired coarse structures and fine-grained features even if the painted styles are borrowed from diverse sources, e.g., different semantic parts and even different shape categories. Through extensive experiments, we show that our ability to localize details enables novel interactive creative workflows and applications. Our experiments further demonstrate that in comparison to prior techniques built on global detailization, our method generates structure-preserving, high-resolution stylized geometries with more coherent shape details and style transitions.

[CV-76] PaRCE: Probabilistic and Reconstruction-Based Competency Estimation for Safe Navigation Under Perception Uncertainty

[CV-77] SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook

Link: https://arxiv.org/abs/2409.06105
Authors: Chenjing Ding,Chiyu Wang,Boshi Liu,Xi Guo,Weixuan Tang,Wei Wu
Keywords: Vector quantization, discrete codebook representations, Vector, deterministically learning features, learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vector quantization (VQ) is a method for deterministically learning features through discrete codebook representations. Recent works have utilized visual tokenizers to discretize visual regions for self-supervised representation learning. However, a notable limitation of these tokenizers is a lack of semantics, as they are derived solely from the pretext task of reconstructing raw image pixels in an auto-encoder paradigm. Additionally, issues like imbalanced codebook distribution and codebook collapse can adversely impact performance due to inefficient codebook utilization. To address these challenges, we introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Utilizing inference results from a segmentation model, our approach constructs a temporospatially consistent semantic codebook, addressing issues of codebook collapse and imbalanced token semantics. Our proposed Pyramid Feature Learning pipeline integrates multi-level features to capture both image details and semantics simultaneously. As a result, SGC-VQGAN achieves SOTA performance in both reconstruction quality and various downstream tasks. Its simplicity, requiring no additional parameter learning, enables its direct application in downstream tasks, presenting significant potential.
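
As background for the codebook issues mentioned in this abstract, the sketch below shows plain nearest-neighbor vector quantization and a simple utilization measure. It illustrates generic VQ only, not the semantic clustering proposed in the paper, and the helper names `quantize` and `codebook_usage` are hypothetical.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]

def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries that were selected at least once."""
    return len(set(indices)) / codebook_size

# Toy 2-D codebook with four entries; the features cluster near the first two.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
features = [(0.1, 0.0), (0.9, 1.2), (0.2, -0.1), (1.1, 0.8)]
indices = quantize(features, codebook)
usage = codebook_usage(indices, len(codebook))
```

Here only half of the entries are ever selected. Low utilization of this kind is what the abstract calls an imbalanced codebook, and in the extreme case codebook collapse.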

[CV-78] LSE-NeRF: Learning Sensor Modeling Errors for Deblured Neural Radiance Fields with RGB-Event Stereo

Link: https://arxiv.org/abs/2409.06104
Authors: Wei Zhi Tang,Daniel Rebain,Kostantinos G. Derpanis,Kwang Moo Yi
Keywords: Neural Radiance Field, clear Neural Radiance, Radiance Field, Neural Radiance, fast camera motions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a method for reconstructing a clear Neural Radiance Field (NeRF) even with fast camera motions. To address blur artifacts, we leverage both (blurry) RGB images and event camera data captured in a binocular configuration. Importantly, when reconstructing our clear NeRF, we consider the camera modeling imperfections that arise from the simple pinhole camera model as learned embeddings for each camera measurement, and further learn a mapper that connects event camera measurements with RGB data. As no previous dataset exists for our binocular setting, we introduce an event camera dataset with captures from a 3D-printed stereo configuration between RGB and event cameras. Empirically, we evaluate our introduced dataset and EVIMOv2 and show that our method leads to improved reconstructions. Our code and dataset are available at this https URL.

[CV-79] SVS-GAN: Leveraging GANs for Semantic Video Synthesis

Link: https://arxiv.org/abs/2409.06074
Authors: Khaled M. Seyam,Julian Wiederer,Markus Braun,Bin Yang
Keywords: Generative Adversarial Networks, Generative Adversarial, Semantic Video Synthesis, Semantic Image Synthesis, Adversarial Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, there has been a growing interest in Semantic Image Synthesis (SIS) through the use of Generative Adversarial Networks (GANs) and diffusion models. This field has seen innovations such as the implementation of specialized loss functions tailored for this task, diverging from the more general approaches in Image-to-Image (I2I) translation. While the concept of Semantic Video Synthesis (SVS) – the generation of temporally coherent, realistic sequences of images from semantic maps – is newly formalized in this paper, some existing methods have already explored aspects of this field. Most of these approaches rely on generic loss functions designed for video-to-video translation or require additional data to achieve temporal coherence. In this paper, we introduce the SVS-GAN, a framework specifically designed for SVS, featuring a custom architecture and loss functions. Our approach includes a triple-pyramid generator that utilizes SPADE blocks. Additionally, we employ a U-Net-based network for the image discriminator, which performs semantic segmentation for the OASIS loss. Through this combination of tailored architecture and objective engineering, our framework aims to bridge the existing gap between SIS and SVS, outperforming current state-of-the-art models on datasets like Cityscapes and KITTI-360.

[CV-80] DiffusionPen: Towards Controlling the Style of Handwritten Text Generation

Link: https://arxiv.org/abs/2409.06065
Authors: Konstantina Nikolaidou,George Retsinas,Giorgos Sfikas,Marcus Liwicki
Keywords: challenging task due, Handwritten Text Generation, Latent Diffusion Models, Diffusion Models, Text Generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Handwritten Text Generation (HTG) conditioned on text and style is a challenging task due to the variability of inter-user characteristics and the unlimited combinations of characters that form new words unseen during training. Diffusion Models have recently shown promising results in HTG but still remain under-explored. We present DiffusionPen (DiffPen), a 5-shot style handwritten text generation approach based on Latent Diffusion Models. By utilizing a hybrid style extractor that combines metric learning and classification, our approach manages to capture both textual and stylistic characteristics of seen and unseen words and styles, generating realistic handwritten samples. Moreover, we explore several variation strategies of the data with multi-style mixtures and noisy embeddings, enhancing the robustness and diversity of the generated data. Extensive experiments using IAM offline handwriting database show that our method outperforms existing methods qualitatively and quantitatively, and its additional generated data can improve the performance of Handwriting Text Recognition (HTR) systems. The code is available at: this https URL.

[CV-81] Online 3D reconstruction and dense tracking in endoscopic videos

Link: https://arxiv.org/abs/2409.06037
Authors: Michel Hayoz,Christopher Hahne,Thomas Kurmann,Max Allan,Guido Beldi,Daniel Candinas,Pablo Márquez-Neila,Raphael Sznitman
Keywords: stereo endoscopic video, endoscopic video data, stereo endoscopic, endoscopic video, video data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D scene reconstruction from stereo endoscopic video data is crucial for advancing surgical interventions. In this work, we present a framework for online, dense 3D scene reconstruction and tracking, aimed at enhancing surgical scene understanding and assisting interventions. Our method dynamically extends a canonical scene representation using Gaussian splatting, while modeling tissue deformations through a sparse set of control points. We introduce an efficient online fitting algorithm that optimizes the scene parameters, enabling consistent tracking and accurate reconstruction. Through experiments on the StereoMIS dataset, we demonstrate the effectiveness of our approach, outperforming state-of-the-art tracking methods and achieving comparable performance to offline reconstruction techniques. Our work enables various downstream applications, thus contributing to advancing the capabilities of surgical assistance systems.

[CV-82] NESI: Shape Representation via Neural Explicit Surface Intersection

链接: https://arxiv.org/abs/2409.06030
作者: Congyi Zhang,Jinfan Yang,Eric Hedlin,Suzuran Takikawa,Nicholas Vining,Kwang Moo Yi,Wenping Wang,Alla Sheffer
关键词-EN: digital media applications, compressed form, processed efficiently directly, Compressed representations, media applications
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Compressed representations of 3D shapes that are compact, accurate, and can be processed efficiently directly in compressed form, are extremely useful for digital media applications. Recent approaches in this space focus on learned implicit or parametric representations. While implicits are well suited for tasks such as in-out queries, they lack natural 2D parameterization, complicating tasks such as texture or normal mapping. Conversely, parametric representations support the latter tasks but are ill-suited for occupancy queries. We propose a novel learned alternative to these approaches, based on intersections of localized explicit, or height-field, surfaces. Since explicits can be trivially expressed both implicitly and parametrically, NESI directly supports a wider range of processing operations than implicit alternatives, including occupancy queries and parametric access. We represent input shapes using a collection of differently oriented height-field bounded half-spaces combined using volumetric Boolean intersections. We first tightly bound each input using a pair of oppositely oriented height-fields, forming a Double Height-Field (DHF) Hull. We refine this hull by intersecting it with additional localized height-fields (HFs) that capture surface regions in its interior. We minimize the number of HFs necessary to accurately capture each input and compactly encode both the DHF hull and the local HFs as neural functions defined over subdomains of R^2. This reduced dimensionality encoding delivers high-quality compact approximations. Given similar parameter count, or storage capacity, NESI significantly reduces approximation error compared to the state of the art, especially at lower parameter counts.

[CV-83] Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

链接: https://arxiv.org/abs/2409.06013
作者: Leanne Nortje,Dan Oneata,Herman Kamper
关键词-EN: prompted keyword localisation, visually prompted keyword, keyword localisation, aims to find, prompted keyword
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

[CV-84] Enhanced Generative Data Augmentation for Semantic Segmentation via Stronger Guidance

链接: https://arxiv.org/abs/2409.06002
作者: Quang-Huy Che,Duc-Tri Le,Vinh-Tiep Nguyen
关键词-EN: Data, Data augmentation, images, semantic, semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Data augmentation is a widely used technique for creating training data for tasks that require labeled data, such as semantic segmentation. This method benefits pixel-wise annotation tasks requiring much effort and intensive labor. Traditional data augmentation methods involve simple transformations like rotations and flips to create new images from existing ones. However, these new images may lack diversity along the main semantic axes in the data and not change high-level semantic properties. To address this issue, generative models have emerged as an effective solution for augmenting data by generating synthetic images. Controllable generative models offer a way to augment data for semantic segmentation tasks using a prompt and visual reference from the original image. However, using these models directly presents challenges, such as creating an effective prompt and visual reference to generate a synthetic image that accurately reflects the content and structure of the original. In this work, we introduce an effective data augmentation method for semantic segmentation using the Controllable Diffusion Model. Our proposed method includes efficient prompt generation using Class-Prompt Appending and Visual Prior Combination to enhance attention to labeled classes in real images. These techniques allow us to generate images that accurately depict segmented classes in the real image. In addition, we employ the class balancing algorithm to ensure efficiency when merging the synthetic and original images to generate balanced data for the training dataset. We evaluated our method on the PASCAL VOC datasets and found it highly effective for synthesizing images in semantic segmentation.

[CV-85] Advance and Refinement: The Evolution of UAV Detection and Classification Technologies

链接: https://arxiv.org/abs/2409.05985
作者: Vladislav Semenyuk,Ildar Kurmashev,Alberto Lupidi,Dmitriy Alyoshin,Liliya Kurmasheva,Alessandro Cantelli-Forti
关键词-EN: unmanned aerial vehicle, aerial vehicle, detailed analysis, advancements in unmanned, unmanned aerial
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 19 pages, 5 figures

Abstract:This review provides a detailed analysis of the advancements in unmanned aerial vehicle (UAV) detection and classification systems from 2020 to today. It covers various detection methodologies such as radar, radio frequency, optical, and acoustic sensors, and emphasizes their integration via sophisticated sensor fusion techniques. The fundamental technologies driving UAV detection and classification are thoroughly examined, with a focus on their accuracy and range. Additionally, the paper discusses the latest innovations in artificial intelligence and machine learning, illustrating their impact on improving the accuracy and efficiency of these systems. The review concludes by predicting further technological developments in UAV detection, which are expected to enhance both performance and reliability.

[CV-86] Towards Narrowing the Generalization Gap in Deep Boolean Networks

链接: https://arxiv.org/abs/2409.05905
作者: Youngsung Kim
关键词-EN: increased computational demands, sharply increased computational, Boolean networks, real-world scenarios, rapid growth
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:The rapid growth of the size and complexity in deep neural networks has sharply increased computational demands, challenging their efficient deployment in real-world scenarios. Boolean networks, constructed with logic gates, offer a hardware-friendly alternative that could enable more efficient implementation. However, their ability to match the performance of traditional networks has remained uncertain. This paper explores strategies to enhance deep Boolean networks with the aim of surpassing their traditional counterparts. We propose novel methods, including logical skip connections and spatiality preserving sampling, and validate them on vision tasks using widely adopted datasets, demonstrating significant improvement over existing approaches. Our analysis shows how deep Boolean networks can maintain high performance while minimizing computational costs through 1-bit logic operations. These findings suggest that Boolean networks are a promising direction for efficient, high-performance deep learning models, with significant potential for advancing hardware-accelerated AI applications.

[CV-87] Memory-Optimized Once-For-All Network

链接: https://arxiv.org/abs/2409.05900
作者: Maxime Girard,Victor Quétu,Samuel Tardieu,Van-Tam Nguyen,Enzo Tartaglione
关键词-EN: Deep Neural Networks, Deploying Deep Neural, varying resource constraints, Neural Architectures Search, Neural Networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

Abstract:Deploying Deep Neural Networks (DNNs) on different hardware platforms is challenging due to varying resource constraints. Besides handcrafted approaches aiming at making deep models hardware-friendly, Neural Architecture Search is rising as a toolbox to craft more efficient DNNs without sacrificing performance. Among these, the Once-For-All (OFA) approach offers a solution by allowing the sampling of well-performing sub-networks from a single supernet, which brings evident advantages in terms of computation. However, OFA does not fully utilize the potential memory capacity of the target device, focusing instead on limiting maximum memory usage per layer. This leaves unexploited potential in terms of model generalizability. In this paper, we introduce a Memory-Optimized OFA (MOOFA) supernet, designed to enhance DNN deployment on resource-limited devices by maximizing memory usage (and, for instance, feature diversity) across different configurations. Tested on ImageNet, our MOOFA supernet demonstrates improvements in memory exploitation and model accuracy compared to the original OFA supernet. Our code is available at this https URL.

[CV-88] Transformer-Enhanced Iterative Feedback Mechanism for Polyp Segmentation

链接: https://arxiv.org/abs/2409.05875
作者: Nikhil Kumar Tomar,Debesh Jha,Koushik Biswas,Tyler M. Berzin,Rajesh Keswani,Michael Wallace,Ulas Bagci
关键词-EN: United States, Colorectal cancer, cancer-related death, cancer diagnosed, CRC
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Colorectal cancer (CRC) is the third most commonly diagnosed cancer in the United States and the second leading cause of cancer-related death among both genders. Notably, CRC is the leading cause of cancer in men younger than 50 years old. Colonoscopy is considered the gold standard for the early diagnosis of CRC. Skills vary significantly among endoscopists, and a high miss rate is reported. Automated polyp segmentation can reduce the miss rate, making timely treatment possible at an early stage. To address this challenge, we introduce FANetv2, an advanced encoder-decoder network designed to accurately segment polyps from colonoscopy images. Leveraging an initial input mask generated by Otsu thresholding, FANetv2 iteratively refines its binary segmentation masks through a novel feedback attention mechanism informed by the mask predictions of previous epochs. Additionally, it employs a text-guided approach that integrates essential information about the number (one or many) and size (small, medium, large) of polyps to further enhance its feature representation capabilities. This dual-task approach facilitates accurate polyp segmentation and aids in the auxiliary classification of polyp attributes, significantly boosting the model’s performance. Our comprehensive evaluations on the publicly available BKAI-IGH and CVC-ClinicDB datasets demonstrate the superior performance of FANetv2, evidenced by high dice similarity coefficients (DSC) of 0.9186 and 0.9481, along with low Hausdorff distances of 2.83 and 3.19, respectively. The source code for FANetv2 is available at this https URL.
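
The abstract above mentions an initial mask generated by Otsu thresholding. As a rough illustration of that step only, here is a minimal pure-Python sketch of the classic Otsu method on a flat list of 8-bit intensities. The function names `otsu_threshold` and `binarize` are hypothetical and are not taken from the FANetv2 code, and the convention that pixels strictly above the threshold are foreground is an assumption.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Class 0 holds intensities <= t, class 1 holds intensities > t.
    `pixels` is a flat sequence of integers in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0 so far
    sum0 = 0    # intensity sum of class 0 so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break               # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Assumed convention: intensities above the threshold are foreground (1)."""
    return [1 if p > threshold else 0 for p in pixels]


pixels = [10] * 50 + [200] * 50   # toy bimodal "image"
t = otsu_threshold(pixels)
mask = binarize(pixels, t)
print(t, sum(mask))
```

In practice a library routine such as the one in scikit-image would normally be used instead; the sketch only shows what the threshold optimizes.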

[CV-89] SpecGaussian with Latent Features: A High-quality Modeling of the View-dependent Appearance for 3D Gaussian Splatting

链接: https://arxiv.org/abs/2409.05868
作者: Zhiru Wang,Shiyun Xie,Chengwei Pan,Guoping Wang
关键词-EN: providing real-time rendering, ensuring high-quality rendering, achieved great success, high-quality rendering results, providing real-time
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, 5 tables, ACM Multimedia 2024

Abstract:Recently, the 3D Gaussian Splatting (3D-GS) method has achieved great success in novel view synthesis, providing real-time rendering while ensuring high-quality rendering results. However, this method faces challenges in modeling specular reflections and handling anisotropic appearance components, especially in dealing with view-dependent color under complex lighting conditions. Additionally, 3D-GS uses spherical harmonics to learn the color representation, which has limited ability to represent complex scenes. To overcome these challenges, we introduce Lantent-SpecGS, an approach that utilizes a universal latent neural descriptor within each 3D Gaussian. This enables a more effective representation of 3D feature fields, including appearance and geometry. Moreover, two parallel CNNs are designed to decode the splatting feature maps into diffuse color and specular color separately. A mask that depends on the viewpoint is learned to merge these two colors, resulting in the final rendered image. Experimental results demonstrate that our method obtains competitive performance in novel view synthesis and extends the ability of 3D-GS to handle intricate scenarios with specular reflections.

[CV-90] COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes AAAI-25

[CV-91] A study on Deep Convolutional Neural Networks Transfer Learning and Ensemble Model for Breast Cancer Detection

链接: https://arxiv.org/abs/2409.06699
作者: Md Taimur Ahad,Sumaya Mustofa,Faruk Ahmed,Yousuf Rayhan Emon,Aunirudra Dey Anu
关键词-EN: ensemble model, transfer learning, breast cancer, breast cancer detection, Convolutional Neural Networks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:In deep learning, transfer learning and ensemble models have shown promise in improving computer-aided disease diagnosis. However, the application of transfer learning and ensemble models is still relatively limited. Moreover, ensemble model development is often ad hoc, overlooks redundant layers, and suffers from imbalanced datasets and inadequate augmentation. Lastly, several notable Deep Convolutional Neural Networks (D-CNNs) have been introduced to detect and classify breast cancer, yet very few comparative studies have investigated the accuracy and efficiency of existing CNN architectures. Recognising these gaps, this study compares the performance of D-CNNs, including the original CNNs, transfer learning, and an ensemble model, in detecting breast cancer. The comparison covers six CNN-based deep learning architectures (SE-ResNet152, MobileNetV2, VGG19, ResNet18, InceptionV3, and DenseNet-121), transfer learning, and an ensemble model. Among these models, the ensemble model provides the highest accuracy, 99.94%, for breast cancer detection and classification. However, this study also reports a negative result for transfer learning, which did not increase the accuracy of the original SE-ResNet152, MobileNetV2, VGG19, ResNet18, InceptionV3, and DenseNet-121 models. The high accuracy in detecting and categorising breast cancer suggests that CNN models are promising for breast cancer detection. This research is significant in biomedical engineering, computer-aided disease diagnosis, and ML-based disease detection.

[CV-92] A comprehensive study on Blood Cancer detection and classification using Convolutional Neural Network

链接: https://arxiv.org/abs/2409.06689
作者: Md Taimur Ahad,Sajib Bin Mamun,Sumaya Mustofa,Bo Song,Yan Li
关键词-EN: Convolutional Neural Networks, efficient Convolutional Neural, Convolutional Neural, Neural Networks, efficient Convolutional
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Over the years, several efficient Convolutional Neural Networks (CNNs) for object detection, such as DenseNet201, InceptionV3, ResNet152v2, SEresNet152, VGG19, and Xception, have gained significant attention due to their performance. Moreover, CNN paradigms have expanded from original CNN architectures to transfer learning and ensemble models. Research studies suggest that transfer learning and ensemble models are capable of increasing the accuracy of deep learning (DL) models. However, very few studies have conducted comprehensive experiments utilizing these techniques in detecting and localizing blood malignancies. Recognizing this gap, this study conducted three experiments: the first used six original CNNs, the second applied transfer learning, and the third developed a novel ensemble model, DIX (DenseNet201, InceptionV3, and Xception), to detect and classify blood cancer. The statistical results suggest that DIX outperformed the original and transfer learning models, providing an accuracy of 99.12%. However, this study also reports a negative result for transfer learning, which did not increase the accuracy of the original CNNs. Like many other cancers, blood cancers require timely identification for effective treatment plans and increased survival possibilities. The high accuracy in detecting and categorizing blood cancer suggests that CNN models are promising for blood cancer detection. This research is significant in the fields of biomedical engineering, computer-aided disease diagnosis, and ML-based disease detection.

[CV-93] A study on deep feature extraction to detect and classify Acute Lymphoblastic Leukemia (ALL)

链接: https://arxiv.org/abs/2409.06687
作者: Sabit Ahamed Preanto(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Md. Taimur Ahad(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Yousuf Rayhan Emon(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Sumaya Mustofa(4IR Research Cell Daffodil International University, Dhaka, Bangladesh),Md Alamin(4IR Research Cell Daffodil International University, Dhaka, Bangladesh)
关键词-EN: Acute lymphoblastic leukaemia, Convolutional Neural Networks, Acute lymphoblastic, specifically Convolutional Neural, lymphoblastic leukaemia
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Acute lymphoblastic leukaemia (ALL) is a blood malignancy that mainly affects adults and children. This study looks into the use of deep learning, specifically Convolutional Neural Networks (CNNs), for the detection and classification of ALL. Conventional techniques for ALL diagnosis, such as bone marrow biopsy, are costly and prone to manual error. By utilising automated technologies, the research seeks to improve diagnostic accuracy. The research uses a variety of pre-trained CNN models, such as InceptionV3, ResNet101, VGG19, DenseNet121, and MobileNetV2, to extract features from images of blood smears. After feature extraction, selection approaches including ANOVA, Recursive Feature Elimination (RFE), Random Forest, Lasso, and Principal Component Analysis (PCA) are used to find the most relevant features. Following that, machine learning methods such as Naïve Bayes, Random Forest, Support Vector Machine (SVM), and K-Nearest Neighbours (KNN) are used to classify these features. With an 87% accuracy rate, the ResNet101 model produced the best results, closely followed by DenseNet121 and VGG19. According to the study, CNN-based models have the potential to decrease the need for medical specialists by increasing the speed and accuracy of ALL diagnosis. To improve model performance, the study also recommends expanding and diversifying datasets and investigating more sophisticated designs such as transformers. This study highlights how well automated deep learning systems perform in medical diagnosis.

[CV-94] Constructing an Interpretable Deep Denoiser by Unrolling Graph Laplacian Regularizer

链接: https://arxiv.org/abs/2409.06676
作者: Seyed Alireza Hosseini,Tam Thuc Do,Gene Cheung,Yuichi Tanaka
关键词-EN: Taylor Series Expansion, graph Laplacian regularizer, wide range, range of restoration, graph Laplacian
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

Abstract:An image denoiser can be used for a wide range of restoration problems via the Plug-and-Play (PnP) architecture. In this paper, we propose a general framework to build an interpretable graph-based deep denoiser (GDD) by unrolling a solution to a maximum a posteriori (MAP) problem equipped with a graph Laplacian regularizer (GLR) as signal prior. Leveraging a recent theorem showing that any (pseudo-)linear denoiser $\boldsymbol{\Psi}$, under mild conditions, can be mapped to a solution of a MAP denoising problem regularized using GLR, we first initialize a graph Laplacian matrix $\mathbf{L}$ via truncated Taylor Series Expansion (TSE) of $\boldsymbol{\Psi}^{-1}$. Then, we compute the MAP linear system solution by unrolling iterations of the conjugate gradient (CG) algorithm into a sequence of neural layers as a feed-forward network, one that is amenable to parameter tuning. The resulting GDD network is “graph-interpretable”, low in parameter count, and easy to initialize thanks to $\mathbf{L}$ derived from a known well-performing denoiser $\boldsymbol{\Psi}$. Experimental results show that GDD achieves image denoising performance competitive with existing methods while employing far fewer parameters, and is more robust to covariate shift.
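
To make the "MAP linear system solved by conjugate gradient" idea concrete, here is a minimal sketch under stated assumptions. It uses the standard GLR denoising objective, minimizing ||y - x||^2 + mu * x^T L x, whose solution satisfies (I + mu * L) x = y, and runs plain CG on it. The path-graph Laplacian, the value of `mu`, and all function names are illustrative assumptions; the paper instead initializes L from a known denoiser and turns the CG iterations into tunable network layers, which this sketch does not attempt.

```python
def path_laplacian(n):
    """Combinatorial Laplacian L = D - W of an unweighted path graph on n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L


def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def glr_denoise_cg(y, L, mu, num_iters):
    """Run up to num_iters CG steps on (I + mu * L) x = y, starting from x = y."""
    n = len(y)
    A = [[(1.0 if i == j else 0.0) + mu * L[i][j] for j in range(n)]
         for i in range(n)]
    x = list(y)
    r = [b - ax for b, ax in zip(y, matvec(A, x))]   # initial residual
    p = list(r)                                      # initial search direction
    rs_old = dot(r, r)
    for _ in range(num_iters):
        if rs_old < 1e-30:       # already converged; avoid dividing by ~0
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


noisy = [0.0, 1.2, 0.1, 1.0, -0.2, 0.9]
L = path_laplacian(len(noisy))
smooth = glr_denoise_cg(noisy, L, mu=2.0, num_iters=20)
print([round(v, 3) for v in smooth])
```

Each pass through the loop is the kind of step that an unrolled network would expose as one layer, with quantities such as the step sizes becoming candidates for tuning.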

[CV-95] Interactive 3D Segmentation for Primary Gross Tumor Volume in Oropharyngeal Cancer

链接: https://arxiv.org/abs/2409.06605
作者: Mikko Saukkoriipi,Jaakko Sahlsten,Joel Jaskari,Lotta Orasmaa,Jari Kangas,Nastaran Rasouli,Roope Raisamo,Jussi Hirvonen,Helena Mehtonen,Jorma Järnstedt,Antti Mäkitie,Mohamed Naser,Clifton Fuller,Benjamin Kann,Kimmo Kaski
关键词-EN: main treatment modality, gross tumor volume, primary gross tumor, accurate GTVp segmentation, main treatment
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:The main treatment modality for oropharyngeal cancer (OPC) is radiotherapy, where accurate segmentation of the primary gross tumor volume (GTVp) is essential. However, accurate GTVp segmentation is challenging due to significant interobserver variability and the time-consuming nature of manual annotation, while fully automated methods can occasionally fail. An interactive deep learning (DL) model offers the advantage of automatic high-performance segmentation with the flexibility for user correction when necessary. In this study, we examine interactive DL for GTVp segmentation in OPC. We implement state-of-the-art algorithms and propose a novel two-stage Interactive Click Refinement (2S-ICR) framework. Using the 2021 HEad and neCK TumOR (HECKTOR) dataset for development and an external dataset from The University of Texas MD Anderson Cancer Center for evaluation, the 2S-ICR framework achieves a Dice similarity coefficient of 0.713 ± 0.152 without user interaction and 0.824 ± 0.099 after five interactions, outperforming existing methods in both cases.
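
For readers unfamiliar with the metric reported above, the Dice similarity coefficient between a predicted mask P and a reference mask T is 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flattened binary masks; the function name and the convention that two empty masks score 1.0 are assumptions made for illustration, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient of two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(pred, target))  # 2 * 2 / (3 + 3)
```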

[CV-96] Continual Domain Incremental Learning for Privacy-aware Digital Pathology MICCAI2024

链接: https://arxiv.org/abs/2409.06455
作者: Pratibha Kumari,Daniel Reisenbüchler,Lucas Luttner,Nadine S. Schaadt,Friedrich Feuerhake,Dorit Merhof
关键词-EN: advanced deep-learning algorithms, complex tissue patterns, model complex tissue, recent years, digital pathology
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in MICCAI 2024

Abstract:In recent years, there has been remarkable progress in the field of digital pathology, driven by the ability to model complex tissue patterns using advanced deep-learning algorithms. However, the robustness of these models is often severely compromised in the presence of data shifts (e.g., different stains, organs, centers, etc.). Alternatively, continual learning (CL) techniques aim to reduce the forgetting of past data when learning new data with distributional shift conditions. Specifically, rehearsal-based CL techniques, which store some past data in a buffer and then replay it with new data, have proven effective in medical image analysis tasks. However, privacy concerns arise as these approaches store past data, prompting the development of our novel Generative Latent Replay-based CL (GLRCL) approach. GLRCL captures the previous distribution through Gaussian Mixture Models instead of storing past samples, which are then utilized to generate features and perform latent replay with new data. We systematically evaluate our proposed framework under different shift conditions in histopathology data, including stain and organ shift. Our approach significantly outperforms popular buffer-free CL approaches and performs similarly to rehearsal-based CL approaches that require large buffers causing serious privacy violations.

[CV-97] Unrevealed Threats: A Comprehensive Study of the Adversarial Robustness of Underwater Image Enhancement Models

链接: https://arxiv.org/abs/2409.06420
作者: Siyu Zhai,Zhibo He,Xiaofeng Cong,Junming Hou,Jie Gui,Jian Wei You,Xin Gong,James Tin-Yau Kwok,Yuan Yan Tang
关键词-EN: UWIE models, undergone extensive exploration, UWIE, adversarial, models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Learning-based methods for underwater image enhancement (UWIE) have undergone extensive exploration. However, learning-based models are usually vulnerable to adversarial examples, and UWIE models are no exception. To the best of our knowledge, there is no comprehensive study on the adversarial robustness of UWIE models, which indicates that UWIE models are potentially under the threat of adversarial attacks. In this paper, we propose a general adversarial attack protocol. We make a first attempt to conduct adversarial attacks on five well-designed UWIE models on three common underwater image benchmark datasets. Considering the scattering and absorption of light in the underwater environment, there exists a strong correlation between color correction and underwater image enhancement. On the basis of that, we also design two effective UWIE-oriented adversarial attack methods, Pixel Attack and Color Shift Attack, targeting different color spaces. The results show that the five models exhibit varying degrees of vulnerability to adversarial attacks, and that well-designed small perturbations on degraded images are capable of preventing UWIE models from generating enhanced results. Further, we conduct adversarial training on these models and successfully mitigate the effectiveness of adversarial attacks. In summary, we reveal the adversarial vulnerability of UWIE models and propose a new evaluation dimension for UWIE models.

[CV-98] Analyzing Tumors by Synthesis

链接: https://arxiv.org/abs/2409.06035
作者: Qi Chen,Yuxiang Lai,Xiaoxi Chen,Qixin Hu,Alan Yuille,Zongwei Zhou
关键词-EN: United States, shown great potential, Computer-aided tumor detection, scans performed annually, Computer-aided tumor
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a chapter in the Springer Book: “Generative Machine Learning Models in Medical Image Computing.”

Abstract:Computer-aided tumor detection has shown great potential in enhancing the interpretation of over 80 million CT scans performed annually in the United States. However, challenges arise due to the rarity of CT scans with tumors, especially early-stage tumors. Developing AI with real tumor data faces issues of scarcity, annotation difficulty, and low prevalence. Tumor synthesis addresses these challenges by generating numerous tumor examples in medical images, aiding AI training for tumor detection and segmentation. Successful synthesis requires realistic and generalizable synthetic tumors across various organs. This chapter reviews AI development on real and synthetic data and summarizes two key trends in synthetic data for cancer imaging research: modeling-based and learning-based approaches. Modeling-based methods, like Pixel2Cancer, simulate tumor development over time using generic rules, while learning-based methods, like DiffTumor, learn from a few annotated examples in one organ to generate synthetic tumors in others. Reader studies with expert radiologists show that synthetic tumors can be convincingly realistic. We also present case studies in the liver, pancreas, and kidneys, which reveal that AI trained on synthetic tumors can achieve performance comparable to, or better than, AI trained only on real data. Tumor synthesis holds significant promise for expanding datasets, enhancing AI reliability, improving tumor detection performance, and preserving patient privacy.

[CV-99] Pioneering Precision in Lumbar Spine MRI Segmentation with Advanced Deep Learning and Data Enhancement

链接: https://arxiv.org/abs/2409.06018
作者: Istiak Ahmed,Md. Tanzim Hossain,Md. Zahirul Islam Nahid,Kazi Shahriar Sanjid,Md. Shakib Shahariar Junayed,M. Monir Uddin,Mohammad Monirujjaman Khan
关键词-EN: addressing key challenges, deep learning techniques, focusing on addressing, study presents, presents an advanced
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:This study presents an advanced approach to lumbar spine segmentation using deep learning techniques, focusing on addressing key challenges such as class imbalance and data preprocessing. Magnetic resonance imaging (MRI) scans of patients with low back pain are meticulously preprocessed to accurately represent three critical classes: vertebrae, spinal canal, and intervertebral discs (IVDs). By rectifying class inconsistencies in the data preprocessing stage, the fidelity of the training data is ensured. The modified U-Net model incorporates architectural enhancements, including an upsample block with leaky Rectified Linear Units (ReLU) and a Glorot uniform initializer, to mitigate common issues such as the dying ReLU problem and improve stability during training. Introducing a custom combined loss function effectively tackles class imbalance, significantly improving segmentation accuracy. Evaluation using a comprehensive suite of metrics showcases the superior performance of this approach, outperforming existing methods and advancing the current techniques in lumbar spine segmentation. These findings represent a significant advance in lumbar spine MRI segmentation and diagnostic accuracy.

[CV-100] Enhancing Cross-Modality Synthesis: Subvolume Merging for MRI-to-CT Conversion

链接: https://arxiv.org/abs/2409.05982
作者: Fuxin Fan,Jingna Qiu,Yixing Huang,Andreas Maier
关键词-EN: synthetic computed tomography, tissue attenuation information, magnetic resonance imaging, therapy treatment planning, precise tissue attenuation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Providing more precise tissue attenuation information, synthetic computed tomography (sCT) generated from magnetic resonance imaging (MRI) contributes to improved radiation therapy treatment planning. In our study, we employ the advanced SwinUNETR framework for synthesizing CT from MRI images. Additionally, we introduce a three-dimensional subvolume merging technique in the prediction process. By selecting an optimal overlap percentage for adjacent subvolumes, stitching artifacts are effectively mitigated, leading to a decrease in the mean absolute error (MAE) between sCT and the labels from 52.65 HU to 47.75 HU. Furthermore, implementing a weight function with a gamma value of 0.9 results in the lowest MAE within the same overlap area. By setting the overlap percentage between 50% and 70%, we achieve a balance between image quality and computational efficiency.
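
The overlapping-subvolume merging described above can be illustrated with a simplified one-dimensional sketch: predictions are made on overlapping windows and averaged wherever windows overlap. Everything here is an assumption for illustration, including the function names, the uniform averaging weights, and the 1-D setting. The abstract does not specify the form of its gamma-parameterized weight function, so it is not reproduced.

```python
def sliding_starts(length, window, overlap):
    """Start indices of windows of size `window` that cover [0, length).

    `overlap` is the fraction of each window shared with its neighbor.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if window >= length:
        return [0]
    stride = max(1, int(round(window * (1.0 - overlap))))
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)   # make sure the tail is covered
    return starts


def merge_predictions(signal, window, overlap, predict):
    """Apply `predict` to overlapping windows and average the overlaps uniformly."""
    length = len(signal)
    window = min(window, length)
    acc = [0.0] * length      # sum of predictions per position
    weight = [0.0] * length   # number of windows covering each position
    for start in sliding_starts(length, window, overlap):
        patch = predict(signal[start:start + window])
        for k, value in enumerate(patch):
            acc[start + k] += value
            weight[start + k] += 1.0
    return [a / w for a, w in zip(acc, weight)]


# Toy "model": shifts every intensity by a constant offset.
signal = [float(i) for i in range(11)]
merged = merge_predictions(signal, window=4, overlap=0.5,
                           predict=lambda patch: [v + 100.0 for v in patch])
print(merged)
```

A larger overlap means more windows per position, which smooths seams between subvolumes at the cost of more forward passes; this is the quality versus compute trade-off the abstract refers to.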

机器学习

[LG-0] DANCE: Deep Learning-Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images

Link: https://arxiv.org/abs/2409.06694
Authors: Taslim Murad, Prakash Chourasia, Sarwan Ali, Murray Patterson
Keywords: uncontrolled cell growth, protein sequences, complex disease characterized, T-cell protein sequences, protein
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract: Cancer is a complex disease characterized by uncontrolled cell growth. T cell receptors (TCRs), crucial proteins in the immune system, play a key role in recognizing antigens, including those associated with cancer. Recent advancements in sequencing technologies have facilitated comprehensive profiling of TCR repertoires, uncovering TCRs with potent anti-cancer activity and enabling TCR-based immunotherapies. However, analyzing these intricate biomolecules necessitates efficient representations that capture their structural and functional information. T-cell protein sequences pose unique challenges due to their relatively smaller lengths compared to other biomolecules. An image-based representation approach becomes a preferred choice for efficient embeddings, allowing for the preservation of essential details and enabling comprehensive analysis of T-cell protein sequences. In this paper, we propose to generate images from the protein sequences by combining the idea of Chaos Game Representation (CGR) with the Kaleidoscopic images approach. This Deep Learning Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images (called DANCE) provides a unique way to visualize protein sequences by recursively applying chaos game rules around a central seed point. We perform the classification of T cell receptor (TCR) protein sequences in terms of their respective target cancer cells, as TCRs are known for their immune response against cancer. The TCR sequences are converted into images using the DANCE method. We employ deep-learning vision models to perform the classification to obtain insights into the relationship between the visual patterns observed in the generated kaleidoscopic images and the underlying protein properties. By combining CGR-based image generation with deep learning classification, this study opens novel possibilities in the protein analysis domain.

[LG-1] HybridFC: A Hybrid Fact-Checking Approach for Knowledge Graphs

[LG-2] Geometric-Averaged Preference Optimization for Soft Preference Labels

Link: https://arxiv.org/abs/2409.06691
Authors: Hiroki Furuta, Kuang-Huei Lee, Shixiang Shane Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
Keywords: human preferences assume, human preferences, Direct Preference Optimization, soft preference labels, algorithms for aligning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[LG-3] Liability and Insurance for Catastrophic Losses: the Nuclear Power Precedent and Lessons for AI ICML2024

[LG-4] Insuring Uninsurable Risks from AI: The State as Insurer of Last Resort ICML2024

[LG-5] DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models

Link: https://arxiv.org/abs/2409.06669
Authors: Maryam Akhavan Aghdam, Hongpeng Jin, Yanzhao Wu
Keywords: Natural Language Processing, Language Processing, recent technological advancements, Natural Language, Transformer-based
Categories: Machine Learning (cs.LG)

Abstract:Transformer-based Mixture-of-Experts (MoE) models have been driving several recent technological advancements in Natural Language Processing (NLP). These MoE models adopt a router mechanism to determine which experts to activate for routing input tokens. However, existing router mechanisms allocate a fixed number of experts to each token, which neglects the varying importance of different input tokens. In this study, we propose a novel dynamic router mechanism that Dynamically Allocates a variable number of experts for Mixture-of-Experts (DA-MoE) models based on an effective token importance measure. First, we show that the Transformer attention mechanism provides a natural and effective way of calculating token importance. Second, we propose a dynamic router mechanism that effectively decides the optimal number of experts (K) and allocates the top-K experts for each input token. Third, comprehensive experiments on several benchmark datasets demonstrate that our DA-MoE approach consistently outperforms the state-of-the-art Transformer based MoE model on the popular GLUE benchmark.
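
As a rough illustration of the idea of routing a variable number of experts per token, the sketch below derives a token importance score from an attention matrix and maps it to a per-token K. The helper names, the "mean attention received" importance measure, and the linear importance-to-K mapping are all assumptions made for illustration; the abstract does not give the paper's actual formulas.

```python
def token_importance(attn):
    """attn[i][j] is the attention weight from query token i to key token j.

    Importance of token j is taken as the mean attention it receives
    (an assumed measure, not necessarily the paper's).
    """
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def dynamic_top_k(importance, router_scores, max_k):
    """Pick the top-K experts per token, with K growing with importance.

    K is scaled linearly relative to the most important token and kept
    in [1, max_k] (hypothetical mapping).
    """
    top = max(importance)
    routes = []
    for imp, scores in zip(importance, router_scores):
        k = max(1, round(max_k * imp / top))
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        routes.append(ranked[:k])
    return routes


# Usage: three tokens, four experts.
attn = [[0.6, 0.3, 0.1],
        [0.5, 0.4, 0.1],
        [0.7, 0.2, 0.1]]
router_scores = [[0.10, 0.40, 0.30, 0.20],
                 [0.50, 0.20, 0.25, 0.05],
                 [0.05, 0.15, 0.20, 0.60]]
routes = dynamic_top_k(token_importance(attn), router_scores, max_k=4)
```

In this toy input the first token receives the most attention and is sent to all four experts, while the least-attended token is sent to a single expert.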

[LG-6] A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Link: https://arxiv.org/abs/2409.06624
Authors: Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji
Keywords: Large Language Models, unfamiliar language skill, Continual Pre-Trained, Language Mixture Ratio, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures

[LG-7] Hierarchical Multi-Label Classification with Missing Information for Benthic Habitat Imagery

[LG-8] One-Shot Imitation under Mismatched Execution

[LG-9] DemoStart: Demonstration-led auto-curriculum applied to sim-to-real with multi-fingered robots

Link: https://arxiv.org/abs/2409.06613
Authors: Maria Bauza, Jose Enrique Chen, Valentin Dalibard, Nimrod Gileadi, Roland Hafner, Murilo F. Martins, Joss Moore, Rugile Pevceviciute, Antoine Laurens, Dushyant Rao, Martina Zambelli, Martin Riedmiller, Jon Scholz, Konstantinos Bousmalis, Francesco Nori, Nicolas Heess
Keywords: three-fingered robotic hand, complex manipulation behaviors, auto-curriculum reinforcement learning, reinforcement learning method, learning method capable
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 15 pages total with 7 pages of appendix. 9 Figures, 4 in the main text and 5 in the appendix

Abstract:We present DemoStart, a novel auto-curriculum reinforcement learning method capable of learning complex manipulation behaviors on an arm equipped with a three-fingered robotic hand, from only a sparse reward and a handful of demonstrations in simulation. Learning from simulation drastically reduces the development cycle of behavior generation, and domain randomization techniques are leveraged to achieve successful zero-shot sim-to-real transfer. Transferred policies are learned directly from raw pixels from multiple cameras and robot proprioception. Our approach outperforms policies learned from demonstrations on the real robot and requires 100 times fewer demonstrations, collected in simulation. More details and videos in this https URL.

[LG-10] Label-free Monitoring of Self-Supervised Learning Progress

[LG-11] Improving the Precision of CNNs for Magnetic Resonance Spectral Modeling

[LG-12] Alleviating Hallucinations in Large Language Models with Scepticism Modeling

Link: https://arxiv.org/abs/2409.06601
Authors: Yetao Wu, Yihong Wang, Teng Chen, Chenxi Liu, Ningyuan Xi, Qingqing Gu, Hongyang Lei, Zhonglin Jiang, Yong Chen, Luo Ji
Keywords: large language models, prevents adoption, diverse fields, major challenge, challenge for large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures

[LG-13] Developing the Temporal Graph Convolutional Neural Network Model to Predict Hip Replacement using Electronic Health Records ICML

[LG-14] Learn2Aggregate: Supervised Generation of Chvatal-Gomory Cuts Using Graph Neural Networks

Link: https://arxiv.org/abs/2409.06559
Authors: Arnaud Deza, Elias B. Khalil, Zhenan Fan, Zirui Zhou, Yong Zhang
Keywords: integer linear programming, mixed integer linear, linear programming, mixed integer, integer linear
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 12 pages, 8 figures

Abstract: We present Learn2Aggregate, a machine learning (ML) framework for optimizing the generation of Chvátal-Gomory (CG) cuts in mixed integer linear programming (MILP). The framework trains a graph neural network to classify useful constraints for aggregation in CG cut generation. The ML-driven CG separator selectively focuses on a small set of impactful constraints, improving runtimes without compromising the strength of the generated cuts. Key to our approach is the formulation of a constraint classification task which favours sparse aggregation of constraints, consistent with empirical findings. This, in conjunction with a careful constraint labeling scheme and a hybrid of deep learning and feature engineering, results in enhanced CG cut generation across five diverse MILP benchmarks. On the largest test sets, our method closes roughly twice as much of the integrality gap as the standard CG method while running 40% faster. This performance improvement is due to our method eliminating 75% of the constraints prior to aggregation.
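
For readers unfamiliar with the aggregation step that this work learns to guide, here is a minimal sketch of how a single Chvátal-Gomory cut is formed. It assumes a pure integer program with constraints A x <= b and x >= 0 integer; the helper name `cg_cut` is hypothetical, and the sketch covers only the textbook rounding step, not the paper's GNN-based constraint selection.

```python
from fractions import Fraction
from math import floor


def cg_cut(A, b, u):
    """Aggregate rows of A x <= b with multipliers u >= 0, then round down.

    Returns (coeffs, rhs) for the cut  sum_j coeffs[j] * x_j <= rhs,
    which is valid for all integer x >= 0 satisfying A x <= b.
    """
    if any(ui < 0 for ui in u):
        raise ValueError("multipliers must be nonnegative")
    rows, cols = len(A), len(A[0])
    agg = [sum(Fraction(u[i]) * A[i][j] for i in range(rows)) for j in range(cols)]
    rhs = sum(Fraction(u[i]) * b[i] for i in range(rows))
    return [floor(c) for c in agg], floor(rhs)


# Usage: a sparse aggregation that uses only the first constraint.
A = [[2, 2],
     [1, 3]]
b = [3, 4]
cut = cg_cut(A, b, [Fraction(1, 2), 0])  # x1 + x2 <= floor(3/2)
```

A zero multiplier drops a constraint from the aggregation, so favouring sparse aggregation corresponds to using few nonzero entries in `u`.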

[LG-15] Dynamic Decoupling of Placid Terminal Attractor-based Gradient Descent Algorithm

Link: https://arxiv.org/abs/2409.06542
Authors: Jinwei Zhao (1), Marco Gori (2), Alessandro Betti (3), Stefano Melacci (2), Hongtao Zhang (1), Jiedong Liu (1), Xinhong Hei (1) ((1) Faculty of Computer Science and Engineering, Xi’an University of Technology, Xi’an, China; (2) Department of Information Engineering and Mathematics, University of Siena, Siena, Italy; (3) IMT Scuola Alti Studi, Lucca, Italy)
Keywords: stochastic gradient descent, Gradient descent, application domains, large number, number of application
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures

Abstract:Gradient descent (GD) and stochastic gradient descent (SGD) have been widely used in a large number of application domains. Therefore, understanding the dynamics of GD and improving its convergence speed is still of great importance. This paper carefully analyzes the dynamics of GD based on the terminal attractor at different stages of its gradient flow. On the basis of the terminal sliding mode theory and the terminal attractor theory, four adaptive learning rates are designed. Their performances are investigated in light of a detailed theoretical investigation, and the running times of the learning procedures are evaluated and compared. The total times of their learning processes are also studied in detail. To evaluate their effectiveness, various simulation results are investigated on a function approximation problem and an image classification problem.

[LG-16] MENSA: A Multi-Event Network for Survival Analysis under Informative Censoring AAAI2025

Link: https://arxiv.org/abs/2409.06525
Authors: Christian Marius Lillelund, Ali Hossein Gharari Foomani, Weijie Sun, Shi-ang Qi, Russell Greiner
Keywords: survival model predicts, multi-event survival model, multi-event survival, instance experiences, multi-event survival analysis
Categories: Machine Learning (cs.LG)
Comments: Submitted to AAAI 2025

Abstract:Given an instance, a multi-event survival model predicts the time until that instance experiences each of several different events. These events are not mutually exclusive and there are often statistical dependencies between them. There are relatively few multi-event survival results, most focusing on producing a simple risk score, rather than the time-to-event itself. To overcome these issues, we introduce MENSA, a novel, deep learning approach for multi-event survival analysis that can jointly learn representations of the input covariates and the dependence structure between events. As a practical motivation for multi-event survival analysis, we consider the problem of predicting the time until a patient with amyotrophic lateral sclerosis (ALS) loses various physical functions, i.e., the ability to speak, swallow, write, or walk. When estimating when a patient is no longer able to swallow, our approach achieves an L1-Margin loss of 278.8 days, compared to 355.2 days when modeling each event separately. In addition, we also evaluate our approach in single-event and competing risk scenarios by modeling the censoring and event distributions as equal contributing factors in the optimization process, and show that our approach performs well across multiple benchmark datasets. The source code is available at: this https URL

[LG-17] Deep Learning for Koopman Operator Estimation in Idealized Atmospheric Dynamics

Link: https://arxiv.org/abs/2409.06522
Authors: David Millard, Arielle Carr, Stéphane Gaudreault
Keywords: revolutionizing weather forecasting, Deep learning, models achieving accuracy, operational physical models, weather forecasting
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Deep learning is revolutionizing weather forecasting, with new data-driven models achieving accuracy on par with operational physical models for medium-term predictions. However, these models often lack interpretability, making their underlying dynamics difficult to understand and explain. This paper proposes methodologies to estimate the Koopman operator, providing a linear representation of complex nonlinear dynamics to enhance the transparency of data-driven models. Despite its potential, applying the Koopman operator to large-scale problems, such as atmospheric modeling, remains challenging. This study aims to identify the limitations of existing methods, refine these models to overcome various bottlenecks, and introduce novel convolutional neural network architectures that capture simplified dynamics.

[LG-18] Aligning Machine and Human Visual Representations across Abstraction Levels

[LG-19] Superior Computer Chess with Model Predictive Control Reinforcement Learning and Rollout

[LG-20] A Machine Learning Based Approach for Statistical Analysis of Detonation Cells from Soot Foils

Link: https://arxiv.org/abs/2409.06466
Authors: Vansh Sharma, Michael Ullman, Venkat Raman
Keywords: primitive edge detection, edge detection methods, detection methods prevalent, machine learning, addressing the limitations
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 12 figures, submitted to Comb. and Flame

Abstract:This study presents a novel algorithm based on machine learning (ML) for the precise segmentation and measurement of detonation cells from soot foil images, addressing the limitations of manual and primitive edge detection methods prevalent in the field. Using advances in cellular biology segmentation models, the proposed algorithm is designed to accurately extract cellular patterns without a training procedure or dataset, which is a significant challenge in detonation research. The algorithm’s performance was validated using a series of test cases that mimic experimental and numerical detonation studies. The results demonstrated consistent accuracy, with errors remaining within 10%, even in complex cases. The algorithm effectively captured key cell metrics such as cell area and span, revealing trends across different soot foil samples with uniform to highly irregular cellular structures. Although the model proved robust, challenges remain in segmenting and analyzing highly complex or irregular cellular patterns. This work highlights the broad applicability and potential of the algorithm to advance the understanding of detonation wave dynamics.

[LG-21] Ransomware Detection Using Machine Learning in the Linux Kernel

Link: https://arxiv.org/abs/2409.06452
Authors: Adrian Brodzik, Tomasz Malec-Kruszyński, Wojciech Niewolski, Mikołaj Tkaczyk, Krzysztof Bocianiak, Sok-Yen Loui
Keywords: Linux-based cloud environments, Berkeley Packet Filter, Linux-based cloud, employing various encryption, unprecedented speeds
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Linux-based cloud environments have become lucrative targets for ransomware attacks, employing various encryption schemes at unprecedented speeds. Addressing the urgency for real-time ransomware protection, we propose leveraging the extended Berkeley Packet Filter (eBPF) to collect system call information regarding active processes and infer about the data directly at the kernel level. In this study, we implement two Machine Learning (ML) models in eBPF - a decision tree and a multilayer perceptron. Benchmarking latency and accuracy against their user space counterparts, our findings underscore the efficacy of this approach.

[LG-22] HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data

Link: https://arxiv.org/abs/2409.06446
Authors: Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz
Keywords: shown great potential, GitHub Copilot, Large language models, Large language, shown great
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 24 pages, 16 tables, 8 figures

[LG-23] Extending Explainable Ensemble Trees (E2Tree) to regression contexts

Link: https://arxiv.org/abs/2409.06439
Authors: Massimo Aria, Agostino Gnasso, Carmela Iorio, Marjolein Fokkema
Keywords: multiple weak learners, offering highly accurate, highly accurate prediction, supervised learning, offering highly
Categories: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)

Abstract:Ensemble methods such as random forests have transformed the landscape of supervised learning, offering highly accurate prediction through the aggregation of multiple weak learners. However, despite their effectiveness, these methods often lack transparency, impeding users’ comprehension of how RF models arrive at their predictions. Explainable ensemble trees (E2Tree) is a novel methodology for explaining random forests, that provides a graphical representation of the relationship between response variables and predictors. A striking characteristic of E2Tree is that it not only accounts for the effects of predictor variables on the response but also accounts for associations between the predictor variables through the computation and use of dissimilarity measures. The E2Tree methodology was initially proposed for use in classification tasks. In this paper, we extend the methodology to encompass regression contexts. To demonstrate the explanatory power of the proposed algorithm, we illustrate its use on real-world datasets.

[LG-24] A Short Information-Theoretic Analysis of Linear Auto-Regressive Learning

Link: https://arxiv.org/abs/2409.06437
Authors: Ingvar Ziemann
Keywords: Gaussian maximum likelihood, linear auto-regressive models, maximum likelihood estimator, short information-theoretic proof, Gaussian maximum
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)

Abstract:In this note, we give a short information-theoretic proof of the consistency of the Gaussian maximum likelihood estimator in linear auto-regressive models. Our proof yields nearly optimal non-asymptotic rates for parameter recovery and works without any invocation of stability in the case of finite hypothesis classes.
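
To make the object of study concrete, the following sketch simulates a scalar AR(1) process and computes the Gaussian maximum likelihood estimate of its coefficient, which for Gaussian noise coincides with ordinary least squares on one-step-ahead pairs. The scalar setting, the function names, and the simulation parameters are illustrative assumptions; the note itself concerns the general linear auto-regressive case and its non-asymptotic rates.

```python
import random


def simulate_ar1(a, steps, noise_std=1.0, seed=0):
    """Generate x_{t+1} = a * x_t + w_t with Gaussian noise w_t, x_0 = 0."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(steps):
        xs.append(a * xs[-1] + rng.gauss(0.0, noise_std))
    return xs


def ar1_mle(xs):
    """Gaussian MLE of a, i.e. the least-squares fit of x_{t+1} on x_t."""
    num = sum(x * y for x, y in zip(xs[:-1], xs[1:]))
    den = sum(x * x for x in xs[:-1])
    return num / den


# Usage: recover a = 0.8 from a single trajectory.
xs = simulate_ar1(a=0.8, steps=5000)
a_hat = ar1_mle(xs)
```

Longer trajectories should bring `a_hat` closer to the true coefficient, which is the consistency property the note proves.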

[LG-25] Fine-tuning and Prompt Engineering with Cognitive Knowledge Graphs for Scholarly Knowledge Organization

Link: https://arxiv.org/abs/2409.06433
Authors: Gollam Rabby, Sören Auer, Jennifer D’Souza, Allard Oelen
Keywords: knowledge, million yearly, published scholarly articles, scholarly, raises the challenge
Categories: Digital Libraries (cs.DL); Machine Learning (cs.LG)

Abstract: The increasing number of published scholarly articles, exceeding 2.5 million yearly, makes it challenging for researchers to follow scientific progress. Integrating the contributions from scholarly articles into a novel type of cognitive knowledge graph (CKG) will be a crucial element for accessing and organizing scholarly knowledge, surpassing the insights provided by titles and abstracts. This research focuses on effectively conveying structured scholarly knowledge by utilizing large language models (LLMs) to categorize scholarly articles and describe their contributions in a structured and comparable manner. While previous studies explored language models within specific research domains, the extensive domain-independent knowledge captured by LLMs offers a substantial opportunity for generating structured contribution descriptions as CKGs. Additionally, LLMs offer customizable pathways through prompt engineering or fine-tuning, thus facilitating the use of smaller LLMs known for their efficiency, cost-effectiveness, and environmental considerations. Our methodology involves harnessing LLM knowledge and complementing it with domain expert-verified scholarly data sourced from a CKG. This strategic fusion significantly enhances LLM performance, especially in tasks like scholarly article categorization and predicate recommendation. Our method involves fine-tuning LLMs with CKG knowledge and additionally injecting knowledge from a CKG with a novel prompting technique, significantly increasing the accuracy of scholarly knowledge extraction. We integrated our approach in the Open Research Knowledge Graph (ORKG), thus enabling precise access to organized scholarly knowledge, crucially benefiting domain-independent scholarly knowledge exchange and dissemination among policymakers, industrial practitioners, and the general public.

[LG-26] GeMuCo: Generalized Multisensory Correlational Model for Body Schema Learning

[LG-27] Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes

[LG-28] Length Desensitization in Directed Preference Optimization

Link: https://arxiv.org/abs/2409.06411
Authors: Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, Xunliang Cai
Keywords: Large Language Models, align Large Language, Large Language, Human Feedback, Direct Preference Optimization
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 21 pages, 9 figures

[LG-29] Sources of Uncertainty in 3D Scene Reconstruction ALT ECCV2024

[LG-30] Symmetry Breaking in Neural Network Optimization: Insights from Input Dimension Expansion

[LG-31] One Policy to Run Them All: an End-to-end Learning Approach to Multi-Embodiment Locomotion

Link: https://arxiv.org/abs/2409.06366
Authors: Nico Bohlinger, Grzegorz Czechmanowski, Maciej Krupka, Piotr Kicki, Krzysztof Walas, Jan Peters, Davide Tateo
Keywords: Deep Reinforcement Learning, Reinforcement Learning techniques, Deep Reinforcement, Reinforcement Learning, Multi-Task Reinforcement Learning
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Deep Reinforcement Learning techniques are achieving state-of-the-art results in robust legged locomotion. While there exists a wide variety of legged platforms such as quadruped, humanoids, and hexapods, the field is still missing a single learning framework that can control all these different embodiments easily and effectively and possibly transfer, zero or few-shot, to unseen robot embodiments. We introduce URMA, the Unified Robot Morphology Architecture, to close this gap. Our framework brings the end-to-end Multi-Task Reinforcement Learning approach to the realm of legged robots, enabling the learned policy to control any type of robot morphology. The key idea of our method is to allow the network to learn an abstract locomotion controller that can be seamlessly shared between embodiments thanks to our morphology-agnostic encoders and decoders. This flexible architecture can be seen as a potential first step in building a foundation model for legged robot locomotion. Our experiments show that URMA can learn a locomotion policy on multiple embodiments that can be easily transferred to unseen robot platforms in simulation and the real world.

[LG-32] What happens to diffusion model likelihood when your model is conditional?

Link: https://arxiv.org/abs/2409.06364
Authors: Mattias Cross, Anton Ragni
Keywords: iteratively denoise random, produce high-quality data, denoise random samples, Stochastic Differential Equations, Diffusion Models
Categories: Machine Learning (cs.LG)

Abstract:Diffusion Models (DMs) iteratively denoise random samples to produce high-quality data. The iterative sampling process is derived from Stochastic Differential Equations (SDEs), allowing a speed-quality trade-off chosen at inference. Another advantage of sampling with differential equations is exact likelihood computation. These likelihoods have been used to rank unconditional DMs and for out-of-domain classification. Despite the many existing and possible uses of DM likelihoods, the distinct properties captured are unknown, especially in conditional contexts such as Text-To-Image (TTI) or Text-To-Speech synthesis (TTS). Surprisingly, we find that TTS DM likelihoods are agnostic to the text input. TTI likelihood is more expressive but cannot discern confounding prompts. Our results show that applying DMs to conditional tasks reveals inconsistencies and strengthens claims that the properties of DM likelihood are unknown. This impact sheds light on the previously unknown nature of DM likelihoods. Although conditional DMs maximise likelihood, the likelihood in question is not as sensitive to the conditioning input as one expects. This investigation provides a new point-of-view on diffusion likelihoods.

[LG-33] Connecting Concept Convexity and Human-Machine Alignment in Deep Neural Networks

[LG-34] Double Successive Over-Relaxation Q-Learning with an Extension to Deep Reinforcement Learning

Link: https://arxiv.org/abs/2409.06356
Authors: Shreyas S R
Keywords: SOR Q-learning algorithm, SOR Q-learning, double SOR Q-learning, reinforcement learning, Q-learning
Categories: Machine Learning (cs.LG)

Abstract:Q-learning is a widely used algorithm in reinforcement learning (RL), but its convergence can be slow, especially when the discount factor is close to one. Successive Over-Relaxation (SOR) Q-learning, which introduces a relaxation factor to speed up convergence, addresses this issue but has two major limitations: In the tabular setting, the relaxation parameter depends on transition probability, making it not entirely model-free, and it suffers from overestimation bias. To overcome these limitations, we propose a sample-based, model-free double SOR Q-learning algorithm. Theoretically and empirically, this algorithm is shown to be less biased than SOR Q-learning. Further, in the tabular setting, the convergence analysis under boundedness assumptions on iterates is discussed. The proposed algorithm is extended to large-scale problems using deep RL. Finally, the tabular version of the proposed algorithm is compared using roulette and grid world environments, while the deep RL version is tested on a maximization bias example and OpenAI Gym environments.

[LG-35] Improving Conditional Level Generation using Automated Validation in Match-3 Games

Link: https://arxiv.org/abs/2409.06349
Authors: Monica Villanueva Aylagas, Joakim Bergdahl, Jonas Gillberg, Alessandro Sestini, Theodor Tolstoy, Linus Gisslén
Keywords: shown great potential, Generative models, shown great, great potential, Generative
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 2 tables

Abstract: Generative models for level generation have shown great potential in game production. However, they often provide limited control over the generation, and the validity of the generated levels is unreliable. Despite this fact, only a few approaches that learn from existing data provide the users with ways of controlling the generation, simultaneously addressing the generation of unsolvable levels. This paper proposes Avalon, a novel method to improve models that learn from existing level designs using difficulty statistics extracted from gameplay. In particular, we use a conditional variational autoencoder to generate layouts for match-3 levels, conditioning the model on pre-collected statistics such as game mechanics like difficulty and relevant visual features like size and symmetry. Our method is general enough that multiple approaches could potentially be used to generate these statistics. We quantitatively evaluate our approach by comparing it to an ablated model without difficulty conditioning. Additionally, we analyze both quantitatively and qualitatively whether the style of the dataset is preserved in the generated levels. Our approach generates more valid levels than the same method without difficulty conditioning.

[LG-36] Compute-Update Federated Learning: A Lattice Coding Approach

[LG-37] LAMP: Learnable Meta-Path Guided Adversarial Contrastive Learning for Heterogeneous Graphs

[LG-38] Rate-Constrained Quantization for Communication-Efficient Federated Learning

Link: https://arxiv.org/abs/2409.06319
Authors: Shayan Mohajer Hamidi, Ali Bereyhi
Keywords: common approach, approach to mitigate, federated learning, Huffman coding, encoded gradients
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract: Quantization is a common approach to mitigate the communication cost of federated learning (FL). In practice, the quantized local parameters are further encoded via an entropy coding technique, such as Huffman coding, for efficient data compression. In this case, the exact communication overhead is determined by the bit rate of the encoded gradients. Recognizing this fact, this work deviates from the existing approaches in the literature and develops a novel quantized FL framework, called rate-constrained federated learning (RC-FED), in which the gradients are quantized subject to both fidelity and data rate constraints. We formulate this scheme as a joint optimization in which the quantization distortion is minimized while the rate of encoded gradients is kept below a target threshold. This enables a tunable trade-off between quantization distortion and communication cost. We analyze the convergence behavior of RC-FED, and show its superior performance against baseline quantized FL schemes on several datasets.
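
The distortion-versus-rate trade-off in this abstract can be illustrated with a small sketch: among candidate step sizes of a uniform quantizer, pick the one with the lowest mean squared error whose empirical entropy (a proxy for the bit rate after entropy coding) stays below a target. The grid search, the uniform quantizer, and all function names are assumptions for illustration and stand in for the paper's actual optimization, which the abstract does not detail.

```python
from collections import Counter
from math import log2


def quantize(values, step):
    """Map each value to the index of its nearest multiple of step."""
    return [round(v / step) for v in values]


def entropy_bits(symbols):
    """Empirical entropy in bits per symbol (rate proxy for entropy coding)."""
    n = len(symbols)
    return sum(-(c / n) * log2(c / n) for c in Counter(symbols).values())


def best_step(values, steps, max_rate):
    """Return (step, mse, rate) minimizing MSE subject to rate <= max_rate."""
    best = None
    for step in steps:
        q = quantize(values, step)
        rate = entropy_bits(q)
        if rate > max_rate:
            continue  # violates the rate constraint
        mse = sum((v - step * s) ** 2 for v, s in zip(values, q)) / len(values)
        if best is None or mse < best[1]:
            best = (step, mse, rate)
    return best


# Usage: toy "gradient" values and three candidate step sizes.
grads = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.0, 0.5, -0.5]
choice = best_step(grads, steps=[0.5, 1.0, 2.0], max_rate=1.6)
```

Here the finest step gives zero distortion but exceeds the rate budget, so the search falls back to the coarser step that still satisfies it; tightening `max_rate` pushes the choice toward coarser quantizers.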

[LG-39] PharmacoMatch: Efficient 3D Pharmacophore Screening through Neural Subgraph Matching

[LG-40] User Preferences for Large Language Model versus Template-Based Explanations of Movie Recommendations: A Pilot Study

[LG-41] Learning Augmentation Policies from A Model Zoo for Time Series Forecasting

Link: https://arxiv.org/abs/2409.06282
Authors: Haochen Yuan, Xuelin Li, Yunbo Wang, Xiaokang Yang
Keywords: Time series forecasting, specific patterns present, Time series, models typically rely, challenging training samples
Categories: Machine Learning (cs.LG)

Abstract:Time series forecasting models typically rely on a fixed-size training set and treat all data uniformly, which may not effectively capture the specific patterns present in more challenging training samples. To address this issue, we introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning. Our approach begins with an empirical analysis to determine which parts of the training data should be augmented. Specifically, we identify the so-called marginal samples by considering the prediction diversity across a set of pretrained forecasting models. Next, we propose using variational masked autoencoders as the augmentation model and applying the REINFORCE algorithm to transform the marginal samples into new data. The goal of this generative model is not only to mimic the distribution of real data but also to reduce the variance of prediction errors across the model zoo. By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance, advancing the prior art in this field with minimal additional computational cost.
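
One way to picture the selection of marginal samples is to score each training sample by how much a set of pretrained models disagree on it. The sketch below uses the variance of per-model errors as that diversity score and keeps the top fraction; the scoring rule, the `top_fraction` parameter, and the function name are assumptions for illustration, since the abstract does not define the exact criterion.

```python
from statistics import pvariance


def marginal_samples(errors_by_model, top_fraction=0.25):
    """Select samples on which the model zoo disagrees the most.

    errors_by_model[m][i] is the prediction error of model m on sample i.
    Returns (sorted indices of selected samples, per-sample diversity).
    """
    n = len(errors_by_model[0])
    diversity = [pvariance([errs[i] for errs in errors_by_model]) for i in range(n)]
    k = max(1, int(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: diversity[i], reverse=True)
    return sorted(ranked[:k]), diversity


# Usage: three models, four samples.
errors = [[0.10, 0.50, 0.20, 0.90],
          [0.10, 0.10, 0.25, 0.20],
          [0.12, 0.90, 0.20, 0.50]]
selected, diversity = marginal_samples(errors, top_fraction=0.5)
```

The selected indices would then be the candidates passed to an augmentation model, while samples that all models handle similarly are left untouched.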

[LG-42] Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models

[LG-43] Towards Robust Uncertainty-Aware Incomplete Multi-View Classification

[LG-44] Market Reaction to News Flows in Supply Chain Networks

Link: https://arxiv.org/abs/2409.06255
Authors: Hiroyasu Inoue, Yasuyuki Todo
Keywords: Japanese listed firms, publicly listed firms, stock prices, listed firms, Japanese listed
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:This study examines whether positive news about firms increases their stock prices and, moreover, whether it increases stock prices of the firms’ suppliers and customers, using a large sample of publicly listed firms across the world and another of Japanese listed firms. The level of positiveness of each news article is determined by FinBERT, a natural language processing model fine-tuned specifically for financial information. Supply chains of firms across the world are identified mostly by financial statements, while those of Japanese firms are taken from large-scale firm-level surveys. We find that positive news increases the change rate of stock prices of firms mentioned in the news before its disclosure, most likely because of diffusion of information through informal channels. Positive news also raises stock prices of the firms’ suppliers and customers before its disclosure, confirming propagation of market values through supply chains. In addition, we generally find a larger post-news effect on stock prices of the mentioned firms and their suppliers and customers than the pre-news effect. The positive difference between the post- and pre-news effects can be considered as the net effect of the disclosure of positive news, controlling for informal information diffusion. However, the post-news effect on suppliers and customers in Japan is smaller than the pre-news effect, a result opposite to those from firms across the world. This notable result is possibly because supply chain links of Japanese firms are stronger than global supply chains while such knowledge is restricted to selected investors.

[LG-45] DiPT: Enhancing LLM reasoning through diversified perspective-taking

[LG-46] Recurrent Neural Networks for Still Images

[LG-47] NLP-Powered Repository and Search Engine for Academic Papers: A Case Study on Cyber Risk Literature with CyLit

Link: https://arxiv.org/abs/2409.06226
Authors: Linfeng Zhang, Changyue Hu, Zhiyu Quan
Keywords: face increasing difficulties, researchers face increasing, academic literature continues, academic literature, continues to grow
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Risk Management (q-fin.RM)

[LG-48] MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding WACV2025

[LG-49] Denoising: A Powerful Building-Block for Imaging Inverse Problems and Machine Learning

[LG-50] STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

Link: https://arxiv.org/abs/2409.06211
Authors: Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He
Keywords: Large language models, reducing inference costs, Large language, sparsely activating experts, pruning
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

[LG-51] Adaptive Transformer Modelling of Density Function for Nonparametric Survival Analysis

[LG-52] MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection ICASSP2025

Link: https://arxiv.org/abs/2409.06196
Authors: Zehao Wang, Haobo Yue, Zhicheng Zhang, Da Mu, Jin Tang, Jianqin Yin
Keywords: Sound Event Detection, perceiving acoustic scenes, Heterogeneous Sound Event, Event Detection, Sound Event
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025

Abstract:Sound Event Detection (SED) plays a vital role in comprehending and perceiving acoustic scenes. Previous methods have demonstrated impressive capabilities. However, they are deficient in learning features of complex scenes from heterogeneous datasets. In this paper, we introduce a novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED). The MTDA-HSED architecture employs the Mutual-Assistance Audio Adapter (M3A) to effectively tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion (DBMF) module to tackle the multi-granularity problem. Specifically, M3A is integrated into the BEATs block as an adapter to improve the BEATs’ performance by fine-tuning it on the multi-scenario dataset. The DBMF module connects BEATs and CNN branches, which facilitates the deep fusion of information from the BEATs and the CNN branches. Experimental results show that the proposed methods exceed the baseline of mpAUC by 5% on the DESED and MAESTRO Real datasets. Code is available at this https URL.

[LG-53] Bottleneck-based Encoder-decoder ARchitecture (BEAR) for Learning Unbiased Consumer-to-Consumer Image Representations ICML ALT

[LG-54] Can Large Language Models Unlock Novel Scientific Research Ideas?

Link: https://arxiv.org/abs/2409.06185
Authors: Sandeep Kumar, Tirthankar Ghosal, Vinayak Goyal, Asif Ekbal
Keywords: future research ideas, research ideas, Artificial Intelligence, Large Language Models, future research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 24 pages, 12 figures, 6 tables

[LG-55] Loss Distillation via Gradient Matching for Point Cloud Completion with Weighted Chamfer Distance IROS

[LG-56] VE: Modeling Multivariate Time Series Correlation with Variate Embedding

Link: https://arxiv.org/abs/2409.06169
Authors: Shangjiong Wang, Zhihong Man, Zhengwei Cao, Jinchuan Zheng, Zhikang Ge
Keywords: Multivariate time series, time series forecasting, series forecasting relies, time series, relies on accurately
Categories: Machine Learning (cs.LG)

Abstract:Multivariate time series forecasting relies on accurately capturing the correlations among variates. Current channel-independent (CI) models and models with a CI final projection layer are unable to capture these dependencies. In this paper, we present the variate embedding (VE) pipeline, which learns a unique and consistent embedding for each variate and combines it with Mixture of Experts (MoE) and Low-Rank Adaptation (LoRA) techniques to enhance forecasting performance while controlling parameter size. The VE pipeline can be integrated into any model with a CI final projection layer to improve multivariate forecasting. The learned VE effectively groups variates with similar temporal patterns and separates those with low correlations. The effectiveness of the VE pipeline is demonstrated through extensive experiments on four widely-used datasets. The code is available at this https URL.

[LG-57] MCDGLN: Masked Connection-based Dynamic Graph Learning Network for Autism Spectrum Disorder

[LG-58] Causal Analysis of Shapley Values: Conditional vs. Marginal

Link: https://arxiv.org/abs/2409.06157
Authors: Ilya Rozenfeld
Keywords: explaining Machine Learning, game theoretic concept, Machine Learning, explaining Machine, theoretic concept
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 10 pages, 5 figures

Abstract:Shapley values, a game-theoretic concept, have been one of the most popular tools for explaining Machine Learning (ML) models in recent years. Unfortunately, the two most common approaches to calculating Shapley values, conditional and marginal, can lead to different results along with some undesirable side effects when features are correlated. This in turn has led to a situation in the literature where different authors provide contradictory recommendations regarding the choice of approach. In this paper we aim to resolve this controversy through the use of causal arguments. We show that the differences arise from the implicit assumptions that are made within each method to deal with missing causal information. We also demonstrate that the conditional approach is fundamentally unsound from a causal perspective. This, together with previous work in [1], leads to the conclusion that the marginal approach should be preferred over the conditional one.
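The disagreement between the two approaches under correlated features can be seen in a toy case. The sketch below is not taken from the paper: the two-point dataset, the model that reads only the first feature, and all function names are hypothetical, chosen so that the second feature is a perfect copy of the first.

```python
# Toy dataset in which feature 1 is a perfect copy of feature 0 (assumption).
data = [(0, 0), (1, 1)]


def model(x):
    """Hypothetical model that uses only feature 0."""
    return x[0]


def value(subset, x, conditional):
    """Expected model output when the features in `subset` are fixed to x."""
    rows = data
    if conditional:
        # Conditional approach: average only over rows consistent with x on `subset`.
        rows = [row for row in data if all(row[j] == x[j] for j in subset)]
    total = 0.0
    for row in rows:
        # Marginal approach: splice x into every background row, ignoring correlation.
        mixed = tuple(x[j] if j in subset else row[j] for j in range(len(x)))
        total += model(mixed)
    return total / len(rows)


def shapley_two_features(x, conditional):
    """Exact Shapley values for a two-feature input."""
    v = lambda s: value(frozenset(s), x, conditional)
    phi0 = 0.5 * ((v({0}) - v(set())) + (v({0, 1}) - v({1})))
    phi1 = 0.5 * ((v({1}) - v(set())) + (v({0, 1}) - v({0})))
    return phi0, phi1


marginal = shapley_two_features((1, 1), conditional=False)
conditional = shapley_two_features((1, 1), conditional=True)
print(marginal, conditional)
```

In this example the marginal values assign all credit to the feature the model actually uses, while the conditional values split the credit equally with the unused but correlated feature. Both attributions sum to the same total, so the choice only affects how it is divided.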

[LG-59] Configuration Interaction Guided Sampling with Interpretable Restricted Boltzmann Machine

Link: https://arxiv.org/abs/2409.06146
Authors: Jorge I. Hernandez-Martinez, Gerardo Rodriguez-Hernandez, Andres Mendez-Vazquez
Keywords: Restricted Boltzmann Machine, Boltzmann Machine, Restricted Boltzmann, solve the Schrödinger, Schrödinger equation
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: Preprint to be submitted to Computer Physics Communications

Abstract:We propose a data-driven approach using a Restricted Boltzmann Machine (RBM) to solve the Schrödinger equation in configuration space. Traditional Configuration Interaction (CI) methods, while powerful, are computationally expensive due to the large number of determinants required. Our approach leverages RBMs to efficiently identify and sample the most significant determinants, accelerating convergence and reducing computational cost. This method achieves up to 99.99% of the correlation energy with four orders of magnitude fewer determinants than full CI calculations and up to two orders of magnitude fewer than previous state-of-the-art works. Additionally, our study demonstrates that the RBM can learn the underlying quantum properties, providing more detailed insights than other methods. This innovative data-driven approach offers a promising tool for quantum chemistry, enhancing both efficiency and understanding of complex systems.

[LG-60] DECOLLAGE: 3D Detailization by Controllable Localized and Learned Geometry Enhancement ECCV2024

[LG-61] Contrastive Federated Learning with Tabular Data Silos

Link: https://arxiv.org/abs/2409.06123
Authors: Achmad Ginanjar, Xue Li, Wen Hua
Keywords: contrastive federated learning, Federated Learning, data silos, Learning, difficult task
Categories: Machine Learning (cs.LG)
Comments: 18 pages. Submitted to the Artificial Intelligence journal on Jan 29, 2024 (ARTINT-D-24-00098)

Abstract:Learning from data silos is a difficult task for organizations that need to obtain knowledge of objects that appear in multiple independent data silos. Objects in multi-organizations, such as government agents, are referred to by different identifiers, such as driver license, passport number, and tax file number. The data distributions in data silos are mostly non-IID (Independently and Identically Distributed), labelless, and vertically partitioned (i.e., having different attributes). Privacy concerns compound these issues, and such conditions inhibit enthusiasm for collaborative work. While Federated Learning (FL) has been proposed to address these issues, the difficulty of labeling, namely, label costliness, often hinders optimal model performance. A potential solution lies in contrastive learning, an unsupervised self-learning technique to represent semantic data by contrasting similar data pairs. However, contrastive learning is currently not designed to handle tabular data silos that exist within multiple organizations where data linkage by quasi identifiers is needed. To address these challenges, we propose using semi-supervised contrastive federated learning, which we refer to as Contrastive Federated Learning with Data Silos (CFL). Our approach tackles the aforementioned issues with an integrated solution. Our experimental results demonstrate that CFL outperforms current methods in addressing these challenges and provides improvements in accuracy. Additionally, we present positive results that showcase the advantages of our contrastive federated learning approach in complex client environments.

[LG-62] Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity

[LG-63] Differentiable programming across the PDE and Machine Learning barrier

Link: https://arxiv.org/abs/2409.06085
Authors: Nacime Bouziani, David A. Ham, Ado Farsi
Keywords: shown immense potential, scientific problems driven, machine learning, partial differential equations, solving scientific problems
Categories: Machine Learning (cs.LG); Mathematical Software (cs.MS); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)

Abstract:The combination of machine learning and physical laws has shown immense potential for solving scientific problems driven by partial differential equations (PDEs) with the promise of fast inference, zero-shot generalisation, and the ability to discover new physics. Examples include the use of fundamental physical laws as inductive bias to machine learning algorithms, also referred to as physics-driven machine learning, and the application of machine learning to represent features not represented in the differential equations such as closures for unresolved spatiotemporal scales. However, the simulation of complex physical systems by coupling advanced numerics for PDEs with state-of-the-art machine learning demands the composition of specialist PDE solving frameworks with industry-standard machine learning tools. Hand-rolling either the PDE solver or the neural net will not cut it. In this work, we introduce a generic differentiable programming abstraction that provides scientists and engineers with a highly productive way of specifying end-to-end differentiable models coupling machine learning and PDE-based components, while relying on code generation for high performance. Our interface automates the coupling of arbitrary PDE-based systems and machine learning models and unlocks new applications that could not hitherto be tackled, while only requiring trivial changes to existing code. Our framework has been adopted in the Firedrake finite-element library and supports the PyTorch and JAX ecosystems, as well as downstream libraries.

[LG-64] Symmetry constrained neural networks for detection and localization of damage in metal plates

Link: https://arxiv.org/abs/2409.06084
Authors: James Amarel, Christopher Rudolf, Athanasios Iliopoulos, John Michopoulos, Leslie N. Smith
Keywords: deep learning techniques, learning techniques applied, thin aluminum plate, present paper, paper is concerned
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:The present paper is concerned with deep learning techniques applied to detection and localization of damage in a thin aluminum plate. We used data generated on a tabletop apparatus by mounting to the plate four piezoelectric transducers, each of which took a turn generating a Lamb wave that then traversed the region of interest before being received by the remaining three sensors. On training a neural network to analyze time-series data of the material response, which displayed damage-reflective features whenever the plate guided waves interacted with a contact load, we achieved a model that detected with greater than 99% accuracy in addition to a model that localized with 3.14 ± 0.21 mm mean distance error and captured more than 60% of test examples within the diffraction limit. For each task, the best-performing model was designed according to the inductive bias that our transducers were both similar and arranged in a square pattern on a nearly uniform plate.

[LG-65] MTLSO: A Multi-Task Learning Approach for Logic Synthesis Optimization

[LG-66] DetoxBench: Benchmarking Large Language Models for Multitask Fraud Abuse Detection

Link: https://arxiv.org/abs/2409.06072
Authors: Joymallya Chakraborty, Wei Xia, Anirban Majumder, Dan Ma, Walid Chaabene, Naveed Janvekar
Keywords: Large language models, demonstrated remarkable capabilities, natural language processing, Large language, demonstrated remarkable
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages

[LG-67] Privacy-Preserving Data Linkage Across Private and Public Datasets for Collaborative Agriculture Research

[LG-68] Statistical Mechanics of Min-Max Problems

Link: https://arxiv.org/abs/2409.06053
Authors: Yuma Ichikawa, Koji Hukushima
Keywords: generative adversarial networks, attracted significant attention, significant attention due, saddle point problems, Min-max optimization problems
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 16 pages, 1 figure

Abstract:Min-max optimization problems, also known as saddle point problems, have attracted significant attention due to their applications in various fields, such as fair beamforming, generative adversarial networks (GANs), and adversarial learning. However, understanding the properties of these min-max problems has remained a substantial challenge. This study introduces a statistical mechanical formalism for analyzing the equilibrium values of min-max problems in the high-dimensional limit, while appropriately addressing the order of operations for min and max. As a first step, we apply this formalism to bilinear min-max games and simple GANs, deriving the relationship between the amount of training data and generalization error and indicating the optimal ratio of fake to real data for effective learning. This formalism provides a groundwork for a deeper theoretical analysis of the equilibrium properties in various machine learning methods based on min-max problems and encourages the development of new algorithms and architectures.

[LG-69] Adapting to Shifting Correlations with Unlabeled Data Calibration ECCV

Link: https://arxiv.org/abs/2409.05996
Authors: Minh Nguyen, Alan Q. Wang, Heejong Kim, Mert R. Sabuncu
Keywords: Distribution shifts, degrade model performance, prone to exploiting, unstable features, exploiting unstable correlations
Categories: Machine Learning (cs.LG)
Comments: Accepted at ECCV

Abstract:Distribution shifts between sites can seriously degrade model performance since models are prone to exploiting unstable correlations. Thus, many methods try to find features that are stable across sites and discard unstable features. However, unstable features might have complementary information that, if used appropriately, could increase accuracy. More recent methods try to adapt to unstable features at the new sites to achieve higher accuracy. However, they make unrealistic assumptions or fail to scale to multiple confounding features. We propose Generalized Prevalence Adjustment (GPA for short), a flexible method that adjusts model predictions to the shifting correlations between prediction target and confounders to safely exploit unstable features. GPA can infer the interaction between target and confounders in new sites using unlabeled samples from those sites. We evaluate GPA on several real and synthetic datasets, and show that it outperforms competitive baselines.

[LG-70] FairHome: A Fair Housing and Fair Lending Dataset

Link: https://arxiv.org/abs/2409.05990
Authors: Anusha Bagalkotkar (1), Aveek Karmakar (1), Gabriel Arnson (1), Ondrej Linda (1) ((1) Zillow Group)
Keywords: Fair Lending dataset, Fair Lending, Fair Housing, protected categories, Lending dataset
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Abstract:We present a Fair Housing and Fair Lending dataset (FairHome): A dataset with around 75,000 examples across 9 protected categories. To the best of our knowledge, FairHome is the first publicly available dataset labeled with binary labels for compliance risk in the housing domain. We demonstrate the usefulness and effectiveness of such a dataset by training a classifier and using it to detect potential violations when using a large language model (LLM) in the context of real-estate transactions. We benchmark the trained classifier against state-of-the-art LLMs including GPT-3.5, GPT-4, LLaMA-3, and Mistral Large in both zero-shot and few-shot contexts. Our classifier outperformed with an F1-score of 0.91, underscoring the effectiveness of our dataset.

[LG-71] A Comprehensive Comparison Between ANNs and KANs For Classifying EEG Alzheimers Data

[LG-72] FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations

Link: https://arxiv.org/abs/2409.05976
Authors: Ziyao Wang, Zheyu Shen, Yexiao He, Guoheng Sun, Hongyi Wang, Lingjuan Lyu, Ang Li
Keywords: Large Language Models, Language Models, Large Language, diverse downstream tasks, development of Large
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:The rapid development of Large Language Models (LLMs) has been pivotal in advancing AI, with pre-trained LLMs being adaptable to diverse downstream tasks through fine-tuning. Federated learning (FL) further enhances fine-tuning in a privacy-aware manner by utilizing clients’ local data through in-situ computation, eliminating the need for data movement. However, fine-tuning LLMs, given their massive scale of parameters, poses challenges for clients with constrained and heterogeneous resources in FL. Previous methods employed low-rank adaptation (LoRA) for efficient federated fine-tuning but utilized traditional FL aggregation strategies on LoRA adapters. These approaches led to mathematically inaccurate aggregation noise, reducing fine-tuning effectiveness and failing to address heterogeneous LoRAs. In this work, we first highlight the mathematical incorrectness of LoRA aggregation in existing federated fine-tuning methods. We introduce a new approach called FLORA that enables federated fine-tuning on heterogeneous LoRA adapters across clients through a novel stacking-based aggregation method. Our approach is noise-free and seamlessly supports heterogeneous LoRA adapters. Extensive experiments demonstrate FLORA’s superior performance in both homogeneous and heterogeneous settings, surpassing state-of-the-art methods. We envision this work as a milestone for efficient, privacy-preserving, and accurate federated fine-tuning of LLMs. Our code is available at this https URL.
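The algebra behind the aggregation issue can be sketched with small matrices. This is only an illustration of the underlying identity, not the authors' code: it assumes each client k holds a LoRA pair (B_k, A_k) whose update is the product B_k A_k, and that stacking means concatenating the B_k side by side and the weighted A_k on top of each other. All matrices, weights, and helper names are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def scale(x, s):
    return [[s * v for v in row] for row in x]


def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def hstack(mats):
    """Concatenate matrices side by side (all must have the same row count)."""
    return [[v for m in mats for v in m[i]] for i in range(len(mats[0]))]


def vstack(mats):
    """Concatenate matrices on top of each other."""
    return [row for m in mats for row in m]


# Hypothetical clients with heterogeneous ranks: rank 1 and rank 2.
clients = [
    ([[1], [2]], [[1, 0, 1]]),
    ([[1, 0], [0, 1]], [[2, 2, 0], [0, 4, 2]]),
]
weights = [0.5, 0.5]

# Reference: the weighted sum of the per-client updates B_k A_k.
target = [[0.0] * 3 for _ in range(2)]
for (b, a), w in zip(clients, weights):
    target = add(target, scale(matmul(b, a), w))

# Stacking: [B_1 B_2] times [w_1 A_1; w_2 A_2] reproduces the reference exactly.
stacked_b = hstack([b for b, _ in clients])
stacked_a = vstack([scale(a, w) for (_, a), w in zip(clients, weights)])
stacked_update = matmul(stacked_b, stacked_a)

# Averaging B and A separately (two rank-1 clients) gives a different matrix.
b1, a1 = [[1], [2]], [[1, 0, 1]]
b2, a2 = [[3], [0]], [[0, 2, 0]]
naive = matmul(scale(add(b1, b2), 0.5), scale(add(a1, a2), 0.5))
exact = scale(add(matmul(b1, a1), matmul(b2, a2)), 0.5)
print(stacked_update == target, naive == exact)
```

The stacked product matches the weighted sum of updates even though the two clients use different ranks, whereas averaging the factors separately introduces cross terms between clients, which is one way to read the "aggregation noise" mentioned in the abstract.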

[LG-73] CoDiCast: Conditional Diffusion Model for Weather Prediction with Uncertainty Quantification

Link: https://arxiv.org/abs/2409.05975
Authors: Jimeng Shi, Bowen Jin, Jiawei Han, Giri Narasimhan
Keywords: weather prediction, weather, science and society, critical for science, prediction
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Accurate weather forecasting is critical for science and society. Yet, existing methods have not managed to simultaneously have the properties of high accuracy, low uncertainty, and high computational efficiency. On one hand, to quantify the uncertainty in weather predictions, the strategy of ensemble forecast (i.e., generating a set of diverse predictions) is often employed. However, traditional ensemble numerical weather prediction (NWP) is computationally intensive. On the other hand, most existing machine learning-based weather prediction (MLWP) approaches are efficient and accurate. Nevertheless, they are deterministic and cannot capture the uncertainty of weather forecasting. In this work, we propose CoDiCast, a conditional diffusion model to generate accurate global weather prediction, while achieving uncertainty quantification with ensemble forecasts and modest computational cost. The key idea is to simulate a conditional version of the reverse denoising process in diffusion models, which starts from pure Gaussian noise to generate realistic weather scenarios for a future time point. Each denoising step is conditioned on observations from the recent past. Ensemble forecasts are achieved by repeatedly sampling from stochastic Gaussian noise to represent uncertainty quantification. CoDiCast is trained on a decade of ERA5 reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF). Experimental results demonstrate that our approach outperforms several existing data-driven methods in accuracy. Our conditional diffusion model, CoDiCast, can generate 3-day global weather forecasts, at 6-hour steps and 5.625° latitude-longitude resolution, for over 5 variables, in about 12 minutes on a commodity A100 GPU machine with 80GB memory. The open-sourced code is provided at this https URL.

[LG-74] A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets

Link: https://arxiv.org/abs/2409.05972
Authors: Mariana Yukari Noguti, Edduardo Vellasques, Luiz Eduardo Soares Oliveira
Keywords: Recent advances, text classification tasks, labelled data, modelling has significantly, significantly decreased
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-75] Predicting Electricity Consumption with Random Walks on Gaussian Processes

Link: https://arxiv.org/abs/2409.05934
Authors: Chloé Hashimoto-Cullen, Benjamin Guedj
Keywords: time-series forecasting problems, prohibitive computational cost, difficult to gather, data is scarce, time-series forecasting
Categories: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments: 6 pages

Abstract:We consider time-series forecasting problems where data is scarce, difficult to gather, or induces a prohibitive computational cost. As a first attempt, we focus on short-term electricity consumption in France, which is of strategic importance for energy suppliers and public stakeholders. The complexity of this problem and the many levels of geospatial granularity motivate the use of an ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors, they are computationally expensive to train, which calls for a frugal few-shot learning approach. By taking into account performance on GPs trained on a dataset and designing a random walk on these, we mitigate the training cost of our entire Bayesian decision-making procedure. We introduce our algorithm called Domino (ranDOM walk on gaussIaN prOcesses) and present numerical experiments to support its merits.

[LG-76] Self-Supervised State Space Model for Real-Time Traffic Accident Prediction Using eKAN Networks

Link: https://arxiv.org/abs/2409.05933
Authors: Xin Tan, Meng Zhao
Keywords: Accurate prediction, public safety, times and regions, regions is vital, vital for public
Categories: Machine Learning (cs.LG)

Abstract:Accurate prediction of traffic accidents across different times and regions is vital for public safety. However, existing methods face two key challenges: 1) Generalization: Current models rely heavily on manually constructed multi-view structures, like POI distributions and road network densities, which are labor-intensive and difficult to scale across cities. 2) Real-Time Performance: While some methods improve accuracy with complex architectures, they often incur high computational costs, limiting their real-time applicability. To address these challenges, we propose SSL-eKamba, an efficient self-supervised framework for traffic accident prediction. To enhance generalization, we design two self-supervised auxiliary tasks that adaptively improve traffic pattern representation through spatiotemporal discrepancy awareness. For real-time performance, we introduce eKamba, an efficient model that redesigns the Kolmogorov-Arnold Network (KAN) architecture. This involves using learnable univariate functions for input activation and applying a selective mechanism (Selective SSM) to capture multi-variate correlations, thereby improving computational efficiency. Extensive experiments on two real-world datasets demonstrate that SSL-eKamba consistently outperforms state-of-the-art baselines. This framework may also offer new insights for other spatiotemporal tasks. Our source code is publicly available at this http URL.

[LG-77] Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models

[LG-78] Machine Learning Based Optimal Design of Fibrillar Adhesives

Link: https://arxiv.org/abs/2409.05928
Authors: Mohammad Shojaeifard, Matteo Ferraresso, Alessandro Lucantonio, Mattia Bacca
Keywords: enhance surface adhesion, contact splitting, observed in animals, animals like beetles, relies on nanoscopic
Categories: Machine Learning (cs.LG)

Abstract:Fibrillar adhesion, observed in animals like beetles, spiders, and geckos, relies on nanoscopic or microscopic fibrils to enhance surface adhesion via ‘contact splitting.’ This concept has inspired engineering applications across robotics, transportation, and medicine. Recent studies suggest that functional grading of fibril properties can improve adhesion, but this is a complex design challenge that has only been explored in simplified geometries. While machine learning (ML) has gained traction in adhesive design, no previous attempts have targeted fibril-array scale optimization. In this study, we propose an ML-based tool that optimizes the distribution of fibril compliance to maximize adhesive strength. Our tool, featuring two deep neural networks (DNNs), recovers previous design results for simple geometries and introduces novel solutions for complex configurations. The Predictor DNN estimates adhesive strength based on random compliance distributions, while the Designer DNN optimizes compliance for maximum strength using gradient-based optimization. Our method significantly reduces test error and accelerates the optimization process, offering a high-performance solution for designing fibrillar adhesives and micro-architected materials aimed at fracture resistance by achieving equal load sharing (ELS).

[LG-79] SVFit: Parameter-Efficient Fine-Tuning of Large Pre-Trained Models Using Singular Values

Link: https://arxiv.org/abs/2409.05926
Authors: Chengwei Sun, Jiwei Wei, Yujia Wu, Yiming Shi, Shiyuan He, Zeyu Ma, Ning Xie, Yang Yang
Keywords: demonstrated exceptional performance, Large pre-trained models, computer vision tasks, Large pre-trained, demonstrated exceptional
Categories: Machine Learning (cs.LG)

Abstract:Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks. However, fully fine-tuning these models poses substantial memory challenges, particularly in resource-constrained environments. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, mitigate this issue by adjusting only a small subset of parameters. Nevertheless, these methods typically employ random initialization for low-rank matrices, which can lead to inefficiencies in gradient descent and diminished generalizability due to suboptimal starting points. To address these limitations, we propose SVFit, a novel PEFT approach that leverages singular value decomposition (SVD) to initialize low-rank matrices using critical singular values as trainable parameters. Specifically, SVFit performs SVD on the pre-trained weight matrix to obtain the best rank-r approximation matrix, emphasizing the most critical singular values that capture over 99% of the matrix’s information. These top-r singular values are then used as trainable parameters to scale the fundamental subspaces of the matrix, facilitating rapid domain adaptation. Extensive experiments across various pre-trained models in natural language understanding, text-to-image generation, and image classification tasks reveal that SVFit outperforms LoRA while requiring 16 times fewer trainable parameters.
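The rank-r construction mentioned in the abstract can be written out as follows. This is a reading of the abstract rather than the paper's own notation: it assumes that the singular vectors stay fixed while only the top-r singular values are trained, and that the "information" retained is measured by the squared Frobenius norm, which the abstract does not specify.

```latex
% SVD of the pre-trained weight matrix and its rank-r truncation.
% By the Eckart-Young theorem, W_r is the best rank-r approximation of W
% in the Frobenius norm.
W = U \Sigma V^{\top}, \qquad
W_r = U_r \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V_r^{\top}
    = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}

% One common measure of how much of W the truncation retains (assumed here):
\frac{\lVert W_r \rVert_F^{2}}{\lVert W \rVert_F^{2}}
    = \frac{\sum_{i=1}^{r} \sigma_i^{2}}{\sum_{i} \sigma_i^{2}}
```

Under this reading, only the r scalars σ_1, ..., σ_r would be updated during fine-tuning, which is consistent with the small trainable-parameter count the abstract reports.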

[LG-80] STLLM-DF: A Spatial-Temporal Large Language Model with Diffusion for Enhanced Multi-Mode Traffic System Forecasting

[LG-81] Developing an Explainable Artificial Intelligent (XAI) Model for Predicting Pile Driving Vibrations in Bangkoks Subsoil

Link: https://arxiv.org/abs/2409.05918
Authors: Sompote Youwai, Anuwat Pamungmoon
Keywords: explainable artificial intelligent, Bangkok soft clay, pile driving vibrations, pile driving, artificial intelligent
Categories: Machine Learning (cs.LG)

Abstract:This study presents an explainable artificial intelligence (XAI) model for predicting pile driving vibrations in Bangkok’s soft clay subsoil. A deep neural network was developed using a dataset of 1,018 real-world pile driving measurements, encompassing variations in pile dimensions, hammer characteristics, sensor locations, and vibration measurement axes. The model achieved a mean absolute error (MAE) of 0.276, outperforming traditional empirical methods and other machine learning approaches such as XGBoost and CatBoost. SHapley Additive exPlanations (SHAP) analysis was employed to interpret the model’s predictions, revealing complex relationships between input features and peak particle velocity (PPV). Distance from the pile driving location emerged as the most influential factor, followed by hammer weight and pile size. Non-linear relationships and threshold effects were observed, providing new insights into vibration propagation in soft clay. A web-based application was developed to facilitate adoption by practicing engineers, bridging the gap between advanced machine learning techniques and practical engineering applications. This research contributes to the field of geotechnical engineering by offering a more accurate and nuanced approach to predicting pile driving vibrations, with implications for optimizing construction practices and mitigating environmental impacts in urban areas. The model and its source code are publicly available, promoting transparency and reproducibility in geotechnical research.

[LG-82] Faster Q-Learning Algorithms for Restless Bandits

Link: https://arxiv.org/abs/2409.05908
Authors: Parvish Kakarapalli, Devendra Kayande, Rahul Meshram
Keywords: restless multi-armed bandits, Q-learning, Whittle index learning, UCB, Whittle index
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 7 pages, 3 figures, conference. arXiv admin note: substantial text overlap with arXiv:2409.04605

Abstract:We study the Whittle index learning algorithm for restless multi-armed bandits (RMAB). We first present the Q-learning algorithm and its variants: speedy Q-learning (SQL), generalized speedy Q-learning (GSQL), and phase Q-learning (PhaseQL). We also discuss two exploration policies: $\epsilon$-greedy and upper confidence bound (UCB). We extend the study of Q-learning and its variants to the UCB policy. Using a numerical example, we illustrate that Q-learning with the UCB exploration policy converges faster, and that PhaseQL with UCB has the fastest convergence rate. We next extend the study of Q-learning variants for index learning to RMAB. The index learning algorithm is a two-timescale variant of stochastic approximation: on the slower timescale we update the index learning scheme, and on the faster timescale we update Q-learning assuming a fixed index value. We study a two-timescale stochastic approximation algorithm with constant stepsizes. We describe the performance of our algorithms using a numerical example. It illustrates that index learning with Q-learning under UCB converges faster than under $\epsilon$-greedy. Further, PhaseQL (with UCB and $\epsilon$-greedy) converges faster than the other Q-learning algorithms.
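
To make the two exploration policies named in this abstract concrete, here is a minimal sketch for a tabular Q-table. It is not the authors' code: the function names, the bonus constant `c`, and the step sizes are illustrative assumptions, and the SQL, GSQL, and PhaseQL variants are not implemented.

```python
import math
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action maximizing Q(s, a) plus an exploration bonus.

    Untried actions (count 0) are selected first so every bonus is defined.
    The constant c is an assumed tuning parameter, not a value from the paper.
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )


def q_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update, applied in place to the table q."""
    target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
```

In this sketch the UCB bonus shrinks as an action's visit count grows, so exploration is directed at rarely tried actions rather than spread uniformly as in $\epsilon$-greedy.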

[LG-83] Programming Refusal with Conditional Activation Steering

Link: https://arxiv.org/abs/2409.05907
Authors: Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar
Keywords: shown remarkable capabilities, behavior remains challenging, remarkable capabilities, remains challenging, activation steering
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[LG-84] Towards Narrowing the Generalization Gap in Deep Boolean Networks

[LG-85] OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models

Link: https://arxiv.org/abs/2409.05902
Authors: Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung
Keywords: large language models, aggressive weight quantization, language models, aggressive weight, recently studied
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: 7 pages, 8 figures, accepted at DAC 2024

Abstract:To overcome the burden on the memory size and bandwidth due to ever-increasing size of large language models (LLMs), aggressive weight quantization has been recently studied, while lacking research on quantizing activations. In this paper, we present a hardware-software co-design method that results in an energy-efficient LLM accelerator, named OPAL, for generation tasks. First of all, a novel activation quantization method that leverages the microscaling data format while preserving several outliers per sub-tensor block (e.g., four out of 128 elements) is proposed. Second, on top of preserving outliers, mixed precision is utilized that sets 5-bit for inputs to sensitive layers in the decoder block of an LLM, while keeping inputs to less sensitive layers to 3-bit. Finally, we present the OPAL hardware architecture that consists of FP units for handling outliers and vectorized INT multipliers for dominant non-outlier related operations. In addition, OPAL uses log2-based approximation on softmax operations that only requires shift and subtraction to maximize power efficiency. As a result, we are able to improve the energy efficiency by 1.6~2.2x, and reduce the area by 2.4~3.1x with negligible accuracy loss, i.e., 1 perplexity increase.
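
The outlier-preserving idea described in this abstract can be sketched roughly as follows. This is not OPAL's implementation: it assumes a symmetric signed integer grid and a single shared power-of-two scale per block (in the spirit of microscaling formats), and the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math


def quantize_block(block, bits=3, num_outliers=4):
    """Quantize one block, keeping the largest-magnitude entries unquantized.

    Returns (codes, scale, outliers): integer codes for non-outlier positions
    (None at outlier positions), a shared power-of-two scale, and a dict
    mapping outlier index -> original value.
    """
    order = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
    outlier_idx = set(order[:num_outliers])
    outliers = {i: block[i] for i in outlier_idx}

    qmax = 2 ** (bits - 1) - 1  # symmetric signed range, e.g. [-3, 3] for 3 bits
    rest_max = max(
        (abs(block[i]) for i in range(len(block)) if i not in outlier_idx),
        default=0.0,
    )
    # Shared power-of-two scale, chosen so the largest remaining value fits in qmax.
    scale = 2.0 ** math.ceil(math.log2(rest_max / qmax)) if rest_max > 0 else 1.0

    codes = [
        None if i in outlier_idx else max(-qmax, min(qmax, round(block[i] / scale)))
        for i in range(len(block))
    ]
    return codes, scale, outliers


def dequantize_block(codes, scale, outliers):
    """Rebuild an approximate block from codes, the shared scale, and outliers."""
    return [outliers[i] if c is None else c * scale for i, c in enumerate(codes)]
```

Because the outliers are excluded before the scale is chosen, the shared scale is set by the typical values in the block, which is what lets the remaining elements survive a 3-bit grid in this sketch.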

[LG-86] Fast ($\sim N$) Diffusion Map Algorithm

Link: https://arxiv.org/abs/2409.05901
Authors: Julio Candanedo
Keywords: specifically for Diffusion-maps, explore parsimonious manifold, parsimonious manifold learning, work we explore, explore parsimonious
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:In this work we explore parsimonious manifold learning techniques, specifically for Diffusion-maps. We demonstrate an algorithm and its implementation with computational complexity (in both time and memory) of $\sim N$, with $N$ representing the number of samples. These techniques are essential for large-scale unsupervised learning tasks without any prior assumptions, due to sampling theorem limitations.

[LG-87] Memory-Optimized Once-For-All Network

[LG-88] Simplex-enabled Safe Continual Learning Machine

[LG-89] A Dual-Path neural network model to construct the flame nonlinear thermoacoustic response in the time domain

Link: https://arxiv.org/abs/2409.05885
Authors: Jiawei Wu, Teng Wang, Jiaqi Nan, Lijun Yang, Jingxuan Li
Keywords: Traditional numerical simulation, methods require substantial, require substantial computational, substantial computational resources, simulation methods require
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 14 figures, 1 supplementary material

Abstract:Traditional numerical simulation methods require substantial computational resources to accurately determine the complete nonlinear thermoacoustic response of flames to various perturbation frequencies and amplitudes. In this paper, we have developed deep learning algorithms that can construct a comprehensive flame nonlinear response from limited numerical simulation data. To achieve this, we propose using a frequency-sweeping data type as the training dataset, which incorporates a rich array of learnable information within a constrained dataset. To enhance the precision in learning flame nonlinear response patterns from the training data, we introduce a Dual-Path neural network. This network consists of a Chronological Feature Path and a Temporal Detail Feature Path. The Dual-Path network is specifically designed to focus intensively on the temporal characteristics of velocity perturbation sequences, yielding more accurate flame response patterns and enhanced generalization capabilities. Validations confirm that our approach can accurately model flame nonlinear responses, even under conditions of significant nonlinearity, and exhibits robust generalization capabilities across various test scenarios.

[LG-90] Integrating the Expected Future: Schedule Based Energy Forecasting

Link: https://arxiv.org/abs/2409.05884
Authors: Raffael Theiler, Olga Fink
Keywords: Power grid operators, grid operators depend, Power grid, reliable energy forecasts, aiming to minimize
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 22 pages, 8 figures and tables, journal paper

Abstract:Power grid operators depend on accurate and reliable energy forecasts, aiming to minimize cases of extreme errors, as these outliers are particularly challenging to manage during operation. Incorporating planning information – such as known data about users’ future behavior or scheduled events – has the potential to significantly enhance the accuracy and specificity of forecasts. Although there have been attempts to integrate such expected future behavior, these efforts consistently rely on conventional regression models to process this information. These models often lack the flexibility and capability to effectively incorporate both dynamic, forward-looking contextual inputs and historical data. To address this challenge, we conceptualize this combined forecasting and regression challenge as a sequence-to-sequence modeling problem and demonstrate, with three distinct models, that our contextually enhanced transformer models excel in this task. By leveraging schedule-based contextual information from the Swiss railway traction network, our proposed method significantly improved the average forecasting accuracy of nationwide railway energy consumption. Specifically, enhancing the transformer models with contextual information resulted in an average reduction of mean absolute error by 40.6%, whereas other state-of-the-art methods did not demonstrate any significant improvement.

[LG-91] CF-KAN: Kolmogorov-Arnold Network-based Collaborative Filtering to Mitigate Catastrophic Forgetting in Recommender Systems

Link: https://arxiv.org/abs/2409.05878
Authors: Jin-Duk Park, Kyung-Min Kim, Won-Yong Shin
Keywords: Collaborative filtering, remains essential, recommender systems, provide personalized recommendations, essential in recommender
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures, 4 tables

Abstract:Collaborative filtering (CF) remains essential in recommender systems, leveraging user–item interactions to provide personalized recommendations. Meanwhile, a number of CF techniques have evolved into sophisticated model architectures based on multi-layer perceptrons (MLPs). However, MLPs often suffer from catastrophic forgetting, and thus lose previously acquired knowledge when new information is learned, particularly in dynamic environments requiring continual learning. To tackle this problem, we propose CF-KAN, a new CF method utilizing Kolmogorov-Arnold networks (KANs). By learning nonlinear functions on the edge level, KANs are more robust to the catastrophic forgetting problem than MLPs. Built upon a KAN-based autoencoder, CF-KAN is designed in the sense of effectively capturing the intricacies of sparse user–item interactions and retaining information from previous data instances. Despite its simplicity, our extensive experiments demonstrate 1) CF-KAN’s superiority over state-of-the-art methods in recommendation accuracy, 2) CF-KAN’s resilience to catastrophic forgetting, underscoring its effectiveness in both static and dynamic recommendation scenarios, and 3) CF-KAN’s edge-level interpretation facilitating the explainability of recommendations.

[LG-92] CSRec: Rethinking Sequential Recommendation from A Causal Perspective

Link: https://arxiv.org/abs/2409.05872
Authors: Xiaoyu Liu, Jiaxin Yuan, Yuhang Zhou, Jingling Li, Furong Huang, Wei Ai
Keywords: users make decisions, lies in understanding, understanding how users, users make, Causal Sequential Recommendation
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The essence of sequential recommender systems (RecSys) lies in understanding how users make decisions. Most existing approaches frame the task as sequential prediction based on users’ historical purchase records. While effective in capturing users’ natural preferences, this formulation falls short in accurately modeling actual recommendation scenarios, particularly in accounting for how unsuccessful recommendations influence future purchases. Furthermore, the impact of the RecSys itself on users’ decisions has not been appropriately isolated and quantitatively analyzed. To address these challenges, we propose a novel formulation of sequential recommendation, termed Causal Sequential Recommendation (CSRec). Instead of predicting the next item in the sequence, CSRec aims to predict the probability of a recommended item’s acceptance within a sequential context and backtrack how current decisions are made. Critically, CSRec facilitates the isolation of various factors that affect users’ final decisions, especially the influence of the recommender system itself, thereby opening new avenues for the design of recommender systems. CSRec can be seamlessly integrated into existing methodologies. Experimental evaluations on both synthetic and real-world datasets demonstrate that the proposed implementation significantly improves upon state-of-the-art baselines.

[LG-93] Multi-feature Compensatory Motion Analysis for Reaching Motions Over a Discretely Sampled Workspace

Link: https://arxiv.org/abs/2409.05871
Authors: Qihan Yang, Yuri Gloumakov, Adam J. Spiers
Keywords: users’ daily activities, extremity prostheses leads, functional arm joints, compensatory motions, upper extremity prostheses
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 7 pages, 12 figures. Accepted by IEEE RAS EMBS 10th International Conference on Biomedical Robotics and Biomechatronics (BioRob 2024)

Abstract:The absence of functional arm joints, such as the wrist, in upper extremity prostheses leads to compensatory motions in the users’ daily activities. Compensatory motions have been previously studied for varying task protocols and evaluation metrics. However, the movement targets’ spatial locations in previous protocols were not standardised and incomparable between studies, and the evaluation metrics were rudimentary. This work analysed compensatory motions in the final pose of subjects reaching across a discretely sampled 7*7 2D grid of targets under unbraced (normative) and braced (compensatory) conditions. For the braced condition, a bracing system was applied to simulate a transradial prosthetic limb by restricting participants’ wrist joints. A total of 1372 reaching poses were analysed, and a Compensation Index was proposed to indicate the severity level of compensation. This index combined joint spatial location analysis, joint angle analysis, separability analysis, and machine learning (clustering) analysis. The individual analysis results and the final Compensation Index were presented in heatmap format to correspond to the spatial layout of the workspace, revealing the spatial dependency of compensatory motions. The results indicate that compensatory motions occur mainly in a right trapezoid region in the upper left area and a vertical trapezoid region in the middle left area for right-handed subjects reaching horizontally and vertically. Such results might guide motion selection in clinical rehabilitation, occupational therapy, and prosthetic evaluation to help avoid residual limb pain and overuse syndromes.

[LG-94] A General Framework for Clustering and Distribution Matching with Bandit Feedback

Link: https://arxiv.org/abs/2409.05072
Authors: Recep Can Yavas, Yuqi Huang, Vincent Y. F. Tan, Jonathan Scarlett
Keywords: arm pulls, arm, pulls, arms, bandit feedback
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 22 pages, submitted to Information Theory Transactions in September 2024

Abstract:We develop a general framework for clustering and distribution matching problems with bandit feedback. We consider a $K$-armed bandit model where some subset of $K$ arms is partitioned into $M$ groups. Within each group, the random variable associated to each arm follows the same distribution on a finite alphabet. At each time step, the decision maker pulls an arm and observes its outcome from the random variable associated to that arm. Subsequent arm pulls depend on the history of arm pulls and their outcomes. The decision maker has no knowledge of the distributions of the arms or the underlying partitions. The task is to devise an online algorithm to learn the underlying partition of arms with the least number of arm pulls on average and with an error probability not exceeding a pre-determined value $\delta$. Several existing problems fall under our general framework, including finding $M$ pairs of arms, odd arm identification, and $M$-ary clustering of $K$ arms. We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$. Furthermore, we develop a computationally-efficient online algorithm based on the Track-and-Stop method and Frank–Wolfe algorithm, and show that the average number of arm pulls of our algorithm asymptotically matches that of the lower bound. Our refined analysis also uncovers a novel bound on the speed at which the average number of arm pulls of our algorithm converges to the fundamental limit as $\delta$ vanishes.

[LG-95] Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

Link: https://arxiv.org/abs/2409.06656
Authors: Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
Keywords: unconventional objectives compared, compared to existing, Sort Loss, diarization, PIL
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

[LG-96] KANtrol: A Physics-Informed Kolmogorov-Arnold Network Framework for Solving Multi-Dimensional and Fractional Optimal Control Problems

Link: https://arxiv.org/abs/2409.06649
Authors: Alireza Afzal Aghaei
Keywords: utilizes Kolmogorov-Arnold Networks, continuous time variables, involving continuous time, Kolmogorov-Arnold Networks, problems involving continuous
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:In this paper, we introduce the KANtrol framework, which utilizes Kolmogorov-Arnold Networks (KANs) to solve optimal control problems involving continuous time variables. We explain how Gaussian quadrature can be employed to approximate the integral parts within the problem, particularly for integro-differential state equations. We also demonstrate how automatic differentiation is utilized to compute exact derivatives for integer-order dynamics, while for fractional derivatives of non-integer order, we employ matrix-vector product discretization within the KAN framework. We tackle multi-dimensional problems, including the optimal control of a 2D heat partial differential equation. The results of our simulations, which cover both forward and parameter identification problems, show that the KANtrol framework outperforms classical MLPs in terms of accuracy and efficiency.
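
As a small illustration of the Gaussian quadrature step mentioned in this abstract, the sketch below applies a 3-point Gauss-Legendre rule to an integral over $[a, b]$. It is a generic textbook rule, not the KANtrol code; the function name `gauss_legendre` and the choice of three nodes are assumptions made for brevity.

```python
import math

# 3-point Gauss-Legendre nodes and weights on the reference interval [-1, 1].
_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def gauss_legendre(f, a, b):
    """Approximate the integral of f over [a, b] with the 3-point rule."""
    half = 0.5 * (b - a)  # scales the reference interval to [a, b]
    mid = 0.5 * (a + b)   # shifts the reference interval to [a, b]
    return half * sum(w * f(mid + half * x) for x, w in zip(_NODES, _WEIGHTS))
```

The 3-point rule integrates polynomials up to degree 5 exactly, so a running-cost integral with a smooth integrand can often be approximated with only a few evaluations of the integrand.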

[LG-97] Interactive 3D Segmentation for Primary Gross Tumor Volume in Oropharyngeal Cancer

[LG-98] Advancing Causal Inference: A Nonparametric Approach to ATE and CATE Estimation with Continuous Treatments

Link: https://arxiv.org/abs/2409.06593
Authors: Hugo Gobato Souto, Francisco Louzada Neto
Keywords: Conditional Average Treatment, Bayesian Causal Forest, Average Treatment Effect, Conditional Average, Average Treatment
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:This paper introduces a generalized ps-BART model for the estimation of Average Treatment Effect (ATE) and Conditional Average Treatment Effect (CATE) in continuous treatments, addressing limitations of the Bayesian Causal Forest (BCF) model. The ps-BART model’s nonparametric nature allows for flexibility in capturing nonlinear relationships between treatment and outcome variables. Across three distinct sets of Data Generating Processes (DGPs), the ps-BART model consistently outperforms the BCF model, particularly in highly nonlinear settings. The ps-BART model’s robustness in uncertainty estimation and accuracy in both point-wise and probabilistic estimation demonstrate its utility for real-world applications. This research fills a crucial gap in causal inference literature, providing a tool better suited for nonlinear treatment-outcome relationships and opening avenues for further exploration in the domain of continuous treatment effect estimation.

[LG-99] A Primer on Variational Inference for Physics-Informed Deep Generative Modelling

Link: https://arxiv.org/abs/2409.06560
Authors: Alex Glyn-Davies, Arnaud Vadeboncoeur, O. Deniz Akyildiz, Ieva Kazlauskaite, Mark Girolami
Keywords: approximate Bayesian inference, Variational inference, Bayesian inference, computationally efficient, efficient and scalable
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Variational inference (VI) is a computationally efficient and scalable methodology for approximate Bayesian inference. It strikes a balance between accuracy of uncertainty quantification and practical tractability. It excels at generative modelling and inversion tasks due to its built-in Bayesian regularisation and flexibility, essential qualities for physics related problems. Deriving the central learning objective for VI must often be tailored to new learning tasks where the nature of the problems dictates the conditional dependence between variables of interest, such as arising in physics problems. In this paper, we provide an accessible and thorough technical introduction to VI for forward and inverse problems, guiding the reader through standard derivations of the VI framework and how it can best be realized through deep learning. We then review and unify recent literature exemplifying the creative flexibility allowed by VI. This paper is designed for a general scientific audience looking to solve physics-based problems with an emphasis on uncertainty quantification.

[LG-100] Deep Neural Networks: Multi-Classification and Universal Approximation

Link: https://arxiv.org/abs/2409.06555
Authors: Martín Hernández, Enrique Zuazua
Keywords: ensuring accurate classification, finite sample memorization, achieve finite sample, deep neural network, ReLU deep neural
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
MSC classes: 68T07, 93C10, 34H05

Abstract:We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements in $\mathbb{R}^d$, where $d\ge1$, and $M$ classes, thereby ensuring accurate classification. By modeling the neural network as a time-discrete nonlinear dynamical system, we interpret the memorization property as a problem of simultaneous or ensemble controllability. This problem is addressed by constructing the network parameters inductively and explicitly, bypassing the need for training or solving any optimization problem. Additionally, we establish that such a network can achieve universal approximation in $L^p(\Omega;\mathbb{R}_+)$, where $\Omega$ is a bounded subset of $\mathbb{R}^d$ and $p\in[1,\infty)$, using a ReLU deep neural network with a width of $d+1$. We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m\geq1$. Our proofs are constructive, offering explicit values for the biases and weights involved.

[LG-101] Modelling Global Trade with Optimal Transport

Link: https://arxiv.org/abs/2409.06554
Authors: Thomas Gaskin, Marie-Therese Wolfram, Andrew Duncan, Guven Demirel
Keywords: including tangible variables, supply and demand, including tangible, economic relations, complex mix
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Global trade is shaped by a complex mix of factors beyond supply and demand, including tangible variables like transport costs and tariffs, as well as less quantifiable influences such as political and economic relations. Traditionally, economists model trade using gravity models, which rely on explicit covariates but often struggle to capture these subtler drivers of trade. In this work, we employ optimal transport and a deep neural network to learn a time-dependent cost function from data, without imposing a specific functional form. This approach consistently outperforms traditional gravity models in accuracy while providing natural uncertainty quantification. Applying our framework to global food and agricultural trade, we show that the global South suffered disproportionately from the war in Ukraine’s impact on wheat markets. We also analyze the effects of free-trade agreements and trade disputes with China, as well as Brexit’s impact on British trade with Europe, uncovering hidden patterns that trade volumes alone cannot reveal.

[LG-102] Functionally Constrained Algorithm Solves Convex Simple Bilevel Problems

Link: https://arxiv.org/abs/2409.06530
Authors: Huaqing Zhang, Lesi Chen, Jing Xu, Jingzhao Zhang
Keywords: convex upper-level function, paper studies simple, studies simple bilevel, convex lower-level problem, simple bilevel problems
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This paper studies simple bilevel problems, where a convex upper-level function is minimized over the optimal solutions of a convex lower-level problem. We first show the fundamental difficulty of simple bilevel problems, that the approximate optimal value of such problems is not obtainable by first-order zero-respecting algorithms. Then we follow recent works to pursue the weak approximate solutions. For this goal, we propose novel near-optimal methods for smooth and nonsmooth problems by reformulating them into functionally constrained problems.

[LG-103] Limit Order Book Simulation and Trade Evaluation with K-Nearest-Neighbor Resampling

Link: https://arxiv.org/abs/2409.06514
Authors: Michael Giegrich, Roel Oomen, Christoph Reisinger
Keywords: off-policy evaluation method, evaluation method proposed, nearest neighbor, calibrate trading strategies, off-policy evaluation
Categories: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)

Abstract:In this paper, we show how $K$-nearest neighbor ($K$-NN) resampling, an off-policy evaluation method proposed in \cite{giegrich2023k}, can be applied to simulate limit order book (LOB) markets and how it can be used to evaluate and calibrate trading strategies. Using historical LOB data, we demonstrate that our simulation method is capable of recreating realistic LOB dynamics and that synthetic trading within the simulation leads to a market impact in line with the corresponding literature. Compared to other statistical LOB simulation methods, our algorithm has theoretical convergence guarantees under general conditions, does not require optimization, is easy to implement and computationally efficient. Furthermore, we show that in a benchmark comparison our method outperforms a deep learning-based algorithm for several key statistics. In the context of a LOB with pro-rata type matching, we demonstrate how our algorithm can calibrate the size of limit orders for a liquidation strategy. Finally, we describe how $K$-NN resampling can be modified for choices of higher dimensional state spaces.

[LG-104] Learning local and semi-local density functionals from exact exchange-correlation potentials and energies

Link: https://arxiv.org/abs/2409.06498
Authors: Bikash Kanungo, Jeffrey Hatch, Paul M. Zimmerman, Vikram Gavini
Keywords: Finding accurate exchange-correlation, density functional theory, Finding accurate, remains the defining, defining challenge
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Finding accurate exchange-correlation (XC) functionals remains the defining challenge in density functional theory (DFT). Despite 40 years of active development, the desired chemical accuracy is still elusive with existing functionals. We present a data-driven pathway to learn the XC functionals by utilizing the exact density, XC energy, and XC potential. While the exact densities are obtained from accurate configuration interaction (CI), the exact XC energies and XC potentials are obtained via inverse DFT calculations on the CI densities. We demonstrate how simple neural network (NN) based local density approximation (LDA) and generalized gradient approximation (GGA), trained on just five atoms and two molecules, provide remarkable improvement in total energies, densities, atomization energies, and barrier heights for hundreds of molecules outside the training set. Particularly, the NN-based GGA functional attains similar accuracy as the higher rung SCAN meta-GGA, highlighting the promise of using the XC potential in modeling XC functionals. We expect this approach to pave the way for systematic learning of increasingly accurate and sophisticated XC functionals.

[LG-105] Spectral Map for Slow Collective Variables, Markovian Dynamics, and Transition State Ensembles

Link: https://arxiv.org/abs/2409.06428
Authors: Jakub Rydzewski
Keywords: Understanding the behavior, complex molecular systems, Phys. Chem. Lett, behavior of complex, complex molecular
Categories: Chemical Physics (physics.chem-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: Accepted as part of J. Chem. Theory Comput. special issue “Machine Learning and Statistical Mechanics: Shared Synergies for Next Generation of Chemical Theory and Computation.”

Abstract:Understanding the behavior of complex molecular systems is a fundamental problem in physical chemistry. To describe the long-time dynamics of such systems, which is responsible for their most informative characteristics, we can identify a few slow collective variables (CVs) while treating the remaining fast variables as thermal noise. This enables us to simplify the dynamics and treat it as diffusion in a free-energy landscape spanned by slow CVs, effectively rendering the dynamics Markovian. Our recent statistical learning technique, spectral map [Rydzewski, J. Phys. Chem. Lett. 2023, 14, 22, 5216-5220], explores this strategy to learn slow CVs by maximizing a spectral gap of a transition matrix. In this work, we introduce several advancements into our framework, using a high-dimensional reversible folding process of a protein as an example. We implement an algorithm for coarse-graining Markov transition matrices to partition the reduced space of slow CVs kinetically and use it to define a transition state ensemble. We show that slow CVs learned by spectral map closely approach the Markovian limit for an overdamped diffusion. We demonstrate that coordinate-dependent diffusion coefficients only slightly affect the constructed free-energy landscapes. Finally, we present how spectral map can be used to quantify the importance of features and compare slow CVs with structural descriptors commonly used in protein folding. Overall, we demonstrate that a single slow CV learned by spectral map can be used as a physical reaction coordinate to capture essential characteristics of protein folding.

[LG-106] Modified Meta-Thompson Sampling for Linear Bandits and Its Bayes Regret Analysis

Link: https://arxiv.org/abs/2409.06329
Authors: Hao Li, Dong Liang, Zheng Xie
Keywords: Meta-learning is characterized, ability to learn, enabling the adaptation, adaptation of learning, learning strategies
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Meta-learning is characterized by its ability to learn how to learn, enabling the adaptation of learning strategies across different tasks. Recent research introduced Meta-Thompson Sampling (Meta-TS), which meta-learns an unknown prior distribution sampled from a meta-prior by interacting with bandit instances drawn from it. However, its analysis was limited to the Gaussian bandit. The contextual multi-armed bandit framework is an extension of the Gaussian bandit, which challenges the agent to utilize context vectors to predict the most valuable arms, optimally balancing exploration and exploitation to minimize regret over time. This paper introduces the Meta-TSLB algorithm, a modified Meta-TS for linear contextual bandits. We theoretically analyze Meta-TSLB and derive an $O\left( \left( m+\log \left( m \right) \right) \sqrt{n\log \left( n \right)} \right)$ bound on its Bayes regret, in which $m$ represents the number of bandit instances, and $n$ the number of rounds of Thompson Sampling. Additionally, our work complements the analysis of Meta-TS for linear contextual bandits. The performance of Meta-TSLB is evaluated experimentally under different settings, and we experimentally analyze the generalization capability of Meta-TSLB, showcasing its potential to adapt to unseen instances.

[LG-107] Automate Strategy Finding with LLM in Quant investment

Link: https://arxiv.org/abs/2409.06289
Authors: Zhizhuo Kou,Holam Yu,Jingshu Peng,Lei Chen
Keywords-EN: Large Language Models, high uncertainty, hindering their practical, practical application, Language Models
Categories: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Pricing of Securities (q-fin.PR)
Comments:

Abstract:Despite significant progress in deep learning for financial trading, existing models often face instability and high uncertainty, hindering their practical application. Leveraging advancements in Large Language Models (LLMs) and multi-agent architectures, we propose a novel framework for quantitative stock investment in portfolio management and alpha mining. Our framework addresses these issues by integrating LLMs to generate diversified alphas and employing a multi-agent approach to dynamically evaluate market conditions. This paper proposes a framework where large language models (LLMs) mine alpha factors from multimodal financial data, ensuring a comprehensive understanding of market dynamics. The first module extracts predictive signals by integrating numerical data, research papers, and visual charts. The second module uses ensemble learning to construct a diverse pool of trading agents with varying risk preferences, enhancing strategy performance through a broader market analysis. In the third module, a dynamic weight-gating mechanism selects and assigns weights to the most relevant agents based on real-time market conditions, enabling the creation of an adaptive and context-aware composite alpha formula. Extensive experiments on the Chinese stock markets demonstrate that this framework significantly outperforms state-of-the-art baselines across multiple financial metrics. The results underscore the efficacy of combining LLM-generated alphas with a multi-agent architecture to achieve superior trading performance and stability. This work highlights the potential of AI-driven approaches in enhancing quantitative investment strategies and sets a new benchmark for integrating advanced machine learning techniques in financial trading; the framework can also be applied to diverse markets.

[LG-108] A new paradigm for global sensitivity analysis

Link: https://arxiv.org/abs/2409.06271
Authors: Gildas Mazo (MaIAGE)
Keywords-EN: nonlinear functional ANOVA, mutually independent-and leads, global sensitivity analysis, functional ANOVA decomposition, ANOVA decomposition
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Current theory of global sensitivity analysis, based on a nonlinear functional ANOVA decomposition of the random output, is limited in scope (for instance, the analysis is limited to the output’s variance and the inputs have to be mutually independent) and leads to sensitivity indices the interpretation of which is not fully clear, especially interaction effects. Alternatively, sensitivity indices built for arbitrary user-defined importance measures have been proposed but a theory to define interactions in a systematic fashion and/or establish a decomposition of the total importance measure is still missing. It is shown that these important problems are solved all at once by adopting a new paradigm. By partitioning the inputs into those causing the change in the output and those which do not, arbitrary user-defined variability measures are identified with the outcomes of a factorial experiment at two levels, leading to all factorial effects without assuming any functional decomposition. To link various well-known sensitivity indices of the literature (Sobol indices and Shapley effects), weighted factorial effects are studied and utilized.

[LG-109] Multi-Source Music Generation with Latent Diffusion ICASSP2025

Link: https://arxiv.org/abs/2409.06190
Authors: Zhongweiyang Xu,Debottam Dutta,Yu-Lin Wei,Romit Roy Choudhury
Keywords-EN: music, Diffusion Model, Diffusion, Model, MSDM
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: ICASSP 2025 in Submission

Abstract:Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use one single diffusion model to generate consistent music sources, which are further mixed to form the music. Despite its capabilities, MSDM is unable to generate songs with rich melodies and often generates empty sounds. Also, its waveform diffusion introduces significant Gaussian noise artifacts, which compromises audio quality. In response, we introduce a multi-source latent diffusion model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source’s unique characteristics in a source latent that our diffusion model models jointly. This approach significantly enhances the total and partial generation of music by leveraging the VAE’s latent compression and noise-robustness. The compressed source latent also facilitates more efficient generation. Subjective listening tests and Frechet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than direct music mixture modeling. Codes and models are available at this https URL. Demos are available at this https URL.

[LG-110] Variational Search Distributions

Link: https://arxiv.org/abs/2409.06142
Authors: Daniel M. Steinberg,Rafael Oliveira,Cheng Soon Ong,Edwin V. Bonilla
Keywords-EN: fixed experimental budget, rare desired class, batch sequential manner, variational search distributions, develop variational search
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, Appendix material included

Abstract:We develop variational search distributions (VSD), a method for finding discrete, combinatorial designs of a rare desired class in a batch sequential manner with a fixed experimental budget. We formalize the requirements and desiderata for this problem and formulate a solution via variational inference that fulfills them. In particular, VSD uses off-the-shelf gradient-based optimization routines, and can take advantage of scalable predictive models. We show that VSD can outperform existing baseline methods on a set of real sequence-design problems in various biological systems.

[LG-111] Regression with Large Language Models for Materials and Molecular Property Prediction

Link: https://arxiv.org/abs/2409.06080
Authors: Ryan Jacobs,Maciej P. Polak,Lane E. Schultz,Hamed Mahdavi,Vasant Honavar,Dane Morgan
Keywords-EN: large language models, Language Model Meta, large language, demonstrate the ability, significant deviation
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and 24 materials properties. Only composition-based input strings are used as the model input and we fine-tune on only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results which can rival standard materials property prediction models like random forest or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of the state-of-the-art models that were trained using far more granular representations of molecules (e.g., atom types and their coordinates) for the same task. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science and other scientific domains.

[LG-112] Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting

Link: https://arxiv.org/abs/2409.05980
Authors: Gianmarco Genalti,Marco Mussi,Nicola Gatti,Marcello Restelli,Matteo Castiglioni,Alberto Maria Metelli
Keywords-EN: real-world sequential decision-making, Rested and Restless, Restless Bandits, model real-world sequential, well-known bandit settings
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Rested and Restless Bandits are two well-known bandit settings that are useful to model real-world sequential decision-making problems in which the expected reward of an arm evolves over time due to the actions we perform or due to nature. In this work, we propose Graph-Triggered Bandits (GTBs), a unifying framework to generalize and extend rested and restless bandits. In this setting, the evolution of the arms’ expected rewards is governed by a graph defined over the arms. An edge connecting a pair of arms (i,j) represents the fact that a pull of arm i triggers the evolution of arm j, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for some suitable (degenerate) graph. As relevant case studies for this setting, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, we study the optimal policies. We provide suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting the complexity of the learning problem concerning instance-dependent terms that encode specific properties of the underlying graph structure.

[LG-113] DeepFM-Crispr: Prediction of CRISPR On-Target Effects via Deep Learning ICML

[LG-114] Hierarchical novel class discovery for single-cell transcriptomic profiles

Link: https://arxiv.org/abs/2409.05937
Authors: Malek Senoussi,Thierry Artières,Paul Villoutreix
Keywords-EN: major challenges arising, single-cell transcriptomic profiles, single-cell transcriptomics experiments, major challenges, challenges arising
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 1 figure, 2 tables

Abstract:One of the major challenges arising from single-cell transcriptomics experiments is the question of how to annotate the associated single-cell transcriptomic profiles. Because of the large size and the high dimensionality of the data, automated methods for annotation are needed. We focus here on datasets obtained in the context of developmental biology, where the differentiation process leads to a hierarchical structure. We consider a frequent setting where both labeled and unlabeled data are available at training time, but the label set of the labeled data and that of the unlabeled data are disjoint. This is an instance of the Novel Class Discovery problem. The goal is to achieve two objectives: clustering the data and mapping the clusters to labels. We propose extensions of k-Means and GMM clustering methods for solving the problem and report comparative results on artificial and experimental transcriptomic datasets. Our approaches take advantage of the hierarchical nature of the data.

[LG-115] Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries

[LG-116] Property Neurons in Self-Supervised Speech Transformers

[LG-117] In-ear ECG Signal Enhancement with Denoising Convolutional Autoencoders

Link: https://arxiv.org/abs/2409.05891
Authors: Edoardo Occhipinti,Marek Zylinski,Harry J. Davies,Amir Nassibi,Matteo Bermond,Patrik Bachtiger,Nicholas S. Peters,Danilo P. Mandic
Keywords-EN: consumer wearable electronics, wearable electronics, in-ear ECG recordings, shown to propagate, common site
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 7 pages, 9 figures

Abstract:The cardiac dipole has been shown to propagate to the ears, now a common site for consumer wearable electronics, enabling the recording of electrocardiogram (ECG) signals. However, in-ear ECG recordings often suffer from significant noise due to their small amplitude and the presence of other physiological signals, such as electroencephalogram (EEG), which complicates the extraction of cardiovascular features. This study addresses this issue by developing a denoising convolutional autoencoder (DCAE) to enhance ECG information from in-ear recordings, producing cleaner ECG outputs. The model is evaluated using a dataset of in-ear ECGs and corresponding clean Lead I ECGs from 45 healthy participants. The results demonstrate a substantial improvement in signal-to-noise ratio (SNR), with a median increase of 5.9 dB. Additionally, the model significantly improved heart rate estimation accuracy, reducing the mean absolute error by almost 70% and increasing R-peak detection precision to a median value of 90%. We also trained and validated the model using a synthetic dataset, generated from real ECG signals, including abnormal cardiac morphologies, corrupted by pink noise. The results obtained show effective removal of noise sources with clinically plausible waveform reconstruction ability.
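
The SNR gain reported in the abstract above can be illustrated with a minimal sketch. It assumes the common definition of SNR as the ratio of clean-signal power to residual-error power, expressed in decibels; the abstract does not state the exact formula used, and the function name `snr_db` and the toy signals are hypothetical.

```python
import math

def snr_db(clean, estimate):
    """SNR in dB of `estimate` relative to the reference `clean` signal.

    Assumed definition (not taken from the paper):
    10 * log10(sum(clean^2) / sum((estimate - clean)^2)).
    """
    signal_power = sum(c * c for c in clean)
    noise_power = sum((e - c) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_power / noise_power)

# Toy data: a reference waveform, a noisy recording, and a denoised output.
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.5, -1.5, 1.5, -1.5]          # error of 0.5 per sample
denoised = [1.25, -1.25, 1.25, -1.25]   # error of 0.25 per sample

# SNR improvement achieved by the (hypothetical) denoiser, in dB.
improvement = snr_db(clean, denoised) - snr_db(clean, noisy)
```

Halving the error amplitude quarters the error power, so the improvement here is 10·log10(4), roughly 6 dB.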

[LG-118] Syntax-Guided Procedural Synthesis of Molecules

Link: https://arxiv.org/abs/2409.05873
Authors: Michael Sun,Alston Lo,Wenhao Gao,Minghao Guo,Veronika Thost,Jie Chen,Connor Coley,Wojciech Matusik
Keywords-EN: Designing synthetically accessible, Designing synthetically, accelerating molecular discovery, synthetically accessible molecules, synthetically accessible
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Designing synthetically accessible molecules and recommending analogs to unsynthesizable molecules are important problems for accelerating molecular discovery. We reconceptualize both problems using ideas from program synthesis. Drawing inspiration from syntax-guided synthesis approaches, we decouple the syntactic skeleton from the semantics of a synthetic tree to create a bilevel framework for reasoning about the combinatorial space of synthesis pathways. Given a molecule we aim to generate analogs for, we iteratively refine its skeletal characteristics via Markov Chain Monte Carlo simulations over the space of syntactic skeletons. Given a black-box oracle to optimize, we formulate a joint design space over syntactic templates and molecular descriptors and introduce evolutionary algorithms that optimize both syntactic and semantic dimensions synergistically. Our key insight is that once the syntactic skeleton is set, we can amortize over the search complexity of deriving the program’s semantics by training policies to fully utilize the fixed horizon Markov Decision Process imposed by the syntactic template. We demonstrate performance advantages of our bilevel framework for synthesizable analog generation and synthesizable molecule design. Notably, our approach offers the user explicit control over the resources required to perform synthesis and biases the design space towards simpler solutions, making it particularly promising for autonomous synthesis platforms.

[LG-119] Surface Flux Transport Modelling using Physics Informed Neural Networks

Link: https://arxiv.org/abs/2409.01744
Authors: Jithu J Athalathil,Bhargav Vaidya,Sayan Kundu,Vishal Upendran,Mark C. M. Cheung
Keywords-EN: turn shape space, shape space weather, magnetic flux transport, Surface Flux Transport, Flux Transport
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 21 pages, 10 figures

Abstract:Studying the magnetic field properties on the solar surface is crucial for understanding the solar and heliospheric activities, which in turn shape space weather in the solar system. Surface Flux Transport (SFT) modelling helps us to simulate and analyse the transport and evolution of magnetic flux on the solar surface, providing valuable insights into the mechanisms responsible for solar activity. In this work, we demonstrate the use of machine learning techniques in solving magnetic flux transport accurately. We have developed a novel Physics-Informed Neural Networks (PINNs)-based model to study the evolution of Bipolar Magnetic Regions (BMRs) using SFT in one dimension (azimuthally averaged) and in two dimensions. We demonstrate the efficiency and computational feasibility of our PINNs-based model by comparing its performance and accuracy with that of a numerical model implemented using the Runge-Kutta Implicit-Explicit (RK-IMEX) scheme. The mesh-independent PINNs method can be used to reproduce the observed polar magnetic field with better flux conservation. This advancement is important for accurately reproducing observed polar magnetic fields, thereby providing insights into the strength of future solar cycles. This work paves the way for more efficient and accurate simulations of solar magnetic flux transport and showcases the applicability of PINNs in solving advection-diffusion equations with a particular focus on heliophysics.

Information Retrieval

[IR-0] Critical Features Tracking on Triangulated Irregular Networks by a Scale-Space Method

Link: https://arxiv.org/abs/2409.06638
Authors: Haoan Feng,Yunting Song,Leila De Floriani
Keywords-EN: visual reasoning, input signal, well-established framework, framework that constructs, constructs a hierarchical
Categories: Information Retrieval (cs.IR)
Comments: 13 pages, ACM SIGSPATIAL 2024

Abstract:The scale-space method is a well-established framework that constructs a hierarchical representation of an input signal and facilitates coarse-to-fine visual reasoning. Considering the terrain elevation function as the input signal, the scale-space method can identify and track significant topographic features across different scales. The number of scales a feature persists, called its life span, indicates the importance of that feature. In this way, important topographic features of a landscape can be selected, which are useful for many applications, including cartography, nautical charting, and land-use planning. The scale-space methods developed for terrain data use gridded Digital Elevation Models (DEMs) to represent the terrain. However, gridded DEMs lack the flexibility to adapt to the irregular distribution of input data and the varied topological complexity of different regions. Instead, Triangulated Irregular Networks (TINs) can be directly generated from irregularly distributed point clouds and accurately preserve important features. In this work, we introduce a novel scale-space analysis pipeline for TINs, addressing the multiple challenges in extending grid-based scale-space methods to TINs. Our pipeline can efficiently identify and track topologically important features on TINs. Moreover, it is capable of analyzing terrains with irregular boundaries, which poses challenges for grid-based methods. Comprehensive experiments show that, compared to grid-based methods, our TIN-based pipeline is more efficient, accurate, and has better resolution robustness.

[IR-1] Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?

Link: https://arxiv.org/abs/2409.06464
Authors: Jimmy Lin
Keywords-EN: face a bewildering, bewildering number, retrieval today face, Practitioners working, today face
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Practitioners working on dense retrieval today face a bewildering number of choices. Beyond selecting the embedding model, another consequential choice is the actual implementation of nearest-neighbor vector search. While best practices recommend HNSW indexes, flat vector indexes with brute-force search represent another viable option, particularly for smaller corpora and for rapid prototyping. In this paper, we provide experimental results on the BEIR dataset using the open-source Lucene search library that explicate the tradeoffs between HNSW and flat indexes (including quantized variants) from the perspectives of indexing time, query evaluation performance, and retrieval quality. With additional comparisons between dense and sparse retrievers, our results provide guidance for today’s search practitioner in understanding the design space of dense and sparse retrievers. To our knowledge, we are the first to provide operational advice supported by empirical experiments in this regard.
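
To make the "flat index with brute-force search" option concrete, here is a minimal sketch in plain Python. It is not the Lucene implementation evaluated in the paper; the function names, the choice of cosine similarity, and the toy vectors are all assumptions for illustration.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def flat_search(query, index, k=2):
    """Exhaustive top-k search: score every stored vector against the query.

    `index` maps a document id to its embedding. Cost grows linearly with
    corpus size, which is the tradeoff against a graph index such as HNSW.
    """
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored)

# Toy corpus of 2-dimensional embeddings (hypothetical values).
index = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
top = flat_search([1.0, 0.1], index, k=2)
top_ids = [doc_id for _, doc_id in top]
```

A flat index returns exact nearest neighbors and needs no graph construction at indexing time, which is why it can suit small corpora and prototyping, as the abstract notes.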

[IR-2] Enhancing Sequential Recommendations through Multi-Perspective Reflections and Iteration

Link: https://arxiv.org/abs/2409.06377
Authors: Weicong Qin,Yi Xu,Weijie Yu,Chenglei Shen,Xiao Zhang,Ming He,Jianping Fan,Jun Xu
Keywords-EN: understanding user intentions, Sequence recommendation, collaborative filtering information, aims to predict, leveraging collaborative filtering
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: First 3 authors contribute equally to this work

[IR-3] HierLLM: Hierarchical Large Language Model for Question Recommendation

Link: https://arxiv.org/abs/2409.06177
Authors: Yuxuan Liu,Haipeng Liu,Ting Long
Keywords-EN: sequentially recommends questions, Question, learning history, learning, Question recommendation
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Question recommendation is a task that sequentially recommends questions for students to enhance their learning efficiency. That is, given the learning history and learning target of a student, a question recommender is supposed to select the question that will bring the most improvement for the student. Previous methods typically model question recommendation as a sequential decision-making problem, estimating students’ learning state from the learning history, and feeding the learning state together with the learning target to a neural network to select the recommended question from a question set. However, previous methods face two challenges: (1) learning history is unavailable in the cold-start scenario, which makes the recommender generate inappropriate recommendations; (2) the question set is very large, which makes it difficult for the recommender to select the best question precisely. To address these challenges, we propose a method called hierarchical large language model for question recommendation (HierLLM), which is an LLM-based hierarchical structure. The LLM-based structure enables HierLLM to tackle the cold-start issue with the strong reasoning abilities of LLMs. The hierarchical structure takes advantage of the fact that the number of concepts is significantly smaller than the number of questions, narrowing the range of selectable questions by first identifying the relevant concept for the to-recommend question, and then selecting the recommended question based on that concept. This hierarchical structure reduces the difficulty of the this http URL. To investigate the performance of HierLLM, we conduct extensive experiments, and the results demonstrate the outstanding performance of HierLLM.

[IR-4] What makes a good concept anyway ?

Link: https://arxiv.org/abs/2409.06150
Authors: Naren Khatwani,James Geller
Keywords-EN: completely and correctly, expected to cover, cover its domain, domain completely, hard
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:A good medical ontology is expected to cover its domain completely and correctly. On the other hand, large ontologies are hard to build, hard to understand, and hard to maintain. Thus, adding new concepts (often multi-word concepts) to an existing ontology must be done judiciously. Only “good” concepts should be added; however, it is difficult to define what makes a concept good. In this research, we propose a metric to measure the goodness of a concept. We identified factors that appear to influence goodness judgments of medical experts and combined them into a single metric. These factors include concept name length (in words), concept occurrence frequency in the medical literature, and syntactic categories of component words. As an added factor, we used the simplicity of a term after mapping it into a specific foreign language. We performed Bayesian optimization of factor weights to achieve maximum agreement between the metric and three medical experts. The results showed that our metric had a 50.67% overall agreement with the experts, as measured by Krippendorff’s alpha.

[IR-5] Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

[IR-6] Assessing SPARQL capabilities of Large Language Models

Link: https://arxiv.org/abs/2409.05925
Authors: Lars-Peter Meyer,Johannes Frey,Felix Brei,Natanael Arndt
Keywords-EN: Large Language Models, offers significant synergistic, significant synergistic potential, SPARQL SELECT queries, Large Language
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: peer-reviewed publication at NLP4KGc @ Semantics 2024, see this https URL

[IR-7] LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

Link: https://arxiv.org/abs/2409.05882
Authors: Hrishikesh Kulkarni,Nazli Goharian,Ophir Frieder,Sean MacAvaney
Keywords-EN: surface form, dense retrieval, dense, retrieval, methods
Categories: Information Retrieval (cs.IR)
Comments: ACM DocEng 2024

Abstract:Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. On the other hand, dense retrieval methods are based on learned dense vectors and, consequently, are effective but comparatively slow. Since sparse and dense methods approach problems differently and use complementary relevance signals, approximation methods were proposed to balance effectiveness and efficiency. For efficiency, approximation methods like HNSW are frequently used to approximate exhaustive dense retrieval. However, approximation techniques still exhibit considerably higher latency than sparse approaches. We propose LexBoost that first builds a network of dense neighbors (a corpus graph) using a dense retrieval approach while indexing. Then, during retrieval, we consider both a document’s lexical relevance scores and its neighbors’ scores to rank the documents. In LexBoost this remarkably simple application of the Cluster Hypothesis contributes to stronger ranking effectiveness while contributing little computational overhead (since the corpus graph is constructed offline). The method is robust across the number of neighbors considered, various fusion parameters for determining the scores, and different dataset construction methods. We also show that re-ranking on top of LexBoost outperforms traditional dense re-ranking and leads to results comparable with higher-latency exhaustive dense retrieval.
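
The idea of ranking with both a document's lexical score and its neighbors' scores can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the fusion formula, so the linear interpolation, the weight `lam`, the zero score for neighbors that did not match the query, and all names here are hypothetical.

```python
from statistics import mean

def fuse_with_neighbors(lexical, neighbors, lam=0.7):
    """Combine each document's lexical score with its neighbors' mean score.

    lexical:   doc_id -> lexical (e.g. BM25) score for the current query
    neighbors: doc_id -> list of neighbor doc_ids from a corpus graph
               that was built offline with a dense retriever
    lam:       interpolation weight (an assumed fusion scheme)
    """
    fused = {}
    for doc, score in lexical.items():
        # Neighbors without a lexical match contribute a score of 0.
        nbr_scores = [lexical.get(n, 0.0) for n in neighbors.get(doc, [])]
        # A document with no neighbors keeps its own lexical score.
        nbr_mean = mean(nbr_scores) if nbr_scores else score
        fused[doc] = lam * score + (1.0 - lam) * nbr_mean
    return fused

def rank(fused):
    """Document ids sorted by fused score, best first."""
    return sorted(fused, key=fused.get, reverse=True)

# Toy query-time scores and a toy corpus graph (hypothetical values).
lexical = {"d1": 2.0, "d2": 1.8, "d3": 0.5}
neighbors = {"d1": ["d3"], "d2": ["d1"], "d3": ["d1", "d2"]}
fused = fuse_with_neighbors(lexical, neighbors, lam=0.5)
ranking = rank(fused)
```

In this toy case d2 overtakes d1 because its only neighbor scores highly, while d1 is pulled down by a weak neighbor. Query-time work is limited to dictionary lookups because the neighbor lists are precomputed.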

[IR-8] CF-KAN: Kolmogorov-Arnold Network-based Collaborative Filtering to Mitigate Catastrophic Forgetting in Recommender Systems

[IR-9] CSRec: Rethinking Sequential Recommendation from A Causal Perspective

[IR-10] FairEvalLLM. A Comprehensive Framework for Benchmarking Fairness in Large Language Model Recommender Systems

Attachment Download

Click to download today's full paper list
