This post presents the latest paper listing fetched from Arxiv.org on 2024-09-30. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org each day and updated automatically around 11:00 AM.

Table of Contents

Overview (2024-09-30)

426 papers were updated today, including:

  • Natural Language Processing: 53 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 116 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 105 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 137 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] LML: Language Model Learning a Dataset for Data-Augmented Prediction

[Quick Read]: This paper targets the heavy dependence of traditional machine-learning classifiers on data cleaning and feature engineering, and proposes a new LLM-based approach called "Language Model Learning (LML)", whose core is "Data-Augmented Prediction (DAP)". DAP uses a data summary and relevant data rows to automatically generate a query, and the LLM then performs the classification, mimicking the way a human would manually explore and understand the data, which keeps accuracy high even on complex data. The method adds the phrase "Act as an Explainable Machine Learning Model" to the prompt to improve the interpretability of predictions, and it reaches over 90% accuracy in some test cases, suggesting it can outperform conventional ML models.

Link: https://arxiv.org/abs/2409.18957
Authors: Praneeth Vadlapati
Keywords-EN: Large Language Models, Language Model Learning, Large Language, Machine Learning Model, Explainable Machine Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: First version


Abstract:This paper introduces a new approach to using Large Language Models (LLMs) for classification tasks, which are typically handled using Machine Learning (ML) models. Unlike ML models that rely heavily on data cleaning and feature engineering, this method streamlines the process using LLMs. This paper proposes a new concept called “Language Model Learning (LML)” powered by a new method called “Data-Augmented Prediction (DAP)”. The classification is performed by LLMs using a method similar to humans manually exploring and understanding the data and deciding classifications using data as a reference. Training data is summarized and evaluated to determine the features that lead to the classification of each label the most. In the process of DAP, the system uses the data summary to automatically create a query, which is used to retrieve relevant rows from the dataset. A classification is generated by the LLM using data summary and relevant rows, ensuring satisfactory accuracy even with complex data. Usage of data summary and similar data in DAP ensures context-aware decision-making. The proposed method uses the words “Act as an Explainable Machine Learning Model” in the prompt to enhance the interpretability of the predictions by allowing users to review the logic behind each prediction. In some test cases, the system scored an accuracy above 90%, proving the effectiveness of the system and its potential to outperform conventional ML models in various scenarios. The code is available at this https URL
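
As a rough illustration of the DAP loop described above, the sketch below summarizes a toy dataset, retrieves relevant rows by word overlap, and assembles the prompt. The summary format, retrieval rule, and prompt layout are assumptions for the demo, not the paper's implementation, and the LLM call itself is omitted.

```python
# Minimal sketch of Data-Augmented Prediction (DAP) on a toy tabular dataset.
# All helper names and formats here are illustrative assumptions.

def summarize(rows, label_key="label"):
    """Count feature words per label as a crude 'data summary'."""
    summary = {}
    for row in rows:
        for word in row["text"].split():
            summary.setdefault(row[label_key], {}).setdefault(word, 0)
            summary[row[label_key]][word] += 1
    return summary

def retrieve(rows, query, k=2):
    """Rank rows by word overlap with the query (stand-in for real retrieval)."""
    q = set(query.split())
    scored = sorted(rows, key=lambda r: -len(q & set(r["text"].split())))
    return scored[:k]

def build_prompt(summary, relevant_rows, sample):
    """Assemble the DAP prompt; the actual LLM call is omitted in this sketch."""
    lines = ["Act as an Explainable Machine Learning Model.",
             f"Data summary: {summary}",
             "Relevant rows:"]
    lines += [f"- {r['text']} -> {r['label']}" for r in relevant_rows]
    lines.append(f"Classify: {sample}")
    return "\n".join(lines)

rows = [{"text": "great fun movie", "label": "pos"},
        {"text": "boring slow plot", "label": "neg"},
        {"text": "fun and touching", "label": "pos"}]
prompt = build_prompt(summarize(rows), retrieve(rows, "fun movie"), "a fun ride")
```

The prompt string would then be sent to an LLM, which returns the predicted label together with its reasoning.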

[NLP-1] Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models

[Quick Read]: This paper addresses the difficulty large language models have in generating responses of a user-specified length. The key is a new method named Ruler, which introduces Meta Length Tokens (MLTs) to strengthen instruction following under length-constrained instructions. Ruler not only lets a model generate responses of the length specified in the instruction, but also automatically produces an appropriate MLT when no explicit length constraint is given, demonstrating strong versatility and generalization. Experiments show that Ruler improves performance on the Target Length Generation Task across different large language models.

Link: https://arxiv.org/abs/2409.18943
Authors: Jiaming Li, Lei Zhang, Yunshui Li, Ziqiang Liu, Yuelin Bai, Run Luo, Longze Chen, Min Yang
Keywords-EN: large language models, large language, Length Generation Task, models enables humans, language models enables
Categories: Computation and Language (cs.CL)
Comments:


Abstract:The instruction-following ability of large language models enables humans to interact with AI agents in a natural way. However, when required to generate responses of a specific length, large language models often struggle to meet users’ needs due to their inherent difficulty in accurately perceiving numerical constraints. To explore the ability of large language models to control the length of generated responses, we propose the Target Length Generation Task (TLG) and design two metrics, Precise Match (PM) and Flexible Match (FM) to evaluate the model’s performance in adhering to specified response lengths. Furthermore, we introduce a novel, model-agnostic approach called Ruler, which employs Meta Length Tokens (MLTs) to enhance the instruction-following ability of large language models under length-constrained instructions. Specifically, Ruler equips LLMs with the ability to generate responses of a specified length based on length constraints within the instructions. Moreover, Ruler can automatically generate appropriate MLT when length constraints are not explicitly provided, demonstrating excellent versatility and generalization. Comprehensive experiments show the effectiveness of Ruler across different LLMs on Target Length Generation Task, e.g., at All Level 27.97 average gain on PM, 29.57 average gain on FM. In addition, we conduct extensive ablation experiments to further substantiate the efficacy and generalization of Ruler. Our code and data is available at this https URL.
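
A minimal sketch of the two ideas above: prefixing an instruction with a Meta Length Token, and scoring outputs with Precise Match and Flexible Match. The `[MLT:n]` format, word-level length counting, and the 10% FM tolerance are illustrative assumptions, not the paper's exact definitions.

```python
# Toy versions of MLT-prefixed prompting and the PM/FM metrics.
# Token format and tolerance are assumptions for the demo.

def add_mlt(instruction, target_len):
    """Prefix the instruction with a Meta Length Token carrying the target length."""
    return f"[MLT:{target_len}] {instruction}"

def precise_match(response, target_len):
    """PM: the response length must hit the target exactly (words, here)."""
    return len(response.split()) == target_len

def flexible_match(response, target_len, tol=0.1):
    """FM: the response length may deviate by a relative tolerance."""
    n = len(response.split())
    return abs(n - target_len) <= tol * target_len

prompt = add_mlt("Summarize the paper.", 50)
```

In the real method, the MLT is a learned special token rather than a literal string, but the interface is the same: the length constraint travels with the instruction.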

[NLP-2] AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow

[Quick Read]: This paper tackles the effectiveness and trustworthiness of simulated patient systems in medical education and research. The key is the AIPatient system, which takes the AIPatient Knowledge Graph (AIPatient KG) as input and uses a Reasoning Retrieval-Augmented Generation (Reasoning RAG) agentic workflow as its generation core. AIPatient KG samples Electronic Health Records (EHRs) from the MIMIC-III database to build a clinically diverse and relevant cohort of 1,495 patients with high knowledge-base validity (F1 0.89). Reasoning RAG employs six LLM-powered agents covering retrieval, knowledge-graph query generation, abstraction, checking, rewriting, and summarization, reaching up to 94.15% accuracy on EHR-based medical question answering (QA) and outperforming baselines with no or only partial agent integration. The system also shows high readability, robustness, and stability, suggesting broad applicability in medical education, model evaluation, and system integration.

Link: https://arxiv.org/abs/2409.18924
Authors: Huizi Yu, Jiayan Zhou, Lingyao Li, Shan Chen, Jack Gallifant, Anye Shi, Xiang Li, Wenyue Hua, Mingyu Jin, Guang Chen, Yang Zhou, Zhao Li, Trisha Gupte, Ming-Li Chen, Zahra Azizi, Yongfeng Zhang, Themistocles L. Assimes, Xin Ma, Danielle S. Bitterman, Lin Lu, Lizhou Fan
Keywords-EN: integrative learning environments, clinical decision-making simulations, enabling clinical decision-making, Simulated patient systems, Simulated patient
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 42 pages, 6 figures, 7 tables


Abstract:Simulated patient systems play a crucial role in modern medical education and research, providing safe, integrative learning environments and enabling clinical decision-making simulations. Large Language Models (LLM) could advance simulated patient systems by replicating medical conditions and patient-doctor interactions with high fidelity and low cost. However, ensuring the effectiveness and trustworthiness of these systems remains a challenge, as they require a large, diverse, and precise patient knowledgebase, along with a robust and stable knowledge diffusion to users. Here, we developed AIPatient, an advanced simulated patient system with AIPatient Knowledge Graph (AIPatient KG) as the input and the Reasoning Retrieval-Augmented Generation (Reasoning RAG) agentic workflow as the generation backbone. AIPatient KG samples data from Electronic Health Records (EHRs) in the Medical Information Mart for Intensive Care (MIMIC)-III database, producing a clinically diverse and relevant cohort of 1,495 patients with high knowledgebase validity (F1 0.89). Reasoning RAG leverages six LLM powered agents spanning tasks including retrieval, KG query generation, abstraction, checker, rewrite, and summarization. This agentic framework reaches an overall accuracy of 94.15% in EHR-based medical Question Answering (QA), outperforming benchmarks that use either no agent or only partial agent integration. Our system also presents high readability (median Flesch Reading Ease 77.23; median Flesch Kincaid Grade 5.6), robustness (ANOVA F-value 0.6126, p > 0.1), and stability (ANOVA F-value 0.782, p > 0.1). The promising performance of the AIPatient system highlights its potential to support a wide range of applications, including medical education, model evaluation, and system integration.
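
The six-stage agentic workflow can be sketched as a pipeline that threads a state dictionary through named stages, with a checker gating whether the draft survives the rewrite step. The stage logic below is stubbed for illustration; real versions of each stage would call an LLM.

```python
# Toy sketch of an agentic pipeline in the spirit of Reasoning RAG.
# Stage bodies are stand-ins, not the paper's agents.

def run_pipeline(question, stages):
    state = {"question": question, "trace": []}
    for name, fn in stages:
        state = fn(state)
        state["trace"].append(name)  # record which agents ran, in order
    return state

stages = [
    ("retrieval",   lambda s: {**s, "rows": ["hr 72", "bp 120/80"]}),
    ("kg_query",    lambda s: {**s, "query": f"MATCH vitals FOR {s['question']}"}),
    ("abstraction", lambda s: {**s, "draft": "Vitals are within normal range."}),
    ("checker",     lambda s: {**s, "ok": "normal" in s["draft"]}),
    ("rewrite",     lambda s: s if s["ok"] else {**s, "draft": "Unverified."}),
    ("summarize",   lambda s: {**s, "answer": s["draft"]}),
]
result = run_pipeline("What are the patient's vitals?", stages)
```

The design point is that each agent only reads and writes the shared state, so agents can be swapped or re-ordered without changing the runner.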

[NLP-3] Soft Measures for Extracting Causal Collective Intelligence EMNLP2024

[Quick Read]: This paper addresses the problem of automatically extracting high-integrity fuzzy cognitive maps (FCMs) from text, a key step toward understanding and modeling the collective intelligence needed for complex social systems. The key is to use large language models (LLMs) to automate FCM extraction and to introduce new graph-based similarity measures, evaluated by correlating their outputs with human judgments through an Elo rating system. Although the results correlate positively with human evaluations, existing measures still fall short of capturing FCM nuances, underscoring the need for soft similarity measures tailored to FCM extraction.

Link: https://arxiv.org/abs/2409.18911
Authors: Maryam Berijanian, Spencer Dork, Kuldeep Singh, Michael Riley Millikan, Ashlin Riggs, Aadarsh Swaminathan, Sarah L. Gibbs, Scott E. Friedman, Nathan Brugnone
Keywords-EN: addressing complex social, complex social systems, essential for addressing, addressing complex, complex social
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: Camera-ready version accepted for publication in the EMNLP 2024 Workshop NLP4Science


Abstract:Understanding and modeling collective intelligence is essential for addressing complex social systems. Directed graphs called fuzzy cognitive maps (FCMs) offer a powerful tool for encoding causal mental models, but extracting high-integrity FCMs from text is challenging. This study presents an approach using large language models (LLMs) to automate FCM extraction. We introduce novel graph-based similarity measures and evaluate them by correlating their outputs with human judgments through the Elo rating system. Results show positive correlations with human evaluations, but even the best-performing measure exhibits limitations in capturing FCM nuances. Fine-tuning LLMs improves performance, but existing measures still fall short. This study highlights the need for soft similarity measures tailored to FCM extraction, advancing collective intelligence modeling with NLP.
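
The Elo mechanism used above can be sketched in a few lines: each pairwise human preference between two candidate extractions updates two scalar ratings. The K-factor of 32 and the 1000 starting rating below are conventional chess defaults, not values taken from the paper.

```python
# Standard Elo update for one pairwise comparison between candidates A and B.

def elo_update(r_a, r_b, a_wins, k=32):
    """Return updated ratings after one A-vs-B comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's expected win probability
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

a, b = elo_update(1000, 1000, True)  # A beats B from equal ratings
```

Because the update is zero-sum, total rating mass is conserved, which makes the final scores comparable across many noisy pairwise judgments.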

[NLP-4] IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation NEURIPS2024

[Quick Read]: This paper addresses how to keep LLM evaluation datasets up to date and discriminative, so that they continually reflect improving model capabilities and can effectively separate the performance of different models. The key is to borrow Item Discrimination (ID) theory and propose an ID-induced prompt-synthesis framework that generates prompts that are both broad and specific, evaluating LLM capabilities comprehensively. A self-correction mechanism and two models that predict prompt discrimination and difficulty scores help produce high-quality, challenging, and discriminative evaluation data.

Link: https://arxiv.org/abs/2409.18892
Authors: Fan Lin, Shuyi Xie, Yong Dai, Wenlin Yao, Tianjiao Lang, Zishan Xu, Zhichao Hu, Xiao Xiao, Yuhong Liu, Yu Zhang
Keywords-EN: Large Language Models, Large Language, grow increasingly adept, managing complex tasks, remains sufficiently discriminative
Categories: Computation and Language (cs.CL)
Comments: NeurIPS 2024


Abstract:As Large Language Models (LLMs) grow increasingly adept at managing complex tasks, the evaluation set must keep pace with these advancements to ensure it remains sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs to ensure the evaluation set can continually update and refine according to model abilities. Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, allowing for effective discrimination of their relative strengths and weaknesses across various tasks and domains. To produce high-quality data, we incorporate a self-correct mechanism into our generalization framework, and develop two models to predict prompt discrimination and difficulty score to facilitate our data synthesis framework, contributing valuable tools to evaluation data synthesis research. We apply our generated data to evaluate five SOTA models. Our data achieves an average score of 51.92, accompanied by a variance of 10.06. By contrast, previous works (i.e., SELF-INSTRUCT and WizardLM) obtain an average score exceeding 67, with a variance below 3.2. The results demonstrate that the data generated by our framework is more challenging and discriminative compared to previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research of LLMs.
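
For context, the classical item-discrimination index from educational testing compares an item's pass rate between high and low scorers. The sketch below uses the conventional 27% extreme-group split, which is an assumption for illustration rather than the paper's exact formulation.

```python
# Classic item-discrimination index: pass rate in the top scorer group minus
# pass rate in the bottom group. The 27% split is the textbook convention.

def discrimination_index(item_correct, total_scores, frac=0.27):
    """item_correct[i] is 1 if respondent i answered this item correctly;
    total_scores[i] is respondent i's overall test score."""
    n = max(1, round(len(total_scores) * frac))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    p_high = sum(item_correct[i] for i in high) / n
    p_low = sum(item_correct[i] for i in low) / n
    return p_high - p_low

# 8 models: the stronger half answers this item correctly, the weaker half does not,
# so the item discriminates perfectly (index 1.0).
scores  = [10, 20, 30, 40, 50, 60, 70, 80]
correct = [0,  0,  0,  0,  1,  1,  1,  1]
d = discrimination_index(correct, scores)
```

An index near 1.0 means the prompt separates strong from weak models; near 0.0 it carries no ranking information, which is the property the framework's predictor models estimate.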

[NLP-5] Suicide Phenotyping from Clinical Notes in Safety-Net Psychiatric Hospital Using Multi-Label Classification with Pre-Trained Language Models

[Quick Read]: This paper aims to accurately identify and categorize suicide-related events from unstructured clinical notes, in order to improve suicide precautions and care quality in high-acuity psychiatric settings. The key is fine-tuning pre-trained language models (BERT and its variants); RoBERTa with a single multi-label classification strategy performed best, with accuracy 0.88 and F1 0.81, indicating that models pre-trained on domain-relevant data combined with single multi-label classification significantly improve both efficiency and performance.

Link: https://arxiv.org/abs/2409.18878
Authors: Zehan Li, Yan Hu, Scott Lane, Salih Selek, Lokesh Shahani, Rodrigo Machado-Vieira, Jair Soares, Hua Xu, Hongfang Liu, Ming Huang
Keywords-EN: reducing operational burden, improving care quality, Accurate identification, high-acuity psychiatric settings, reducing operational
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: submitted to AMIA Informatics Summit 2025 as a conference paper


Abstract:Accurate identification and categorization of suicidal events can yield better suicide precautions, reducing operational burden, and improving care quality in high-acuity psychiatric settings. Pre-trained language models offer promise for identifying suicidality from unstructured clinical narratives. We evaluated the performance of four BERT-based models using two fine-tuning strategies (multiple single-label and single multi-label) for detecting coexisting suicidal events from 500 annotated psychiatric evaluation notes. The notes were labeled for suicidal ideation (SI), suicide attempts (SA), exposure to suicide (ES), and non-suicidal self-injury (NSSI). RoBERTa outperformed other models using binary relevance (acc=0.86, F1=0.78). MentalBERT (F1=0.74) also exceeded BioClinicalBERT (F1=0.72). RoBERTa fine-tuned with a single multi-label classifier further improved performance (acc=0.88, F1=0.81), highlighting that models pre-trained on domain-relevant data and the single multi-label classification strategy enhance efficiency and performance. Keywords: EHR-based Phenotyping; Natural Language Processing; Secondary Use of EHR Data; Suicide Classification; BERT-based Model; Psychiatry; Mental Health
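
To make the "binary relevance" setup concrete: each of the four labels gets an independent yes/no decision, and performance is aggregated with a micro-averaged F1 over all label slots. The keyword rules below are toy stand-ins for the fine-tuned BERT heads.

```python
# Binary relevance for the four suicide-related labels, plus micro-F1.
# The keyword classifiers are illustrative placeholders, not the paper's models.

LABELS = ["SI", "SA", "ES", "NSSI"]

def binary_relevance_predict(note, keyword_map):
    """One independent binary decision per label."""
    return {lbl: int(any(k in note.lower() for k in kws))
            for lbl, kws in keyword_map.items()}

def micro_f1(gold, pred):
    """Micro-averaged F1 over every (note, label) slot."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for lbl in LABELS:
            tp += g[lbl] and p[lbl]
            fp += (not g[lbl]) and p[lbl]
            fn += g[lbl] and not p[lbl]
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

keyword_map = {"SI": ["ideation"], "SA": ["attempt"],
               "ES": ["exposure"], "NSSI": ["self-injury"]}
note = "patient reports suicidal ideation and a prior attempt"
pred = binary_relevance_predict(note, keyword_map)
```

The single multi-label variant the paper favors differs in that one shared model emits all four decisions at once, but the evaluation over label slots is the same.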

[NLP-6] Individuation in Neural Models with and without Visual Grounding

[Quick Read]: This paper examines how the language-and-vision model CLIP differs from the text-only models FastText and SBERT in encoding individuation information. The key finding is that CLIP embeddings capture quantitative differences in individuation better, and the individuation hierarchy derived from them agrees with hierarchies proposed in linguistics and cognitive science, suggesting a clear advantage for models that incorporate visual information when handling individuation.

Link: https://arxiv.org/abs/2409.18868
Authors: Alexey Tikhonov, Lisa Bylinina, Ivan P. Yamshchikov
Keywords-EN: FastText and SBERT, CLIP, SBERT, CLIP embeddings, individuation information
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:We show differences between a language-and-vision model CLIP and two text-only models - FastText and SBERT - when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.
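
The kind of embedding probe described above can be illustrated with cosine similarity; the 3-dimensional vectors below are invented for the demo, whereas the study compares real CLIP, FastText, and SBERT embeddings.

```python
# Toy probe: a substance ('water') should sit closer to a granular aggregate
# ('sand') than to countable objects ('apples') in a space that encodes
# individuation. Vectors are made up for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

emb = {
    "water":  [0.9, 0.1, 0.0],   # substrate
    "sand":   [0.8, 0.2, 0.1],   # granular aggregate
    "apples": [0.1, 0.9, 0.3],   # countable objects
}
substrate_like = cosine(emb["water"], emb["sand"])
object_like = cosine(emb["water"], emb["apples"])
```

Ranking many such category pairs by similarity is one way to read an individuation hierarchy out of an embedding space.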

[NLP-7] Local Transcription Models in Home Care Nursing in Switzerland: an Interdisciplinary Case Study

[Quick Read]: This paper addresses the challenges that data privacy, local languages and dialects, and domain-specific vocabulary pose for home-care nursing documentation in Switzerland. The key is to assess and experiment with different transcription tools and models, in particular OpenAI's Whisper, on different varieties of German (dialects, foreign accents) and example texts manually curated by a home-care nursing domain expert. The results indicate that even an out-of-the-box model performs well enough to serve as a good starting point for future research in this area.

Link: https://arxiv.org/abs/2409.18819
Authors: Jeremy Kramer, Tetiana Kravchenko, Beatrice Kaufmann, Friederike J.S. Thilo, Mascha Kurpicz-Briki
Keywords-EN: natural language processing, Latest advances, medical sector, home care nursing, NLP
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Latest advances in the field of natural language processing (NLP) enable new use cases for different domains, including the medical sector. In particular, transcription can be used to support automation in the nursing documentation process and give nurses more time to interact with the patients. However, different challenges including (a) data privacy, (b) local languages and dialects, and (c) domain-specific vocabulary need to be addressed. In this case study, we investigate the case of home care nursing documentation in Switzerland. We assessed different transcription tools and models, and conducted several experiments with OpenAI Whisper, involving different variations of German (i.e., dialects, foreign accent) and manually curated example texts by a domain expert of home care nursing. Our results indicate that even the used out-of-the-box model performs sufficiently well to be a good starting point for future research in the field.

[NLP-8] LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis

[Quick Read]: This paper responds to the growing complexity and volume of scientific literature, in particular the need for rapid, coherent, and contextually rich integration of scientific insights. The key is the LLMs4Synthesis framework, which combines open-source and proprietary large language models (LLMs) to improve the generation of high-quality scientific syntheses, and which also examines how effective LLMs are at evaluating the integrity and reliability of those syntheses, compensating for shortcomings in current quantitative metrics. The paper further develops a new methodology for processing scientific papers, defines new synthesis types, and establishes nine detailed quality criteria, with LLMs combined with reinforcement learning and AI feedback proposed to keep synthesis quality aligned with the established criteria.

Link: https://arxiv.org/abs/2409.18812
Authors: Hamed Babaei Giglou, Jennifer D'Souza, Sören Auer
Keywords-EN: Large Language Models, Language Models, Large Language, capabilities of Large, generating high-quality scientific
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 12 pages, 3 figures, Accepted to JCDL 2024 Research Track


Abstract:In response to the growing complexity and volume of scientific literature, this paper introduces the LLMs4Synthesis framework, designed to enhance the capabilities of Large Language Models (LLMs) in generating high-quality scientific syntheses. This framework addresses the need for rapid, coherent, and contextually rich integration of scientific insights, leveraging both open-source and proprietary LLMs. It also examines the effectiveness of LLMs in evaluating the integrity and reliability of these syntheses, alleviating inadequacies in current quantitative metrics. Our study contributes to this field by developing a novel methodology for processing scientific papers, defining new synthesis types, and establishing nine detailed quality criteria for evaluating syntheses. The integration of LLMs with reinforcement learning and AI feedback is proposed to optimize synthesis quality, ensuring alignment with established criteria. The LLMs4Synthesis framework and its components are made available, promising to enhance both the generation and evaluation processes in scientific research synthesis.
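
A synthesis-scoring loop like the one the framework proposes can be sketched as rubric aggregation with a revision gate; the criterion names, equal weighting, and 0.6 threshold below are placeholders, since the paper defines its own nine criteria.

```python
# Sketch: score a synthesis against a list of quality criteria and flag it for
# another refinement round if any criterion falls below a threshold.
# Criterion names and threshold are illustrative assumptions.

CRITERIA = ["coherence", "coverage", "citation_fidelity",
            "conciseness", "factuality"]

def evaluate_synthesis(scores, threshold=0.6):
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    failed = [c for c in CRITERIA if scores[c] < threshold]
    overall = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return {"overall": overall, "needs_revision": bool(failed), "failed": failed}

report = evaluate_synthesis({"coherence": 0.9, "coverage": 0.8,
                             "citation_fidelity": 0.5, "conciseness": 0.7,
                             "factuality": 0.9})
```

In the framework, per-criterion scores would come from an LLM judge, and the "needs_revision" signal would drive the reinforcement-learning / AI-feedback loop.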

[NLP-9] A Survey on the Honesty of Large Language Models

[Quick Read]: This paper addresses honesty problems in large language models (LLMs): inconsistency in expressing what they do and do not know, and overconfidence when presenting wrong information. The key is a comprehensive survey of LLM honesty covering its definition, evaluation approaches, and improvement strategies. By clarifying what honesty means, developing effective evaluation methods, and proposing strategies for improvement, the paper aims to guide and inspire future research in this important area.

Link: https://arxiv.org/abs/2409.18786
Authors: Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, Xinyu Zhu, Zesen Cheng, Deng Cai, Mo Yu, Lemao Liu, Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam
Keywords-EN: large language models, aligning large language, language models, fundamental principle, principle for aligning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL


Abstract:Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don’t know and be able to faithfully express their knowledge. Despite promising, current LLMs still exhibit significant dishonest behaviors, such as confidently presenting wrong answers or failing to express what they know. In addition, research on the honesty of LLMs also faces challenges, including varying definitions of honesty, difficulties in distinguishing between known and unknown knowledge, and a lack of comprehensive understanding of related research. To address these issues, we provide a survey on the honesty of LLMs, covering its clarification, evaluation approaches, and strategies for improvement. Moreover, we offer insights for future research, aiming to inspire further exploration in this important area.

[NLP-10] Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

[Quick Read]: This paper addresses the fact that traditional evaluation of data visualizations relies on human judgment, which is costly and hard to scale, and often neglects the effectiveness of visual communication. The key is to use Visual Question Answering (VQA) models to automate the evaluation of visualizations generated by large language models (LLMs), assessing both data-representation quality and the communicative clarity of charts, which speeds up the research process and reduces dependence on human annotation.

Link: https://arxiv.org/abs/2409.18764
Authors: James Ford, Xingmeng Zhao, Dan Schumacher, Anthony Rios
Keywords-EN: Visual Question Answering, leverages Visual Question, Question Answering, Visual Question, framework that leverages
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:


Abstract:We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods often rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess data representation quality and the general communicative clarity of charts. Experiments were conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI’s GPT-3.5 Turbo and Meta’s Llama 3.1 70B-Instruct models. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures. Moreover, while our results demonstrate that few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains to be made before LLMs can fully match the precision of human-generated graphs. This underscores the importance of our work, which expedites the research process by enabling rapid iteration without the need for human annotation, thus accelerating advancements in this field.
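
The evaluation loop can be sketched as asking the same chart questions of a reference chart and an LLM-generated one, then scoring agreement. The `vqa` function below is a stub (charts are dicts of question-to-answer pairs) standing in for a real VQA model.

```python
# Sketch of VQA-based chart evaluation: accuracy is the fraction of questions
# where the generated chart yields the same answer as the reference chart.
# The stub 'vqa' is an assumption for the demo, not a real model.

def vqa(chart, question):
    return chart.get(question, "unknown")  # stub: chart is a dict of Q -> A

def chart_accuracy(generated, reference, questions):
    hits = sum(vqa(generated, q) == vqa(reference, q) for q in questions)
    return hits / len(questions)

reference = {"max category?": "Q3", "total?": "120"}
generated = {"max category?": "Q3", "total?": "115"}  # one value misrendered
acc = chart_accuracy(generated, reference, list(reference))
```

With a real VQA model, `chart` would be an image and `vqa` a forward pass, but the agreement-based scoring stays the same.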

[NLP-11] Cross-Domain Keyword Extraction with Keyness Patterns

[Quick Read]: This paper addresses domain dependence and annotation subjectivity in supervised keyword extraction. The key is a ranking-based approach that scores keywords using keyness patterns combining independent features (such as sublanguage domain and term length) with three categories of dependent features: heuristic, specificity, and representativity features. Two convolutional-neural-network models learn keyness patterns from keyword datasets, and a bootstrap sampling strategy mitigates annotation subjectivity. Experiments show that the approach not only reaches state-of-the-art performance on general supervised keyword extraction but also performs strongly across domains, which the authors attribute to community-level keyness patterns being limited in number and largely independent of language domains, to the separation of independent and dependent features, and to the sampling-based training strategy.

Link: https://arxiv.org/abs/2409.18724
Authors: Dongmei Zhou, Xuri Tang
Keywords-EN: subjectivity pose challenges, supervised keyword extraction, keyword extraction, annotation subjectivity pose, keyness patterns
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 26 pages, 14 figures


Abstract:Domain dependence and annotation subjectivity pose challenges for supervised keyword extraction. Based on the premises that second-order keyness patterns are existent at the community level and learnable from annotated keyword extraction datasets, this paper proposes a supervised ranking approach to keyword extraction that ranks keywords with keyness patterns consisting of independent features (such as sublanguage domain and term length) and three categories of dependent features – heuristic features, specificity features, and representavity features. The approach uses two convolutional-neural-network based models to learn keyness patterns from keyword datasets and overcomes annotation subjectivity by training the two models with bootstrap sampling strategy. Experiments demonstrate that the approach not only achieves state-of-the-art performance on ten keyword datasets in general supervised keyword extraction with an average top-10-F-measure of 0.316 , but also robust cross-domain performance with an average top-10-F-measure of 0.346 on four datasets that are excluded in the training process. Such cross-domain robustness is attributed to the fact that community-level keyness patterns are limited in number and temperately independent of language domains, the distinction between independent features and dependent features, and the sampling training strategy that balances excess risk and lack of negative training data.
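
The feature-based ranking idea can be illustrated with hand-set scores; the feature names follow the paper's grouping, but the linear scoring rule and the weights are assumptions for the demo (the paper learns the patterns with CNN-based rankers).

```python
# Illustrative keyness ranking from per-candidate feature scores.
# Weights and feature values are invented for the demo.

def keyness_score(feats, weights):
    return sum(weights[name] * value for name, value in feats.items())

weights = {"term_length": 0.1, "heuristic": 0.4,
           "specificity": 0.3, "representativity": 0.2}

candidates = {
    "neural network": {"term_length": 0.5, "heuristic": 0.9,
                       "specificity": 0.8, "representativity": 0.7},
    "result":         {"term_length": 0.2, "heuristic": 0.3,
                       "specificity": 0.1, "representativity": 0.2},
}
ranked = sorted(candidates, key=lambda c: -keyness_score(candidates[c], weights))
```

The top-k of such a ranking is what the top-10-F-measure in the paper evaluates against the annotated keywords.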

[NLP-12] Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity

[Quick Read]: This paper exposes a security weakness in language models: their inability to correctly interpret ASCII art can be exploited for adversarial attacks. The key is a novel family of attacks built on two custom ASCII-art fonts, one leveraging special tokens and the other using text-filled letter shapes, which achieve a 100% attack success rate across ten models, including OpenAI's o1-preview and LLaMA 3.1.

Link: https://arxiv.org/abs/2409.18708
Authors: Sergey Berezin, Reza Farahbakhsh, Noel Crespi
Keywords-EN: interpret ASCII art, ASCII art fonts, ASCII art, custom ASCII art, interpret ASCII
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:


Abstract:We introduce a novel family of adversarial attacks that exploit the inability of language models to interpret ASCII art. To evaluate these attacks, we propose the ToxASCII benchmark and develop two custom ASCII art fonts: one leveraging special tokens and another using text-filled letter shapes. Our attacks achieve a perfect 1.0 Attack Success Rate across ten models, including OpenAI’s o1-preview and LLaMA 3.1. Warning: this paper contains examples of toxic language used for research purposes.
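
A toy version of a "text-filled letter shape" font shows why such attacks evade token-level filters: each character is blown up into a multi-line block drawn with the character itself, so the surface string no longer contains the original word as a contiguous token sequence. This is purely illustrative; the paper's fonts are more elaborate.

```python
# Toy text-filled ASCII-art font covering two letters. The '#' cells of each
# 5-line glyph are filled with the letter itself.

FONT = {
    "H": ["#.#", "#.#", "###", "#.#", "#.#"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
}

def render(word):
    """Render a word as 5 lines of ASCII art, glyphs separated by two spaces."""
    rows = []
    for i in range(5):
        rows.append("  ".join(FONT[ch][i].replace("#", ch) for ch in word))
    return "\n".join(rows)

art = render("HI")
```

A keyword filter scanning for the literal string "HI" never sees it, even though a human (or a vision-capable model) reads the word immediately.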

[NLP-13] KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model

[Quick Read]: This vision paper asks how artificial intelligence (AI) can be harnessed to advance scientific research. The key is Llama3-KALE-LM-Chem-8B, a large language model in the KALE-LM series that performs strongly on chemistry-related tasks. By open-sourcing this model, the authors hope to provide a strong starting point toward more intelligent AI and to promote the advancement of science, technology, and society.

Link: https://arxiv.org/abs/2409.18695
Authors: Weichen Dai, Yezeng Chen, Zijie Dai, Zhijie Huang, Yubo Liu, Yixuan Pan, Baiyang Song, Chengli Zhong, Xinhe Li, Zeyu Wang, Zhuoying Feng, Yi Zhou
Keywords-EN: advance scientific research, Artificial intelligence, immense potential, intelligence is gradually, gradually demonstrating
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:


Abstract:Artificial intelligence is gradually demonstrating its immense potential, and increasing attention is being given to how AI can be harnessed to advance scientific research. In this vision paper, we present our perspectives on how AI can better assist scientific inquiry and explore corresponding technical approach. We have proposed and open-sourced a large model of our KALE-LM model series, Llama3-KALE-LM-Chem-8B, which has achieved outstanding performance in tasks related to the field of chemistry. We hope that our work serves as a strong starting point, helping to realize more intelligent AI and promoting the advancement of human science and technology, as well as societal development.

[NLP-14] Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models EMNLP24

[Quick Read]: This paper targets the shortcomings of existing audio large language models (ALLMs) in multi-audio-stream scenarios, especially under multi-task, multi-audio inputs. The key to the solution is the first multi-audio evaluation (MAE) benchmark, together with a novel multi-audio LLM (MALLM) that captures the contextual relations among multiple similar audios through discriminative learning on synthetic data. Experiments show that MALLM significantly outperforms existing models on multi-audio tasks and achieves high data efficiency without human annotation, opening a new direction for ALLMs in multi-audio processing.

Link: https://arxiv.org/abs/2409.18680
Authors: Yiming Chen,Xianghu Yue,Xiaoxue Gao,Chen Zhang,Luis Fernando D’Haro,Robby T. Tan,Haizhou Li
Keywords: unified model, explored recently, recently for tackling, proposed MALLM, audio
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: EMNLP24 Findings

Abstract:Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggling to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.

[NLP-15] “Why” Has the Least Side Effect on Model Editing

[Quick Read]: This paper addresses the performance degradation and unintended side effects that arise when updating knowledge in large language models (LLMs). The key to the solution is categorizing model-editing questions by type, revealing how strongly each question type contributes to performance degradation and thereby informing experimental design for knowledge editing. The study also finds that results from smaller models do not directly transfer to larger ones, and that increasing the batch size can mitigate the performance drop.

Link: https://arxiv.org/abs/2409.18679
Authors: Tsung-Hsuan Pan,Chung-Chi Chen,Hen-Hsen Huang,Hsin-Hsi Chen
Keywords: Training large language, Training large, knowledge continually evolves, large language models, world knowledge continually
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Training large language models (LLMs) from scratch is an expensive endeavor, particularly as world knowledge continually evolves. To maintain relevance and accuracy of LLMs, model editing has emerged as a pivotal research area. While these methods hold promise, they can also produce unintended side effects. Their underlying factors and causes remain largely unexplored. This paper delves into a critical factor-question type-by categorizing model editing questions. Our findings reveal that the extent of performance degradation varies significantly across different question types, providing new insights for experimental design in knowledge editing. Furthermore, we investigate whether insights from smaller models can be extrapolated to larger models. Our results indicate discrepancies in findings between models of different sizes, suggesting that insights from smaller models may not necessarily apply to larger models. Additionally, we examine the impact of batch size on side effects, discovering that increasing the batch size can mitigate performance drops.

[NLP-16] Rehearsing Answers to Probable Questions with Perspective-Taking

Link: https://arxiv.org/abs/2409.18678
Authors: Yung-Yu Shih,Ziwei Xu,Hiroya Takamura,Yun-Nung Chen,Chung-Chi Chen
Keywords:
Subjects: Computation and Language (cs.CL)
Comments:

[NLP-17] Co-Trained Retriever-Generator Framework for Question Generation in Earnings Calls

[Quick Read]: This paper tackles the problem of accurately anticipating analyst questions in professional settings such as corporate earnings calls. The key to the solution is the multi-question generation (MQG) task, combined with a novel annotation technique for earnings-call transcripts and a retrieval-enhanced strategy. By collecting a large corpus of earnings-call transcripts, annotating them with a fine-grained classification scheme, and using information retrieval to extract relevant content, the method generates a spectrum of questions analysts might pose. Empirical evaluation shows strong performance in the accuracy, consistency, and perplexity of the generated questions.

Link: https://arxiv.org/abs/2409.18677
Authors: Yining Juan,Chung-Chi Chen,Hen-Hsen Huang,Hsin-Hsi Chen
Keywords: diverse professional environments, questions stands paramount, stands paramount, ability to anticipate, corporate earnings calls
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In diverse professional environments, ranging from academic conferences to corporate earnings calls, the ability to anticipate audience questions stands paramount. Traditional methods, which rely on manual assessment of an audience’s background, interests, and subject knowledge, often fall short - particularly when facing large or heterogeneous groups, leading to imprecision and inefficiency. While NLP has made strides in text-based question generation, its primary focus remains on academic settings, leaving the intricate challenges of professional domains, especially earnings call conferences, underserved. Addressing this gap, our paper pioneers the multi-question generation (MQG) task specifically designed for earnings call contexts. Our methodology involves an exhaustive collection of earnings call transcripts and a novel annotation technique to classify potential questions. Furthermore, we introduce a retriever-enhanced strategy to extract relevant information. With a core aim of generating a spectrum of potential questions that analysts might pose, we derive these directly from earnings call content. Empirical evaluations underscore our approach’s edge, revealing notable excellence in the accuracy, consistency, and perplexity of the questions generated.
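The retrieval step can be pictured with a toy ranking function. The sketch below is illustrative only (the paper does not specify its retriever at this level of detail): candidate transcript passages are ranked against a query by bag-of-words cosine similarity, and the invented snippets stand in for earnings-call content.

```python
# Minimal bag-of-words cosine retrieval over invented earnings-call snippets.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts under bag-of-words counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

passages = [
    "revenue grew twelve percent on strong cloud demand",
    "we repurchased shares and raised the dividend",
    "gross margin declined due to higher input costs",
]
query = "which factors drove revenue growth"
ranked = sorted(passages, key=lambda p: cosine(query, p), reverse=True)
```

A real system would use a learned retriever and feed the top passages into the question generator; the ranking interface is the same.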

[NLP-18] HiCuLR: Hierarchical Curriculum Learning for Rhetorical Role Labeling of Legal Documents EMNLP2024

[Quick Read]: This paper addresses the varying difficulty levels inherent in rhetorical role labeling (RRL) of legal documents. The key to the solution is HiCuLR, a hierarchical curriculum learning framework that nests two curricula: an outer Rhetorical Role-level Curriculum (RC) and an inner Document-level Curriculum (DC). DC categorizes documents by difficulty, using metrics such as deviation from a standard discourse structure, and exposes them to the model in an easy-to-difficult fashion; RC progressively strengthens the model's ability to discern coarse-to-fine-grained distinctions between rhetorical roles. Experiments on four RRL datasets demonstrate the efficacy of HiCuLR and the complementary nature of DC and RC.

Link: https://arxiv.org/abs/2409.18647
Authors: T.Y.S.S. Santosh,Apolline Isaia,Shiyu Hong,Matthias Grabmair
Keywords: Rhetorical Role Labeling, semantic case search, Role Labeling, semantic case, argument mining
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 Findings

Abstract:Rhetorical Role Labeling (RRL) of legal documents is pivotal for various downstream tasks such as summarization, semantic case search and argument mining. Existing approaches often overlook the varying difficulty levels inherent in legal document discourse styles and rhetorical roles. In this work, we propose HiCuLR, a hierarchical curriculum learning framework for RRL. It nests two curricula: Rhetorical Role-level Curriculum (RC) on the outer layer and Document-level Curriculum (DC) on the inner layer. DC categorizes documents based on their difficulty, utilizing metrics like deviation from a standard discourse structure and exposes the model to them in an easy-to-difficult fashion. RC progressively strengthens the model to discern coarse-to-fine-grained distinctions between rhetorical roles. Our experiments on four RRL datasets demonstrate the efficacy of HiCuLR, highlighting the complementary nature of DC and RC.
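A document-level curriculum of this kind can be sketched in a few lines: documents are scored by how far their rhetorical-role sequence deviates from a canonical discourse order, then sorted easy-to-difficult. The canonical role order and the edit-distance difficulty metric below are illustrative assumptions, not the paper's exact metric.

```python
# Curriculum ordering by deviation from a canonical rhetorical-role sequence.
def edit_distance(a, b):
    """Levenshtein distance between two role sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

CANONICAL = ["Facts", "Argument", "Reasoning", "Decision"]  # assumed order

def curriculum_order(docs):
    """Sort documents easy-to-difficult by deviation from the canonical order."""
    return sorted(docs, key=lambda roles: edit_distance(roles, CANONICAL))

docs = [
    ["Decision", "Facts", "Argument"],               # heavily reordered -> hard
    ["Facts", "Argument", "Reasoning", "Decision"],  # canonical -> easy
    ["Facts", "Reasoning", "Decision"],              # one role missing -> medium
]
ordered = curriculum_order(docs)
```

Training would then sample batches from the front of `ordered` first, gradually admitting harder documents.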

[NLP-19] The Craft of Selective Prediction: Towards Reliable Case Outcome Classification – An Empirical Study on European Court of Human Rights Cases EMNLP

[Quick Read]: This paper addresses quantifying a model's predictive confidence in high-stakes legal NLP tasks such as case outcome classification (COC). The key to the solution is a systematic empirical study of how design choices, including the pre-training corpus, the confidence estimator, and the fine-tuning loss, affect model reliability under the selective prediction framework. The results show that a diverse yet domain-specific pre-training corpus aids calibration, larger models tend toward overconfidence, Monte Carlo dropout yields reliable confidence estimates, and confident error regularization effectively mitigates overconfidence. As the first systematic exploration of selective prediction in legal NLP, the paper underscores the need for further research on confidence measurement and model trustworthiness.

Link: https://arxiv.org/abs/2409.18645
Authors: T.Y.S.S. Santosh,Irtiza Chowdhury,Shanshan Xu,Matthias Grabmair
Keywords: Case Outcome Classification, Outcome Classification, high-stakes decision-making tasks, Case Outcome, high-stakes decision-making
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP Findings

Abstract:In high-stakes decision-making tasks within legal NLP, such as Case Outcome Classification (COC), quantifying a model’s predictive confidence is crucial. Confidence estimation enables humans to make more informed decisions, particularly when the model’s certainty is low, or where the consequences of a mistake are significant. However, most existing COC works prioritize high task performance over model reliability. This paper conducts an empirical investigation into how various design choices including pre-training corpus, confidence estimator and fine-tuning loss affect the reliability of COC models within the framework of selective prediction. Our experiments on the multi-label COC task, focusing on European Court of Human Rights (ECtHR) cases, highlight the importance of a diverse yet domain-specific pre-training corpus for better calibration. Additionally, we demonstrate that larger models tend to exhibit overconfidence, Monte Carlo dropout methods produce reliable confidence estimates, and confident error regularization effectively mitigates overconfidence. To our knowledge, this is the first systematic exploration of selective prediction in legal NLP. Our findings underscore the need for further research on enhancing confidence measurement and improving the trustworthiness of models in the legal domain.
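Selective prediction itself is easy to make concrete: the model abstains whenever its confidence falls below a threshold, trading coverage for selective accuracy. The confidences and correctness flags below are invented for illustration.

```python
# Selective prediction: abstain below a confidence threshold and measure the
# coverage/selective-accuracy trade-off.
def selective_metrics(preds, threshold):
    """preds: list of (confidence, is_correct). Returns (coverage, selective_accuracy)."""
    accepted = [(c, ok) for c, ok in preds if c >= threshold]
    if not accepted:
        return 0.0, None  # full abstention: accuracy is undefined
    coverage = len(accepted) / len(preds)
    accuracy = sum(ok for _, ok in accepted) / len(accepted)
    return coverage, accuracy

preds = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.40, False)]
cov_all, acc_all = selective_metrics(preds, 0.0)   # no abstention
cov_sel, acc_sel = selective_metrics(preds, 0.85)  # abstain on low confidence
```

With a well-calibrated estimator, raising the threshold should raise selective accuracy at the cost of coverage, which is exactly the trade-off the paper's design choices aim to improve.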

[NLP-20] Incorporating Precedents for Legal Judgement Prediction on European Court of Human Rights Cases EMNLP

[Quick Read]: This paper addresses how to effectively leverage precedents (prior cases) in legal judgment prediction (LJP) models. The key to the solution is training a retriever with a fine-grained relevance signal based on the overlap ratio of alleged articles between cases, plus two strategies for integrating precedents: direct incorporation at inference via label interpolation based on case proximity, and incorporation during training via a precedent fusion module using a stacked-cross attention model. Joint training of the retriever and the LJP model resolves the divergence between their latent spaces. Experiments on LJP tasks from the ECHR jurisdiction show that integrating precedents during training, combined with joint training, outperforms models without precedents or with inference-only precedents, particularly benefiting sparser articles.

Link: https://arxiv.org/abs/2409.18644
Authors: T.Y.S.S. Santosh,Mohamed Hesham Elganayni,Stanisław Sójka,Matthias Grabmair
Keywords: stare decisis, informed decision-making, legal doctrine, doctrine of stare, explore methods
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP Findings

Abstract:Inspired by the legal doctrine of stare decisis, which leverages precedents (prior cases) for informed decision-making, we explore methods to integrate them into LJP models. To facilitate precedent retrieval, we train a retriever with a fine-grained relevance signal based on the overlap ratio of alleged articles between cases. We investigate two strategies to integrate precedents: direct incorporation at inference via label interpolation based on case proximity and during training via a precedent fusion module using a stacked-cross attention model. We employ joint training of the retriever and LJP models to address latent space divergence between them. Our experiments on LJP tasks from the ECHR jurisdiction reveal that integrating precedents during training coupled with joint training of the retriever and LJP model, outperforms models without precedents or with precedents incorporated only at inference, particularly benefiting sparser articles.
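Inference-time label interpolation can be sketched as a similarity-weighted blend of the model's label distribution with the labels of retrieved precedents. The weighting scheme, the mixing coefficient, and the numbers below are illustrative assumptions, not the paper's exact formulation.

```python
# Blend the model's label distribution with precedent labels weighted by
# case proximity (retriever similarity).
def interpolate(model_probs, precedents, alpha=0.5):
    """precedents: list of (similarity, label_probs); alpha: weight on the model."""
    total_sim = sum(s for s, _ in precedents)
    n = len(model_probs)
    neighbor = [
        sum(s * probs[i] for s, probs in precedents) / total_sim
        for i in range(n)
    ]
    return [alpha * m + (1 - alpha) * v for m, v in zip(model_probs, neighbor)]

model_probs = [0.6, 0.4]                              # e.g. violation / no-violation
precedents = [(0.9, [0.0, 1.0]), (0.1, [1.0, 0.0])]   # close precedent disagrees
blended = interpolate(model_probs, precedents)
```

Here a highly similar precedent pulls the final distribution toward its own label, which is the intended effect of proximity-based interpolation.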

[NLP-21] Model-based Preference Optimization in Abstractive Summarization without Human Feedback EMNLP2024

[Quick Read]: This paper targets the inaccuracies that large language models (LLMs) introduce in abstractive summarization by hallucinating content absent from the source. The key to the solution is Model-based Preference Optimization (MPO), which exploits the model's own summarization ability to build a preference dataset generated entirely by the model under different decoding strategies, then fine-tunes on it, substantially improving summary quality without any human feedback.

Link: https://arxiv.org/abs/2409.18618
Authors: Jaepill Choi,Kyubyung Chae,Jiwoo Song,Yohan Jo,Taesup Kim
Keywords: accurate summaries arises, Large Language Models, challenge of producing, producing concise, concise and accurate
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024

Abstract:In abstractive summarization, the challenge of producing concise and accurate summaries arises from the vast amount of information contained in the source document. Consequently, although Large Language Models (LLMs) can generate fluent text, they often introduce inaccuracies by hallucinating content not found in the original source. While supervised fine-tuning methods that maximize likelihood contribute to this issue, they do not consistently enhance the faithfulness of the summaries. Preference-based optimization methods, such as Direct Preference Optimization (DPO), can further refine the model to align with human preferences. However, these methods still heavily depend on costly human feedback. In this work, we introduce a novel and straightforward approach called Model-based Preference Optimization (MPO) to fine-tune LLMs for improved summarization abilities without any human feedback. By leveraging the model’s inherent summarization capabilities, we create a preference dataset that is fully generated by the model using different decoding strategies. Our experiments on standard summarization datasets and various metrics demonstrate that our proposed MPO significantly enhances the quality of generated summaries without relying on human feedback.
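The core MPO trick, building preference pairs entirely from the model's own decodes, can be sketched as below. The `generate` stub and the particular pairing (deterministic decode as "chosen", noisier sampled decode as "rejected") are assumptions for illustration, not the paper's exact recipe.

```python
# Build human-feedback-free preference pairs from two decoding strategies.
import random

def generate(document, strategy, seed=0):
    """Stand-in for model decoding; returns a (fake) summary string."""
    rng = random.Random(seed)
    words = document.split()
    if strategy == "greedy":
        return " ".join(words[: max(1, len(words) // 2)])
    # "sampled": shuffle to mimic a riskier, less faithful decode
    rng.shuffle(words)
    return " ".join(words[: max(1, len(words) // 2)])

def build_preference_pairs(documents):
    return [
        {"prompt": doc,
         "chosen": generate(doc, "greedy"),
         "rejected": generate(doc, "sampled")}
        for doc in documents
    ]

pairs = build_preference_pairs(["the quick brown fox jumps over the lazy dog"])
```

The resulting `{"prompt", "chosen", "rejected"}` records have the same shape as a DPO-style preference dataset, so standard preference-optimization trainers could consume them directly.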

[NLP-22] Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations EMNLP2024

Link: https://arxiv.org/abs/2409.18602
Authors: Nicolò Penzo,Maryam Sajedinia,Bruno Lepri,Sara Tonelli,Marco Guerini
Keywords:
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 main conference

[NLP-23] ASAG2024: A Combined Benchmark for Short Answer Grading

[Quick Read]: This paper addresses the lack of a comprehensive benchmark for automated short answer grading (SAG) across subjects, grading scales, and distributions, which makes it hard to assess the generalizability of existing methods. The key to the solution is the ASAG2024 benchmark, which combines seven commonly used short-answer grading datasets under a unified structure and grading scale, providing a standardized platform for comparing automated grading systems. Evaluating a set of recent SAG methods on it shows that while LLM-based approaches reach new high scores, they remain far from human performance, opening avenues for research on human-machine SAG systems.

Link: https://arxiv.org/abs/2409.18596
Authors: Gérôme Meyer,Philip Breuer,Jonathan Fürst
Keywords: Open-ended questions test, Open-ended questions, preferred assessment method, understanding than closed-ended, preferred assessment
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at SIGCSE-Virtual 2024

Abstract:Open-ended questions test a more thorough understanding than closed-ended questions and are often a preferred assessment method. However, open-ended questions are tedious to grade and subject to personal bias. Therefore, there have been efforts to speed up the grading process through automation. Short Answer Grading (SAG) systems aim to automatically score students’ answers. Despite growth in SAG methods and capabilities, there exists no comprehensive short-answer grading benchmark across different subjects, grading scales, and distributions. Thus, it is hard to assess the capabilities of current automated grading methods in terms of their generalizability. In this preliminary work, we introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems. Combining seven commonly used short-answer grading datasets in a common structure and grading scale. For our benchmark, we evaluate a set of recent SAG methods, revealing that while LLM-based approaches reach new high scores, they still are far from reaching human performance. This opens up avenues for future research on human-machine SAG systems.
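The unification step, mapping scores from heterogeneous grading scales onto one common scale, might look like a simple min-max rescaling. The dataset names and score ranges below are invented for illustration; the benchmark's actual normalization may differ.

```python
# Map raw scores from datasets with different grading scales onto [0, 1].
def to_common_scale(score, lo, hi):
    """Min-max map a raw score from [lo, hi] onto the common [0, 1] scale."""
    if hi == lo:
        raise ValueError("degenerate scale")
    return (score - lo) / (hi - lo)

datasets = {
    "dataset_a": (0, 5),   # hypothetically graded 0-5
    "dataset_b": (1, 10),  # hypothetically graded 1-10
}
unified = [
    to_common_scale(2.5, *datasets["dataset_a"]),
    to_common_scale(10, *datasets["dataset_b"]),
]
```

Once every dataset's labels live on the same scale, a single regression metric (e.g. RMSE on the unified scores) can compare grading systems across all seven datasets.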

[NLP-24] “Oh LLM, I’m Asking Thee, Please Give Me a Decision Tree”: Zero-Shot Decision Tree Induction and Embedding with Large Language Models

[Quick Read]: This paper addresses how to leverage the prior knowledge of large language models (LLMs) to produce interpretable machine learning models when data is limited. The key to the solution is using the LLMs' compressed world knowledge to generate decision trees without any training data; these zero-shot decision trees can surpass data-driven trees on some small tabular datasets, and embeddings derived from them perform on par with data-driven tree-based embeddings on average. Knowledge-driven decision tree induction and embedding thus serve as strong new baselines for data-driven machine learning in the low-data regime.

Link: https://arxiv.org/abs/2409.18594
Authors: Ricardo Knauer,Mario Koddenbrock,Raphael Wallsberger,Nicholas M. Brisson,Georg N. Duda,Deborah Falla,David W. Evans,Erik Rodner
Keywords: Large language models, Large language, leverage prior knowledge, provide powerful, leverage prior
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) provide powerful means to leverage prior knowledge for predictive modeling when data is limited. In this work, we demonstrate how LLMs can use their compressed world knowledge to generate intrinsically interpretable machine learning models, i.e., decision trees, without any training data. We find that these zero-shot decision trees can surpass data-driven trees on some small-sized tabular datasets and that embeddings derived from these trees perform on par with data-driven tree-based embeddings on average. Our knowledge-driven decision tree induction and embedding approaches therefore serve as strong new baselines for data-driven machine learning methods in the low-data regime.
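A zero-shot decision tree elicited from an LLM can be written down as plain nested dictionaries and applied with no training data at all. The features, thresholds, and labels below are hypothetical; they only show the shape such a tree takes once parsed out of the model's answer.

```python
# A (hypothetical) LLM-dictated decision tree as nested dicts, plus an
# evaluator that walks it for one data row.
tree = {
    "feature": "age",
    "threshold": 50,
    "left": {"label": "low risk"},   # age <= 50
    "right": {                        # age > 50
        "feature": "blood_pressure",
        "threshold": 140,
        "left": {"label": "medium risk"},
        "right": {"label": "high risk"},
    },
}

def predict(node, row):
    """Walk the tree until a leaf is reached."""
    while "label" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

label = predict(tree, {"age": 63, "blood_pressure": 150})
```

Tree-based embeddings can then be derived by recording, for each row, which leaf (or path) it lands in.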

[NLP-25] Hit the Sweet Spot! Span-Level Ensemble for Large Language Models

[Quick Read]: This paper addresses the limitations of existing sample-level and token-level ensembles of large language models (LLMs): sample-level methods cannot dynamically correct outputs during generation, while token-level methods make suboptimal decisions at each step because a single token carries limited information. The key to the solution is SweetSpan, a span-level ensemble in which candidate models independently generate candidate spans from a shared prefix and evaluate one another via perplexity scores, filtering out unfaithful scores for robust span selection, thereby balancing real-time adjustment against accurate ensemble decisions. Experiments in both a standard setting and a more challenging one (ensembling models with large performance gaps) demonstrate its effectiveness, robustness, and versatility.

Link: https://arxiv.org/abs/2409.18583
Authors: Yangyifan Xu,Jianghao Chen,Junhong Wu,Jiajun Zhang
Keywords: Ensembling various LLMs, highly valuable, LLMs to unlock, unlock their complementary, complementary potential
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Ensembling various LLMs to unlock their complementary potential and leverage their individual strengths is highly valuable. Previous studies typically focus on two main paradigms: sample-level and token-level ensembles. Sample-level ensemble methods either select or blend fully generated outputs, which hinders dynamic correction and enhancement of outputs during the generation process. On the other hand, token-level ensemble methods enable real-time correction through fine-grained ensemble at each generation step. However, the information carried by an individual token is quite limited, leading to suboptimal decisions at each step. To address these issues, we propose SweetSpan, a span-level ensemble method that effectively balances the need for real-time adjustments and the information required for accurate ensemble decisions. Our approach involves two key steps: First, we have each candidate model independently generate candidate spans based on the shared prefix. Second, we calculate perplexity scores to facilitate mutual evaluation among the candidate models and achieve robust span selection by filtering out unfaithful scores. To comprehensively evaluate ensemble methods, we propose a new challenging setting (ensemble models with significant performance gaps) in addition to the standard setting (ensemble the best-performing models) to assess the performance of model ensembles in more realistic scenarios. Experimental results in both standard and challenging settings across various language generation tasks demonstrate the effectiveness, robustness, and versatility of our approach compared with previous ensemble methods.
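Span scoring by perplexity is straightforward to sketch from per-token log-probabilities. The mutual evaluation among models and the filtering of unfaithful scores are omitted here, and the log-probabilities below are invented.

```python
# Score candidate spans by perplexity and keep the lowest-perplexity one.
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood over the span's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

candidates = {
    "model_a": [-0.1, -0.2, -0.1],  # fluent, likely span
    "model_b": [-1.5, -2.0, -1.0],  # unlikely span
}
best = min(candidates, key=lambda m: perplexity(candidates[m]))
```

In the full method each model would score every candidate span, not just its own, so that an outlier model cannot promote its span with an unfaithfully low self-score.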

[NLP-26] Research on Predicting Public Opinion Event Heat Levels Based on Large Language Models

[Quick Read]: This paper addresses predicting the heat level of public opinion events. The key to the solution is applying large language models (such as GPT-4o and DeepseekV2) to 62,836 preprocessed and classified Chinese hot-event records, automatically clustering the events into four heat levels, and evaluating prediction accuracy in two scenarios: without reference cases and with similar-case references. The models predict low-heat (Level 1) events fairly well but lose accuracy toward high-heat (Level 4) events, suggesting that with a more robust dataset, LLM-based heat-level prediction holds significant research potential.

Link: https://arxiv.org/abs/2409.18548
Authors: Yi Ren,Tianyi Zhang,Weibin Li,DuoMu Zhou,Chenhao Qin,FangCheng Dong
Keywords: demonstrated extraordinary capabilities, surpassing human performance, event heat level, heat level prediction, heat level
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: conference

Abstract:In recent years, with the rapid development of large language models, serval models such as GPT-4o have demonstrated extraordinary capabilities, surpassing human performance in various language tasks. As a result, many researchers have begun exploring their potential applications in the field of public opinion analysis. This study proposes a novel large-language-models-based method for public opinion event heat level prediction. First, we preprocessed and classified 62,836 Chinese hot event data collected between July 2022 and December 2023. Then, based on each event’s online dissemination heat index, we used the MiniBatchKMeans algorithm to automatically cluster the events and categorize them into four heat levels (ranging from low heat to very high heat). Next, we randomly selected 250 events from each heat level, totalling 1,000 events, to build the evaluation dataset. During the evaluation process, we employed various large language models to assess their accuracy in predicting event heat levels in two scenarios: without reference cases and with similar case references. The results showed that GPT-4o and DeepseekV2 performed the best in the latter case, achieving prediction accuracies of 41.4% and 41.5%, respectively. Although the overall prediction accuracy remains relatively low, it is worth noting that for low-heat (Level 1) events, the prediction accuracies of these two models reached 73.6% and 70.4%, respectively. Additionally, the prediction accuracy showed a downward trend from Level 1 to Level 4, which correlates with the uneven distribution of data across the heat levels in the actual dataset. This suggests that with the more robust dataset, public opinion event heat level prediction based on large language models will have significant research potential for the future.
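The clustering step can be illustrated with a dependency-free stand-in: the paper uses scikit-learn's MiniBatchKMeans on the dissemination heat index, while the sketch below runs a tiny 1-D Lloyd's k-means on invented heat values so it needs no external libraries.

```python
# Tiny 1-D k-means grouping heat-index values into four levels
# (a stand-in for MiniBatchKMeans; the heat values are invented).
def kmeans_1d(values, k, iters=50):
    values = sorted(values)
    # spread the initial centers over the value range
    centers = [values[int(i * (len(values) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

heat = [1, 2, 2, 3, 10, 11, 12, 40, 42, 45, 90, 95, 99]
levels = sorted(kmeans_1d(heat, k=4))  # Level 1 (low) .. Level 4 (very high)
```

Each event is then assigned the level of its nearest center, yielding the four heat classes used for evaluation.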

[NLP-27] A Survey on Complex Tasks for Goal-Directed Interactive Agents

Link: https://arxiv.org/abs/2409.18538
Authors: Mareike Hartmann,Alexander Koller
Keywords:
Subjects: Computation and Language (cs.CL)
Comments:

[NLP-28] EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis

[Quick Read]: This paper addresses the inadequate control of emotional intensity in existing speech synthesis models by proposing EmoPro, a two-stage prompt selection strategy. Its key is evaluating and selecting highly expressive, high-quality prompts along four dimensions: emotional expression strength, speech quality, text-emotion consistency, and model generation performance, yielding synthesized speech that is more emotionally expressive and engaging.

Link: https://arxiv.org/abs/2409.18512
Authors: Haoyu Wang,Chunyu Qiang,Tianrui Wang,Cheng Gong,Qiuyu Liu,Yu Jiang,Xiaobao Wang,Chenyang Wang,Chen Zhang
Keywords: remarkable zero-shot capabilities, demonstrated remarkable zero-shot, Recent advancements, trained on extensive, extensive datasets
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recent advancements in speech synthesis models, trained on extensive datasets, have demonstrated remarkable zero-shot capabilities. These models can control content, timbre, and emotion in generated speech based on prompt inputs. Despite these advancements, the choice of prompts significantly impacts the output quality, yet most existing selection schemes do not adequately address the control of emotional intensity. To address this question, this paper proposes a two-stage prompt selection strategy EmoPro, which is specifically designed for emotionally controllable speech synthesis. This strategy focuses on selecting highly expressive and high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected using the proposed method result in more emotionally expressive and engaging synthesized speech compared to those obtained through baseline. Audio samples and codes will be available at this https URL.
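One way to picture multi-perspective prompt selection of this kind is a weighted score over the four stated dimensions. The weights and per-prompt scores below are illustrative assumptions, not EmoPro's actual scoring.

```python
# Rank candidate prompts by a weighted combination of four perspective scores.
WEIGHTS = {
    "emotion_strength": 0.4,
    "speech_quality": 0.3,
    "text_emotion_consistency": 0.2,
    "generation_performance": 0.1,
}

def overall(scores):
    """Weighted sum of the four perspective scores for one prompt."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

prompts = {
    "prompt_a": {"emotion_strength": 0.9, "speech_quality": 0.8,
                 "text_emotion_consistency": 0.9, "generation_performance": 0.7},
    "prompt_b": {"emotion_strength": 0.5, "speech_quality": 0.9,
                 "text_emotion_consistency": 0.6, "generation_performance": 0.8},
}
best = max(prompts, key=lambda p: overall(prompts[p]))
```

A two-stage variant could first filter on hard thresholds (e.g. minimum speech quality) and only then rank the survivors with the weighted score.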

[NLP-29] Do We Need Domain-Specific Embedding Models? An Empirical Investigation

[Quick Read]: This paper asks whether domain-specific embedding models are still necessary in the era of large language models (LLMs), given that general-purpose models are already trained on massive multi-domain text. The key to the solution is introducing a finance-specific embedding benchmark (FinMTEB) and comparing model performance on it against the general-purpose MTEB, revealing that state-of-the-art embedding models degrade significantly on domain-specific language and semantic patterns. By quantifying dataset complexity and controlling for it, the paper provides compelling evidence that domain-specific embedding models remain necessary even when general-purpose models are broadly trained.

Link: https://arxiv.org/abs/2409.18511
Authors: Yixuan Tang,Yi Yang
Keywords: Text Embedding Benchmark, NLP applications, Massive Text Embedding, Embedding models, Embedding
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: this https URL

Abstract:Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advancements in Large Language Models (LLMs) have further enhanced the performance of embedding models, which are trained on massive amounts of text covering almost every domain. These models are often benchmarked on general-purpose datasets like Massive Text Embedding Benchmark (MTEB), where they demonstrate superior performance. However, a critical question arises: Is the development of domain-specific embedding models necessary when general-purpose models are trained on vast corpora that already include specialized domain texts? In this paper, we empirically investigate this question, choosing the finance domain as an example. We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to MTEB that consists of financial domain-specific text datasets. We evaluate the performance of seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop compared to their performance on MTEB. To account for the possibility that this drop is driven by FinMTEB’s higher complexity, we propose four measures to quantify dataset complexity and control for this factor in our analysis. Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns, even when trained on large general-purpose corpora. This study sheds light on the necessity of developing domain-specific embedding models in the LLM era, offering valuable insights for researchers and practitioners.
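A dataset-complexity measure of the kind the paper controls for might combine average text length with vocabulary diversity. The measure and the sample texts below are purely illustrative; the paper proposes four measures whose exact definitions are not reproduced here.

```python
# One plausible corpus-complexity statistic: average token count per text and
# type-token ratio (vocabulary diversity). Sample texts are invented.
def complexity(texts):
    tokens = [t for text in texts for t in text.lower().split()]
    avg_len = len(tokens) / len(texts)       # tokens per text
    ttr = len(set(tokens)) / len(tokens)     # type-token ratio
    return {"avg_len": avg_len, "type_token_ratio": ttr}

general = ["the cat sat", "the dog ran"]
finance = ["the coupon-bearing bond trades above par",
           "duration hedging offsets convexity risk"]
stats_general = complexity(general)
stats_finance = complexity(finance)
```

Comparing such statistics between MTEB and FinMTEB subsets lets one check whether a performance gap is explained by raw difficulty rather than by domain shift.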

[NLP-30] Evaluation of OpenAI o1: Opportunities and Challenges of AGI

[Quick Read]: This paper comprehensively evaluates OpenAI's o1-preview large language model on complex reasoning tasks spanning computer science, mathematics, the natural sciences, medicine, linguistics, and the social sciences. Its key finding is the model's outstanding performance across domains: an 83.3% success rate on competitive programming problems, surpassing many human experts; superior radiology report generation; 100% accuracy on high-school-level mathematical reasoning with detailed step-by-step solutions; and notable results in natural language inference, chip design (e.g. EDA script generation and bug analysis), anthropology, geology, quantitative investing, and social media analysis. These results indicate significant progress in cross-domain complex reasoning and knowledge integration, despite remaining limitations on some simpler problems and on certain highly specialized concepts.

Link: https://arxiv.org/abs/2409.18486
Authors: Tianyang Zhong,Zhengliang Liu,Yi Pan,Yutong Zhang,Yifan Zhou,Shizhe Liang,Zihao Wu,Yanjun Lyu,Peng Shu,Xiaowei Yu,Chao Cao,Hanqi Jiang,Hanxu Chen,Yiwei Li,Junhao Chen,Huawen Hu,Yihen Liu,Huaqin Zhao,Shaochen Xu,Haixing Dai,Lin Zhao,Ruidong Zhang,Wei Zhao,Zhenyuan Yang,Jingyuan Chen,Peilong Wang,Wei Ruan,Hui Wang,Huan Zhao,Jing Zhang,Yiming Ren,Shihuan Qin,Tong Chen,Jiaxi Li,Arif Hassan Zidan,Afrar Jahin,Minheng Chen,Sichen Xia,Jason Holmes,Yan Zhuang,Jiaqi Wang,Bochen Xu,Weiran Xia,Jichao Yu,Kaibo Tang,Yaxuan Yang,Bolun Sun,Tao Yang,Guoyu Lu,Xianqiao Wang,Lilong Chai,He Li,Jin Lu,Lichao Sun,Xin Zhang,Bao Ge,Xintao Hu,Lian Zhang,Hua Zhou,Lu Zhang,Shu Zhang,Ninghao Liu,Bei Jiang,Linglong Kong,Zhen Xiang,Yudan Ren,Jun Liu,Xi Jiang,Yu Bao,Wei Zhang,Xiang Li,Gang Li,Wei Liu,Dinggang Shen,Andrea Sikora,Xiaoming Zhai,Dajiang Zhu,Tianming Liu
Keywords: spanning multiple domains, including computer science
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This comprehensive study evaluates the performance of OpenAI’s o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence. 
摘要:本综合研究评估了 OpenAI 的 o1-preview 大语言模型在多个复杂推理任务中的表现,涵盖计算机科学、数学、自然科学、医学、语言学和社会科学等多个领域。通过严格的测试,o1-preview 展示了卓越的能力,在从编码挑战到科学推理、从语言处理到创造性问题解决等多个领域中,常常达到或超越人类水平的表现。关键发现包括:- 在解决复杂的竞争性编程问题中,成功率达到 83.3%,超越了许多人类专家。- 在生成连贯且准确的放射学报告方面表现优异,优于其他评估模型。- 在高中水平的数学推理任务中,准确率达到 100%,并提供详细的逐步解决方案。- 在一般和专业领域(如医学)中,具备先进的自然语言推理能力。- 在芯片设计任务中表现出色,在 EDA 脚本生成和错误分析等领域优于专业模型。- 在人类学和地质学方面表现出显著的熟练度,展示了在这些专业领域的深刻理解和推理能力。- 在量化投资方面具备强大的能力。O1 拥有全面的金融知识和统计建模技能。- 在社交媒体分析中表现有效,包括情感分析和情绪识别。该模型在需要跨多个领域进行复杂推理和知识整合的任务中表现尤为突出。尽管观察到一些局限性,包括在简单问题上偶尔出错以及在某些高度专业化的概念上遇到挑战,但总体结果表明在向通用人工智能迈进方面取得了显著进展。


[NLP-31] URIEL: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base

【速读】: 该论文试图解决URIEL知识库在语言覆盖范围和用户体验方面的局限性问题。解决方案的关键在于引入URIEL+,这是一个增强版的URIEL和lang2vec工具,通过扩展2898种语言的类型学特征覆盖范围,并提供更强大、可定制的距离计算功能,从而提升用户体验,使其更符合用户需求,并在下游任务中表现出更强的竞争力,同时更好地与语言距离研究结果相吻合。

链接: https://arxiv.org/abs/2409.18472
作者: Aditya Khan,Mason Shipton,David Anugraha,Kaiyao Duan,Phuong H. Hoang,Eric Khiu,A. Seza Doğruöz,En-Shiun Annie Lee
关键词-EN: base offering geographical, knowledge base offering, offering geographical, languages, knowledge base
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
摘要:URIEL 是一个知识库,提供了 7970 种语言的地理、系统发育和类型学向量表示,并包含其中 4005 种语言之间的距离度量,这些度量可通过 lang2vec 工具访问。尽管 URIEL 经常被引用,但其在语言覆盖范围和整体可用性方面存在局限。为应对这些挑战,我们推出了 URIEL+,即 URIEL 和 lang2vec 的增强版本。除了扩展 2898 种语言的类型学特征覆盖范围外,URIEL+ 还通过强大且可定制的距离计算功能改善了用户体验,更好地满足用户需求。这些升级还在下游任务中提供了具有竞争力的性能,并给出了更符合语言距离研究结果的距离度量。
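URIEL+ 所强调的“可定制距离计算”,其基本思路可以用一个极简草图来说明(假设性实现,函数名与特征编码均为演示虚构,并非 lang2vec 的真实 API):对含缺失值的类型学特征向量,只在双方都有取值的维度上计算角度距离。

```python
import math

def angular_distance(vec_a, vec_b):
    """在两向量均有值(非 None)的维度上计算角度距离,范围 [0, 1]。"""
    pairs = [(a, b) for a, b in zip(vec_a, vec_b) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no shared features")
    dot = sum(a * b for a, b in pairs)
    na = math.sqrt(sum(a * a for a, _ in pairs))
    nb = math.sqrt(sum(b * b for _, b in pairs))
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    return math.acos(cos) / math.pi

# 两种假想语言的类型学特征向量(1/0 表示特征有无,None 表示缺失)
lang_x = [1, 0, 1, None, 1]
lang_y = [1, 1, 1, 0, None]
print(round(angular_distance(lang_x, lang_y), 3))
```

对缺失维度的处理方式(跳过、填充或加权)正是此类工具中值得开放给用户定制的部分。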

[NLP-32] Leveraging Long-Context Large Language Models for Multi-Document Understanding and Summarization in Enterprise Applications

【速读】: 该论文试图解决多文档理解和摘要中的关键问题,即传统方法难以捕捉相关上下文、保持逻辑一致性并从长文档中提取重要信息。解决方案的关键在于利用长上下文大语言模型(LLMs)进行多文档摘要,这些模型能够有效把握广泛联系、生成连贯的摘要,并适应不同行业领域及企业应用系统的集成。论文通过在法律、人力资源、财务、采购、医疗和新闻等领域的案例研究,展示了长上下文LLMs在提高摘要效率和准确性方面的显著优势,并探讨了技术挑战和未来研究方向。

链接: https://arxiv.org/abs/2409.18454
作者: Aditi Godbole,Jabin Geevarghese George,Smita Shandilya
关键词-EN: made multi-document comprehension, critical task, Long-context Large Language, rapid increase, increase in unstructured
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid increase in unstructured data across various fields has made multi-document comprehension and summarization a critical task. Traditional approaches often fail to capture relevant context, maintain logical consistency, and extract essential information from lengthy documents. This paper explores the use of Long-context Large Language Models (LLMs) for multi-document summarization, demonstrating their exceptional capacity to grasp extensive connections, provide cohesive summaries, and adapt to various industry domains and integration with enterprise applications/systems. The paper discusses the workflow of multi-document summarization for effectively deploying long-context LLMs, supported by case studies in legal applications, enterprise functions such as HR, finance, and sourcing, as well as in the medical and news domains. These case studies show notable enhancements in both efficiency and accuracy. Technical obstacles, such as dataset diversity, model scalability, and ethical considerations like bias mitigation and factual accuracy, are carefully analyzed. Prospective research avenues are suggested to augment the functionalities and applications of long-context LLMs, establishing them as pivotal tools for transforming information processing across diverse sectors and enterprise applications.
摘要:随着各领域非结构化数据的快速增长,多文档理解和摘要已成为一项关键任务。传统方法往往难以捕捉相关上下文、保持逻辑一致性,并从冗长的文档中提取关键信息。本文探讨了利用长上下文大语言模型 (LLM) 进行多文档摘要的方法,展示了其在把握广泛联系、提供连贯摘要以及适应各行业领域和与企业应用/系统集成方面的卓越能力。本文讨论了有效部署长上下文 LLM 的多文档摘要工作流程,并通过法律应用、企业职能(如人力资源、财务和采购)以及医疗和新闻领域的案例研究进行支持。这些案例研究表明,在效率和准确性方面均有显著提升。本文还仔细分析了技术障碍,如数据集多样性、模型可扩展性以及偏见缓解和事实准确性等伦理考量。最后,本文提出了增强长上下文 LLM 功能和应用的未来研究方向,确立其在跨不同行业和企业应用中转变信息处理的关键工具地位。
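摘要中描述的多文档摘要工作流(切块、逐块生成局部摘要、再汇总)可以用如下骨架示意(假设性代码,summarize_fn 代表对 LLM 的调用,此处用截取首句的函数代替):

```python
def chunk_documents(docs, max_chars=2000):
    """将多篇文档拼接后按字符数切块(真实系统通常按 token 数切)。"""
    text = "\n\n".join(docs)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def map_reduce_summarize(docs, summarize_fn, max_chars=2000):
    """map: 对每块各生成局部摘要;reduce: 再对局部摘要做最终汇总。"""
    chunks = chunk_documents(docs, max_chars)
    partial = [summarize_fn(c) for c in chunks]
    return summarize_fn("\n".join(partial))

def fake_llm(text):
    # 用“截取首句”充当摘要函数,仅为演示流程
    return text.split("。")[0][:50]

docs = ["合同甲方应于三十日内付款。其余条款略。", "合同乙方负责交付货物。其余条款略。"]
print(map_reduce_summarize(docs, fake_llm, max_chars=30))
```

长上下文 LLM 的优势在于 max_chars 可以取得非常大,从而减少切块带来的上下文割裂。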

[NLP-33] Exploring Language Model Generalization in Low-Resource Extractive QA

【速读】: 该论文试图解决在大语言模型(LLMs)应用于封闭领域(如医学和法律)的抽取式问答(EQA)任务时,模型在零样本学习(zero-shot)情况下泛化能力不足的问题。解决方案的关键在于通过一系列实验揭示了LLMs在处理封闭领域数据集时的性能瓶颈,包括难以检索长答案跨度、难以区分领域特定词汇的含义、模型参数扩展对跨领域泛化的效果有限,以及封闭领域数据集与开放领域EQA数据集在数量上的显著差异。这些发现为改进现有LLMs提供了重要方向。

链接: https://arxiv.org/abs/2409.18446
作者: Saptarshi Sengupta,Wenpeng Yin,Preslav Nakov,Shreya Ghosh,Suhang Wang
关键词-EN: Extractive Question Answering, investigate Extractive Question, Large Language Models, Question Answering, additional in-domain training
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize well to closed-domains that require specific knowledge such as medicine and law in a zero-shot fashion without additional in-domain training? To this end, we devise a series of experiments to empirically explain the performance gap. Our findings suggest that: a) LLMs struggle with dataset demands of closed-domains such as retrieving long answer-spans; b) Certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements as discriminating between domain-specific senses of words which we link to pre-processing decisions; c) Scaling model parameters is not always effective for cross-domain generalization; and d) Closed-domain datasets are quantitatively much different than open-domain EQA datasets and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.
摘要:本文探讨了在大语言模型 (LLM) 下,提取式问答 (EQA) 在领域漂移 (domain drift) 中的表现,即 LLM 能否在没有额外领域内训练的情况下,以零样本 (zero-shot) 方式很好地泛化到需要特定知识(如医学和法律)的封闭领域?为此,我们设计了一系列实验,以实证解释性能差距。我们的研究发现:a) LLM 在满足封闭领域数据集需求方面存在困难,例如检索长答案跨度;b) 尽管某些 LLM 总体表现强劲,但在满足基本要求方面显示出弱点,例如区分领域特定词汇的含义,这与预处理决策有关;c) 扩展模型参数并不总是对跨领域泛化有效;d) 封闭领域数据集在数量上与开放领域 EQA 数据集有很大不同,当前的 LLM 难以处理这些数据集。我们的研究结果指出了改进现有 LLM 的重要方向。

[NLP-34] Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization NEURIPS2024

【速读】: 该论文试图解决语言模型(LLMs)在处理从简单到复杂的任务时缺乏细粒度难度标注数据集的问题。解决方案的关键在于提出了Easy2Hard-Bench,这是一个包含6个跨领域基准数据集的集合,每个问题都带有数值难度评分。通过收集人类和LLMs在实际尝试中的表现数据,并利用项目反应理论(IRT)和Glicko-2模型等成熟的难度评估系统,为问题分配统一的数值难度评分。此外,该数据集相较于以往的集合,具有更高比例的挑战性问题,旨在通过实验分析当前最先进的LLMs在不同难度级别上的表现和泛化能力,以推动未来LLM泛化研究的发展。

链接: https://arxiv.org/abs/2409.18433
作者: Mucong Ding,Chenghao Deng,Jocelyn Choo,Zichu Wu,Aakriti Agrawal,Avi Schwarzschild,Tianyi Zhou,Tom Goldstein,John Langford,Anima Anandkumar,Furong Huang
关键词-EN: profile language models, fine-grained difficulty annotations, tasks from easy, easy to hard, hard is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:While generalization over tasks from easy to hard is crucial to profile language models (LLMs), the datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still blank. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with numerical difficulty scores. To systematically estimate problem difficulties, we collect abundant performance data on attempts to each problem by humans in the real world or LLMs on the prominent leaderboard. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at this https URL.
摘要:尽管在从简单到复杂的任务中进行泛化对于评估大语言模型 (LLM) 至关重要,但目前仍缺乏对广泛复杂性范围内每个问题进行细粒度难度标注的数据集。为了解决这一局限性,我们提出了 Easy2Hard-Bench,这是一个包含 6 个基准数据集的统一格式集合,涵盖了数学和编程问题、国际象棋谜题以及推理问题等多个领域。这些数据集中的每个问题都标注了数值难度评分。为了系统地估计问题的难度,我们收集了人类在现实世界中或 LLM 在知名排行榜上尝试解决每个问题的丰富表现数据。利用这些丰富的表现数据,我们应用了成熟的难度排序系统,如项目反应理论 (IRT) 和 Glicko-2 模型,为问题统一分配数值难度评分。此外,Easy2Hard-Bench 中的数据集通过更高比例的挑战性问题,与之前的集合区分开来。通过与六个最先进的 LLM 进行广泛实验,我们对其在不同难度级别上的表现和泛化能力进行了全面分析,旨在激发未来在 LLM 泛化方面的研究。数据集可通过此 https URL 获取。
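摘要中提到用项目反应理论 (IRT) 为题目分配数值难度,其最简单的形式是 Rasch(1PL)模型;下面给出一个用梯度上升拟合该模型的示意(假设性实现,与论文的实际流程无关):

```python
import math

def fit_rasch(responses, iters=500, lr=0.1):
    """responses[i][j]: 第 i 个答题者对第 j 题的对错 (1/0)。
    交替更新答题者能力 theta 与题目难度 b,返回难度列表。"""
    n, m = len(responses), len(responses[0])
    theta = [0.0] * n
    b = [0.0] * m
    for _ in range(iters):
        for i in range(n):
            for j in range(m):
                p = 1.0 / (1.0 + math.exp(-(theta[i] - b[j])))
                g = responses[i][j] - p          # 对数似然的梯度
                theta[i] += lr * g
                b[j] -= lr * g
        mean_b = sum(b) / m                      # 固定尺度,防止整体漂移
        b = [x - mean_b for x in b]
    return b

# 三位答题者、三道题:第 0 题几乎人人做对,第 2 题几乎无人做对
data = [[1, 1, 0],
        [1, 0, 0],
        [1, 1, 1]]
difficulty = fit_rasch(data)
print([round(x, 2) for x in difficulty])
```

由于正确率逐题递减,估出的难度应单调递增;真实基准中的答题数据规模远大于此,但原理相同。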

[NLP-35] Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

链接: https://arxiv.org/abs/2409.18428
作者: Brian Yan,Vineel Pratap,Shinji Watanabe,Michael Auli
关键词-EN:
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

[NLP-36] VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback

链接: https://arxiv.org/abs/2409.18417
作者: Guoxi Zhang,Jiuding Duan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
备注: 16 pages, 5 figures

点击查看摘要

[NLP-37] SciDFM: A Large Language Model with Mixture-of-Experts for Science

【速读】: 该论文试图解决现有大型语言模型(LLMs)在科学领域,特别是化学分子和氨基酸序列等特定领域知识上的不足。解决方案的关键在于引入了一个名为SciDFM的混合专家模型(mixture-of-experts LLM),该模型从零开始训练,能够进行大学水平的科学推理并理解分子和氨基酸序列。通过收集包含多学科科学论文和书籍的大规模训练语料库,以及特定领域的数据库数据,并进一步在大量指令数据上微调预训练模型,SciDFM在通用科学基准(如SciEval和SciQ)和特定领域基准上均表现出色,达到了同类模型中的最先进水平。

链接: https://arxiv.org/abs/2409.18412
作者: Liangtai Sun,Danyu Luo,Da Ma,Zihan Zhao,Baocai Chen,Zhennan Shen,Su Zhu,Lu Chen,Xin Chen,Kai Yu
关键词-EN: leveraging large language, assist scientific discovery, amino acid sequences, large language models, significant upsurge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure, 9 tables. Technical Report, Under Review

点击查看摘要

Abstract:Recently, there has been a significant upsurge of interest in leveraging large language models (LLMs) to assist scientific discovery. However, most LLMs only focus on general science, while they lack domain-specific knowledge, such as chemical molecules and amino acid sequences. To bridge these gaps, we introduce SciDFM, a mixture-of-experts LLM, which is trained from scratch and is able to conduct college-level scientific reasoning and understand molecules and amino acid sequences. We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines as well as data from domain-specific databases. We further fine-tune the pre-trained model on lots of instruction data to improve performances on downstream benchmarks. From experiment results, we show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches a SOTA performance on domain-specific benchmarks among models of similar size. We further analyze the expert layers and show that the results of expert selection vary with data from different disciplines. To benefit the broader research community, we open-source SciDFM at this https URL.
摘要:近年来,利用大语言模型 (LLM) 辅助科学发现的研究兴趣显著增加。然而,大多数 LLM 仅关注一般科学领域,缺乏特定领域的知识,如化学分子和氨基酸序列。为了填补这些空白,我们推出了 SciDFM,一个混合专家型 LLM,该模型从零开始训练,能够进行大学水平的科学推理,并理解分子和氨基酸序列。我们收集了一个大规模的训练语料库,包含来自不同学科的众多科学论文和书籍,以及来自特定领域数据库的数据。我们进一步在大量指令数据上对预训练模型进行微调,以提升其在下游基准测试中的表现。实验结果表明,SciDFM 在一般科学基准测试如 SciEval 和 SciQ 上表现出色,并在同类模型中在特定领域基准测试上达到了最先进的性能。我们进一步分析了专家层,并展示了专家选择的结果随不同学科数据的变化而变化。为了惠及更广泛的研究社区,我们在 https URL 上开源了 SciDFM。
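混合专家 (mixture-of-experts) 的核心机制是门控网络为各专家打分,并只激活得分最高的 top-k 个;以下为该机制的一个极简示意(假设性实现,与 SciDFM 的真实结构无关):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, top_k=2):
    """gate_weights[e]: 专家 e 的门控打分向量;experts[e]: 专家函数。
    只激活得分最高的 top_k 个专家,按归一化门控权重加权求和。"""
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)
    return sum(probs[e] / norm * experts[e](x) for e in chosen)

# 三个“专家”分别做不同变换;门控根据输入选择性激活其中两个
experts = [lambda x: 2 * x[0], lambda x: -x[1], lambda x: x[0] + x[1]]
gate = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(moe_forward([1.0, 0.2], experts, gate, top_k=2))
```

摘要中“专家选择随学科数据变化”对应的正是这里 chosen 集合随输入 x 的变化。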

[NLP-38] Defect Prediction with Content-based Features

【速读】: 该论文试图解决传统缺陷预测方法依赖于代码复杂度指标(如代码行数)的局限性,提出了一种基于源代码内容的新方法。解决方案的关键在于从源代码中提取内容相关的特征(如词汇、主题、数据类型和包名),并假设这些特征能够反映软件系统的技术层面及其缺陷倾向性。通过广泛的实证评估,研究者发现内容特征比代码复杂度指标具有更高的预测能力,并且特征选择、降维和组合技术的应用进一步提升了预测性能。

链接: https://arxiv.org/abs/2409.18365
作者: Hung Viet Pham,Tung Thanh Nguyen
关键词-EN: Traditional defect prediction, Traditional defect, design or implementing, number of lines, source code
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Traditional defect prediction approaches often use metrics that measure the complexity of the design or implementing code of a software system, such as the number of lines of code in a source file. In this paper, we explore a different approach based on content of source code. Our key assumption is that source code of a software system contains information about its technical aspects and those aspects might have different levels of defect-proneness. Thus, content-based features such as words, topics, data types, and package names extracted from a source code file could be used to predict its defects. We have performed an extensive empirical evaluation and found that: i) such content-based features have higher predictive power than code complexity metrics and ii) the use of feature selection, reduction, and combination further improves the prediction performance.
摘要:传统的缺陷预测方法通常使用度量软件系统设计或实现代码复杂性的指标,例如源文件中的代码行数。本文中,我们探索了一种基于源代码内容的不同方法。我们的关键假设是,软件系统的源代码包含了其技术方面的信息,而这些方面可能具有不同程度的缺陷倾向性。因此,从源代码文件中提取的内容特征,如单词、主题、数据类型和包名,可以用于预测其缺陷。我们进行了广泛的实证评估,并发现:i) 这些基于内容的特征比代码复杂性度量具有更高的预测能力;ii) 特征选择、降维和组合的使用进一步提高了预测性能。
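从源代码中提取单词、数据类型和包名等内容特征的思路,可以用一个正则草图来示意(假设性实现,仅演示特征提取这一步,不含后续分类器):

```python
import re
from collections import Counter

JAVA_TYPES = {"int", "long", "float", "double", "boolean", "String", "void"}

def content_features(source):
    """从一段(Java 风格)源码中提取三类内容特征:包名、数据类型、普通词。"""
    packages = re.findall(r"^\s*import\s+([\w.]+);", source, flags=re.M)
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    types = Counter(t for t in tokens if t in JAVA_TYPES)
    words = Counter(t.lower() for t in tokens if t not in JAVA_TYPES)
    return {"packages": packages, "types": types, "words": words}

code = """
import java.util.List;
public class Parser {
    int count;
    String parse(String input) { return input; }
}
"""
feats = content_features(code)
print(feats["packages"])         # ['java.util.List']
print(feats["types"]["String"])  # 2
```

这些计数向量即可作为特征送入任意分类器,与论文的假设一致:源码内容本身携带缺陷倾向信息。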

[NLP-39] MultiClimate: Multimodal Stance Detection on Climate Change Videos

【速读】: 该论文试图解决气候变化(CC)立场检测在多模态数据中的挑战,特别是缺乏可靠数据集的问题。解决方案的关键在于提出了MultiClimate,这是一个首个开源的手动标注的CC立场检测数据集,包含100个CC相关的YouTube视频和4,209个帧-转录对。通过部署先进的视觉和语言模型以及多模态模型,论文展示了文本模态(如BERT)在立场检测中的显著优势,并证明了结合文本和图像模态可以实现最先进的性能(准确率/F1值为0.747/0.749)。此外,论文还指出,尽管大型语言模型在多模态立场检测中表现不佳,但多模态融合模型在处理此类任务时仍具有挑战性。

链接: https://arxiv.org/abs/2409.18346
作者: Jiawen Wang,Longfei Zuo,Siyao Peng,Barbara Plank
关键词-EN: attracted increasing attention, Climate change, attention in NLP, NLP in recent, recent years
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art, 0.747 / 0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models. Our code, dataset, as well as supplementary materials, are available at this https URL.
摘要:气候变化 (Climate Change, CC) 近年来在自然语言处理 (NLP) 领域引起了越来越多的关注。然而,在多模态数据中检测对 CC 的立场仍处于研究初期,且由于缺乏可靠的数据集而面临挑战。为了提升对公众意见和传播策略的理解,本文提出了 MultiClimate,这是首个开源的手动标注立场检测数据集,包含 100 个与 CC 相关的 YouTube 视频和 4,209 个帧-转录对。我们部署了最先进的视觉和语言模型,以及多模态模型用于 MultiClimate 的立场检测。结果显示,仅使用文本的 BERT 显著优于仅使用图像的 ResNet50 和 ViT。结合两种模态的模型达到了最先进的水平,准确率/F1 值分别为 0.747/0.749。我们的 100M 规模的融合模型也超越了 CLIP 和 BLIP,以及更大规模的 9B 多模态 IDEFICS 和仅文本的 Llama3 及 Gemma2,这表明多模态立场检测对大语言模型仍具有挑战性。我们的代码、数据集及相关补充材料可在以下链接获取:https URL。
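文本与图像两路模型的预测可以用后期融合 (late fusion) 简单合并;以下是一个假设性示意(与论文的 100M 融合模型无关):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def late_fusion(text_logits, image_logits, alpha=0.7):
    """按权重 alpha 融合文本与图像两路的类别概率,返回预测类别下标。"""
    pt = softmax(text_logits)
    pi = softmax(image_logits)
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(pt, pi)]
    return max(range(len(fused)), key=fused.__getitem__)

# 三分类(支持/中立/反对):文本路信号强,图像路略有分歧
text_logits = [2.0, 0.1, -1.0]
image_logits = [0.2, 0.9, 0.1]
print(late_fusion(text_logits, image_logits))
```

alpha 偏向文本路,呼应了论文中“仅文本的 BERT 显著优于仅图像模型”的结论。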

[NLP-40] A Generalized LLM-Augmented BIM Framework: Application to a Speech-to-BIM system

【速读】: 该论文试图解决建筑信息模型(BIM)任务中因记忆大量命令序列而导致的复杂性和高认知负荷问题。解决方案的关键在于提出一个基于大语言模型(LLM)增强的BIM框架,通过自然语言处理(文本或语音)来替代传统的图形用户界面,从而简化BIM任务的操作流程。该框架包括六个步骤:解释、填充、匹配、结构化、执行和检查,通过这一流程,可以加速开发基于LLM的BIM应用,如论文中展示的语音驱动的BIM应用NADIA-S。

链接: https://arxiv.org/abs/2409.18345
作者: Ghang Lee,Suhyung Jang,Seokho Hyun
关键词-EN: Performing building information, building information modeling, steep learning curve, heavy cognitive load, cognitive load due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Performing building information modeling (BIM) tasks is a complex process that imposes a steep learning curve and a heavy cognitive load due to the necessity of remembering sequences of numerous commands. With the rapid advancement of large language models (LLMs), it is foreseeable that BIM tasks, including querying and managing BIM data, 4D and 5D BIM, design compliance checking, or authoring a design, using written or spoken natural language (i.e., text-to-BIM or speech-to-BIM), will soon supplant traditional graphical user interfaces. This paper proposes a generalized LLM-augmented BIM framework to expedite the development of LLM-enhanced BIM applications by providing a step-by-step development process. The proposed framework consists of six steps: interpret-fill-match-structure-execute-check. The paper demonstrates the applicability of the proposed framework through implementing a speech-to-BIM application, NADIA-S (Natural-language-based Architectural Detailing through Interaction with Artificial Intelligence via Speech), using exterior wall detailing as an example.
摘要:执行建筑信息建模 (BIM) 任务是一个复杂的过程,由于需要记忆大量命令序列,因此学习曲线陡峭、认知负担沉重。随着大语言模型 (LLM) 的快速发展,可以预见,使用书面或口头的自然语言(即文本到 BIM 或语音到 BIM)来完成查询和管理 BIM 数据、4D 和 5D BIM、设计合规性检查或设计创作等 BIM 任务的方式,将很快取代传统的图形用户界面。本文提出了一种通用的 LLM 增强 BIM 框架,通过提供逐步的开发流程,加速 LLM 增强 BIM 应用程序的开发。该框架包括六个步骤:解释-填充-匹配-结构化-执行-检查。本文以外墙细部设计为例,通过实现一个语音到 BIM 应用程序 NADIA-S(通过语音与人工智能交互实现基于自然语言的建筑细部设计),展示了所提出框架的适用性。
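“解释-填充-匹配-结构化-执行-检查”六个步骤可以草拟为如下管道骨架(假设性代码,其中函数逻辑与 API 名称均为演示虚构):

```python
def interpret(utterance):
    """解释:从自然语言中抽取意图(此处用关键词匹配代替 LLM)。"""
    intent = "create_wall" if "墙" in utterance or "wall" in utterance else "unknown"
    return {"intent": intent, "text": utterance}

def fill(parsed, defaults):
    """填充:为缺失参数补默认值。"""
    return {**defaults, **parsed}

def match(request, api_table):
    """匹配:把意图映射到 BIM API 调用名。"""
    return api_table[request["intent"]]

def structure(request, api_name):
    """结构化:组装成可执行的调用描述。"""
    return {"api": api_name, "args": {"height_mm": request["height_mm"]}}

def execute(call):
    """执行:此处仅模拟调用并返回结果。"""
    return {"ok": True, "created": call["api"], **call["args"]}

def check(result):
    """检查:校验执行结果是否满足约束。"""
    return result["ok"] and result["height_mm"] > 0

def pipeline(utterance):
    parsed = interpret(utterance)
    request = fill(parsed, {"height_mm": 3000})
    api_name = match(request, {"create_wall": "Wall.Create"})
    call = structure(request, api_name)
    result = execute(call)
    return result if check(result) else None

print(pipeline("在北侧创建一面墙"))
```

真实系统中,interpret 与 fill 由 LLM 完成,execute 调用 BIM 软件的 API,其余步骤起到约束与纠错作用。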

[NLP-41] AER-LLM: Ambiguity-aware Emotion Recognition Leveraging Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在情感识别中对复杂和模糊情感处理不足的问题。解决方案的关键在于利用LLMs的强大泛化能力和上下文学习能力,通过设计零样本和少样本提示方法,并结合过往对话作为上下文信息,来提升对模糊情感的识别能力。实验结果表明,这种方法显著提升了LLMs在识别模糊情感方面的潜力,并强调了上下文信息在情感识别中的重要性。

链接: https://arxiv.org/abs/2409.18339
作者: Xin Hong,Yuan Gong,Vidhyasaharan Sethu,Ting Dang
关键词-EN: Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated great success in many Natural Language Processing (NLP) tasks. In addition to their cognitive intelligence, exploring their capabilities in emotional intelligence is also crucial, as it enables more natural and empathetic conversational AI. Recent studies have shown LLMs’ capability in recognizing emotions, but they often focus on single emotion labels and overlook the complex and ambiguous nature of human emotions. This study is the first to address this gap by exploring the potential of LLMs in recognizing ambiguous emotions, leveraging their strong generalization capabilities and in-context learning. We design zero-shot and few-shot prompting and incorporate past dialogue as context information for ambiguous emotion recognition. Experiments conducted using three datasets indicate significant potential for LLMs in recognizing ambiguous emotions, and highlight the substantial benefits of including context information. Furthermore, our findings indicate that LLMs demonstrate a high degree of effectiveness in recognizing less ambiguous emotions and exhibit potential for identifying more ambiguous emotions, paralleling human perceptual capabilities.
摘要:近年来,大语言模型 (Large Language Models, LLMs) 在许多自然语言处理 (Natural Language Processing, NLP) 任务中展示了巨大的成功。除了认知智能,探索其在情感智能方面的能力同样至关重要,因为它能够实现更自然和富有同理心的对话式 AI。最近的研究表明,LLMs 具备识别情感的能力,但这些研究往往聚焦于单一情感标签,而忽视了人类情感的复杂性和模糊性。本研究首次填补了这一空白,通过探索 LLMs 在识别模糊情感方面的潜力,利用其强大的泛化能力和上下文学习能力。我们设计了零样本 (zero-shot) 和少样本 (few-shot) 提示,并将过去的对话作为上下文信息用于模糊情感识别。使用三个数据集进行的实验表明,LLMs 在识别模糊情感方面具有显著潜力,并突显了包含上下文信息的重要益处。此外,我们的研究结果表明,LLMs 在识别较少模糊情感方面表现出高度有效性,并展现出识别更多模糊情感的潜力,这与人类的感知能力相平行。
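将过往对话作为上下文并拼接少样本示例的提示构造方式,可以草拟如下(假设性模板,并非论文所用提示词):

```python
def build_prompt(dialogue_history, target_utterance, few_shot_examples=()):
    """构造带上下文的少样本情感识别提示。
    few_shot_examples: (话语, 情感标签) 二元组序列。"""
    lines = ["请判断最后一句话的情感,可给出多个候选以表达模糊性。", ""]
    for utt, label in few_shot_examples:
        lines += [f"话语: {utt}", f"情感: {label}", ""]
    if dialogue_history:
        lines.append("对话上下文:")
        lines += [f"- {turn}" for turn in dialogue_history]
        lines.append("")
    lines += [f"话语: {target_utterance}", "情感:"]
    return "\n".join(lines)

prompt = build_prompt(
    dialogue_history=["A: 面试结果出来了。", "B: 哦?怎么样?"],
    target_utterance="A: 我也说不上来,通过了,但岗位换了。",
    few_shot_examples=[("今天太开心了!", "喜悦")],
)
print(prompt)
```

去掉 few_shot_examples 即为零样本提示;论文的对比实验正是在这两种设置之间进行的。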

[NLP-42] A Fairness-Driven Method for Learning Human-Compatible Negotiation Strategies EMNLP

【速读】: 该论文试图解决AI在谈判领域中的挑战,特别是在处理非零和博弈时,传统基于博弈论的方法难以学习与人类兼容的策略,而仅依赖人类数据的方法则缺乏理论保证。论文提出的解决方案关键在于引入公平性作为优化标准,通过名为FDHC的谈判框架,结合奖励设计和搜索策略,学习与人类兼容的谈判策略。特别地,论文提出了一种新颖的RL+搜索技术LGM-Zero,利用预训练语言模型从大规模动作空间中检索与人类兼容的提议,从而实现更平等的谈判结果并提升谈判质量。

链接: https://arxiv.org/abs/2409.18335
作者: Ryan Shea,Zhou Yu
关键词-EN: recent advancements, remains a difficult, difficult domain, NLP, negotiation
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP Findings 2024

点击查看摘要

Abstract:Despite recent advancements in AI and NLP, negotiation remains a difficult domain for AI agents. Traditional game theoretic approaches that have worked well for two-player zero-sum games struggle in the context of negotiation due to their inability to learn human-compatible strategies. On the other hand, approaches that only use human data tend to be domain-specific and lack the theoretical guarantees provided by strategies grounded in game theory. Motivated by the notion of fairness as a criterion for optimality in general sum games, we propose a negotiation framework called FDHC which incorporates fairness into both the reward design and search to learn human-compatible negotiation strategies. Our method includes a novel, RL+search technique called LGM-Zero which leverages a pre-trained language model to retrieve human-compatible offers from large action spaces. Our results show that our method is able to achieve more egalitarian negotiation outcomes and improve negotiation quality.
摘要:尽管近年来人工智能 (AI) 和自然语言处理 (NLP) 取得了显著进展,但谈判领域对 AI 智能体来说仍然是一个难题。传统的博弈论方法在双人零和博弈中表现良好,但在谈判场景下难以应对,因为它们无法学习与人类兼容的策略。另一方面,仅依赖人类数据的方法往往局限于特定领域,并且缺乏基于博弈论的策略所提供的理论保障。受以公平性作为一般和博弈 (general-sum games) 中最优性标准这一思想的启发,我们提出了名为 FDHC 的谈判框架,该框架将公平性融入奖励设计和搜索中,以学习与人类兼容的谈判策略。我们的方法包括一种新颖的 RL+搜索技术 LGM-Zero,它利用预训练语言模型从大规模动作空间中检索与人类兼容的报价。研究结果表明,该方法能够实现更加平等的谈判结果,并提高谈判质量。
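以公平性为标准设计奖励的一种常见做法,是在总效用之外惩罚双方效用差距;下面是一个假设性的奖励函数草图(并非 FDHC 的真实实现):

```python
def fairness_reward(u_self, u_other, lam=0.5):
    """总效用减去以 lam 加权的效用差距:兼顾收益与平等。"""
    return (u_self + u_other) - lam * abs(u_self - u_other)

def pick_offer(offers, lam=0.5):
    """在候选报价 (u_self, u_other) 中选公平性奖励最高者。"""
    return max(offers, key=lambda o: fairness_reward(*o, lam=lam))

# 同样的总收益 10,分配越平等奖励越高
print(fairness_reward(5, 5))   # 10.0
print(fairness_reward(9, 1))   # 6.0
print(pick_offer([(9, 1), (6, 4), (10, 0)]))  # (6, 4)
```

lam 控制“效率-平等”的权衡:lam=0 退化为只最大化总收益,lam 越大越偏向平等的分配。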

[NLP-43] Cross-Institutional Structured Radiology Reporting for Lung Cancer Screening Using a Dynamic Template-Constrained Large Language Model

【速读】: 该论文旨在解决当前大型语言模型(LLMs)在生成结构化放射学报告时面临的格式错误、内容幻觉和隐私泄露问题。解决方案的关键在于开发了一种增强的开源LLM,通过模板约束解码(template-constrained decoding)技术,从自由文本描述中生成结构化和标准化的肺结节报告。研究团队设计了一个包含29个特征的标准化报告模板,并在此基础上对LLAMA、Qwen和Mistral等开源LLM进行了改进。实验结果表明,该方法在多机构数据集上显著提升了LLM的性能,减少了格式错误和内容幻觉,且无需将数据上传至外部服务器,从而避免了隐私泄露风险。此外,研究还成功构建了一个基于增强LLM技术的结节级检索系统,并进行了自动统计分析,验证了其与现有研究结果的高度一致性。

链接: https://arxiv.org/abs/2409.18319
作者: Chuang Niu,Parisa Kaviani,Qing Lyu,Mannudeep K. Kalra,Christopher T. Whitlow,Ge Wang
关键词-EN: optimizing clinical workflows, Structured radiology reporting, patient outcomes, advantageous for optimizing, optimizing clinical
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Structured radiology reporting is advantageous for optimizing clinical workflows and patient outcomes. Current LLMs in creating structured reports face the challenges of formatting errors, content hallucinations, and privacy leakage concerns when uploaded to external servers. We aim to develop an enhanced open-source LLM for creating structured and standardized LCS reports from free-text descriptions. After institutional IRB approvals, 5,442 de-identified LCS reports from two institutions were retrospectively analyzed. 500 reports were randomly selected from the two institutions evenly and then manually labeled for evaluation. Two radiologists from the two institutions developed a standardized template including 29 features for lung nodule reporting. We proposed template-constrained decoding to enhance state-of-the-art open-source LLMs, including LLAMA, Qwen, and Mistral. The LLM performance was extensively evaluated in terms of F1 score, confidence interval, McNemar test, and z-test. Based on the structured reports created from the large-scale dataset, a nodule-level retrieval system was prototyped and an automatic statistical analysis was performed. Our software, vLLM-structure, is publicly available for local deployment with enhanced LLMs. Our template-constrained decoding approach consistently enhanced the LLM performance on multi-institutional datasets, with neither formatting errors nor content hallucinations. Our method improved the best open-source LLAMA-3.1 405B by up to 10.42%, and outperformed GPT-4o by 17.19%. A novel nodule retrieval system was successfully prototyped and demonstrated on a large-scale multimodal database using our enhanced LLM technologies. The automatically derived statistical distributions were closely consistent with the prior findings in terms of nodule type, location, size, status, and Lung-RADS.
摘要:结构化的放射学报告在优化临床工作流程和患者结果方面具有优势。当前的大语言模型 (LLM) 在创建结构化报告时面临格式错误、内容幻觉以及上传至外部服务器时的隐私泄露问题。我们的目标是开发一个增强的开源大语言模型,用于从自由文本描述中生成结构化和标准化的 LCS 报告。在获得机构 IRB 批准后,我们对来自两家机构的 5,442 份去识别化的 LCS 报告进行了回顾性分析。从两家机构中均匀随机选择了 500 份报告,并进行了手动标注以进行评估。两家机构的两位放射科医生开发了一个包含 29 个特征的标准化模板,用于肺结节报告。我们提出了模板约束解码方法,以增强包括 LLAMA、Qwen 和 Mistral 在内的最先进的开源大语言模型。我们对大语言模型的性能进行了广泛评估,包括 F1 分数、置信区间、McNemar 检验和 z 检验。基于从大规模数据集生成的结构化报告,我们原型化了一个结节级检索系统,并进行了自动统计分析。我们的软件 vLLM-structure 公开可用,支持本地部署增强的大语言模型。我们的模板约束解码方法在多机构数据集上持续提升了大语言模型的性能,且无格式错误或内容幻觉。我们的方法将最佳开源 LLAMA-3.1 405B 的性能提升了高达 10.42%,并超越了 GPT-4o 17.19%。我们成功原型化并展示了一个新颖的结节检索系统,该系统使用我们增强的大语言模型技术在大规模多模态数据库上运行。自动推导的统计分布在结节类型、位置、大小、状态和 Lung-RADS 方面与先前研究结果高度一致。
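模板约束解码的核心是:模板中的固定字段照抄,槽位处只允许从受限候选集中选值;下面给出一个与具体 LLM 无关的简化示意(假设性实现,scorer 用“候选值是否出现在原文”代替 LLM 的似然打分):

```python
def constrained_fill(template, slot_options, scorer):
    """template: 含 {slot} 占位符的报告模板;
    slot_options: 每个槽位允许的取值集合;
    scorer: 为 (槽位, 候选值) 打分的函数(此处代替 LLM 的似然)。"""
    values = {}
    for slot, options in slot_options.items():
        values[slot] = max(options, key=lambda v: scorer(slot, v))
    return template.format(**values)

template = "结节位置: {location}; 大小: {size}; Lung-RADS: {lungrads}"
options = {
    "location": ["左上叶", "右下叶"],
    "size": ["4mm", "8mm"],
    "lungrads": ["2", "3", "4A"],
}
free_text = "右下叶见一约8mm结节,Lung-RADS 3类"

def scorer(slot, value):
    return 1 if value in free_text else 0  # 简化:候选值出现在原文则得分

print(constrained_fill(template, options, scorer))
```

由于输出只能由模板字段与受限候选组成,格式错误和内容幻觉在构造上就被排除了,这正是该方法的动机。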

[NLP-44] Realistic Evaluation of Model Merging for Compositional Generalization

【速读】: 该论文旨在解决模型合并方法在不同实验设置和假设下的相对优劣问题。解决方案的关键在于通过统一的实验设置,评估不同合并方法在图像分类、图像生成和自然语言处理中的组合泛化能力,并精确识别每种方法的实际需求。此外,论文还测量了不同合并方法的计算成本及其在合并多个模型时的表现,从而为模型合并领域的研究提供了全面且严谨的实验框架。

Link: https://arxiv.org/abs/2409.18314
Authors: Derek Tam,Yash Kant,Brian Lester,Igor Gilitschenski,Colin Raffel
Keywords: cheaply combine individual, combine individual models, attains better performance, cheaply combine, combine individual
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup to test new methods.
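As a concrete anchor for what "merging" means here, the sketch below implements the simplest baseline in this space, uniform parameter averaging (often called a "model soup"). The dict-of-lists model format and the function name are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of the simplest merging baseline: uniform parameter
# averaging across models that share one architecture. Real merging methods
# evaluated in work like this add weighting, sign resolution, etc.
from typing import Dict, List

def merge_uniform(models: List[Dict[str, list]]) -> Dict[str, list]:
    """Average each named parameter elementwise across the models."""
    merged = {}
    for name in models[0]:
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(vals) / len(models) for vals in columns]
    return merged

# Two toy "models", each with one flattened weight vector.
m1 = {"w": [1.0, 2.0, 3.0]}
m2 = {"w": [3.0, 4.0, 5.0]}
print(merge_uniform([m1, m2]))  # {'w': [2.0, 3.0, 4.0]}
```

Methods compared in shared settings like this one differ mainly in how they replace the naive average above (e.g., with task-vector arithmetic or Fisher-weighted sums).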

[NLP-45] Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing

【Quick Read】: This paper comprehensively reviews and empirically evaluates multimodal large language models (MLLMs) and large vision models (VLMs) for object detection in transportation systems. The key contribution is an empirical analysis that tests MLLMs on three real-world transportation problems involving object detection: road safety attribute extraction, safety-critical event detection, and visual reasoning over thermal images. Through a detailed performance assessment, the study reveals both strengths and areas for improvement, discusses the practical limitations and challenges of MLLMs for transportation object detection, and offers a roadmap for future research and development.

Link: https://arxiv.org/abs/2409.18286
Authors: Huthaifa I. Ashqar,Ahmed Jaber,Taqwa I. Alhadidi,Mohammed Elhenawy
Keywords: Large Vision Models, large language models, multimodal large language, Vision Models, Large Vision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:This study aims to comprehensively review and empirically evaluate the application of multimodal large language models (MLLMs) and Large Vision Models (VLMs) in object detection for transportation systems. In the first fold, we provide a background about the potential benefits of MLLMs in transportation applications and conduct a comprehensive review of current MLLM technologies in previous studies. We highlight their effectiveness and limitations in object detection within various transportation scenarios. The second fold involves providing an overview of the taxonomy of end-to-end object detection in transportation applications and future directions. Building on this, we proposed empirical analysis for testing MLLMs on three real-world transportation problems that include object detection tasks namely, road safety attributes extraction, safety-critical event detection, and visual reasoning of thermal images. Our findings provide a detailed assessment of MLLM performance, uncovering both strengths and areas for improvement. Finally, we discuss practical limitations and challenges of MLLMs in enhancing object detection in transportation, thereby offering a roadmap for future research and development in this critical area.

[NLP-46] DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking

【Quick Read】: This paper tackles distractor generation for multiple-choice questions (MCQs). Its key contribution is a generic framework built on readily available pre-trained language models (PLMs) that requires no additional training or fine-tuning: a two-stage pipeline of candidate generation followed by candidate selection produces distractors that are more effective and engaging than those of prior methods, as confirmed by human evaluation.

Link: https://arxiv.org/abs/2409.18263
Authors: Devrim Cavusoglu,Secil Sen,Ulas Sert
Keywords: Natural Language Processing, natural language inference, Natural Language, Recent advancements, impacted numerous sub-fields
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advancements in Natural Language Processing (NLP) have impacted numerous sub-fields such as natural language generation, natural language inference, question answering, and more. However, in the field of question generation, the creation of distractors for multiple-choice questions (MCQ) remains a challenging task. In this work, we present a simple, generic framework for distractor generation using readily available Pre-trained Language Models (PLMs). Unlike previous methods, our framework relies solely on pre-trained language models and does not require additional training on specific datasets. Building upon previous research, we introduce a two-stage framework consisting of candidate generation and candidate selection. Our proposed distractor generation framework outperforms previous methods without the need for training or fine-tuning. Human evaluations confirm that our approach produces more effective and engaging distractors. The related codebase is publicly available at this https URL.
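The generate-then-select shape of the framework can be sketched as below. This is a hedged toy: the candidate pool and the scoring function stand in for the PLM-based span masking and ranking the paper actually uses.

```python
# Two-stage distractor pipeline in the spirit of DisGeM: propose candidates,
# then rank and keep the best. The pool and scorer here are toy stand-ins.
def generate_candidates(question: str, answer: str, pool: list) -> list:
    """Stage 1: propose candidates, excluding the correct answer itself."""
    return [c for c in pool if c.lower() != answer.lower()]

def select_distractors(candidates: list, score, k: int = 3) -> list:
    """Stage 2: keep the k highest-scoring candidates as distractors."""
    return sorted(candidates, key=score, reverse=True)[:k]

answer = "Paris"
pool = ["Paris", "Lyon", "Berlin", "Rome", "Madrid"]
cands = generate_candidates("Capital of France?", answer, pool)

# Toy scorer that prefers longer candidates; a real system would rank by
# PLM likelihood and semantic fit instead.
distractors = select_distractors(cands, score=len)
print(distractors)  # ['Berlin', 'Madrid', 'Lyon']
```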

[NLP-47] MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark

【Quick Read】: This paper addresses the evaluation of instruction following in multimodal, multi-turn dialogue, where multiple instructions may appear in the model's input context and human evaluation is slow and prone to bias. The key contribution is the MMMT-IF evaluation set, which constrains answer formats via images and global instructions, challenging models to retrieve instructions dispersed across long dialogues and to reason under those constraints. The paper also introduces the Programmatic Instruction Following (PIF) metric, which verifies instruction compliance objectively through code execution, and the PIF-N-K metric family, which measures robustness across repeated responses. Experiments show that even state-of-the-art models degrade sharply at instruction following over many turns, while appending all instructions to the end of the model's input context substantially improves the PIF metric.

Link: https://arxiv.org/abs/2409.18216
Authors: Elliot L. Epstein,Kaisheng Yao,Jing Li,Xinyi Bai,Hamid Palangi
Keywords: Evaluating instruction, instructions, PIF, capabilities for multimodal
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages, 16 figures

Click to view abstract

Abstract:Evaluating instruction following capabilities for multimodal, multi-turn dialogue is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters and we show LLM based judges are biased towards answers from the same model. We propose MMMT-IF, an image based multi-turn Q&A evaluation set with added global instructions between questions, constraining the answer format. This challenges models to retrieve instructions dispersed across long dialogues and reason under instruction constraints. All instructions are objectively verifiable through code execution. We introduce the Programmatic Instruction Following (PIF) metric to measure the fraction of the instructions that are correctly followed while performing a reasoning task. The PIF-N-K set of metrics further evaluates robustness by measuring the fraction of samples in a corpus where, for each sample, at least K out of N generated model responses achieve a PIF score of one. The PIF metric aligns with human instruction following ratings, showing 60 percent correlation. Experiments show Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet, have a PIF metric that drops from 0.81 on average at turn 1 across the models, to 0.64 at turn 20. Across all turns, when each response is repeated 4 times (PIF-4-4), GPT-4o and Gemini successfully follow all instructions only 11% of the time. When all the instructions are also appended to the end of the model input context, the PIF metric improves by 22.3 points on average, showing that the challenge with the task lies not only in following the instructions, but also in retrieving the instructions spread out in the model context. We plan to open source the MMMT-IF dataset and metric computation code.
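The two metrics have a direct computational reading, sketched below. The checker functions are illustrative stand-ins for the paper's programmatic, code-executed instruction checks.

```python
# PIF: fraction of instructions a response follows.
# PIF-N-K: fraction of corpus items where at least K of N sampled responses
# reach PIF = 1. Checkers here are toy stand-ins for programmatic checks.
def pif(response: str, checks) -> float:
    """Fraction of instruction checks the response passes."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)

def pif_n_k(samples, checks, k: int) -> float:
    """samples: one list of N responses per corpus item."""
    hits = sum(
        1 for responses in samples
        if sum(pif(r, checks) == 1.0 for r in responses) >= k
    )
    return hits / len(samples)

# Two toy instructions: the answer must be uppercase and end with a period.
checks = [str.isupper, lambda r: r.endswith(".")]
print(pif("HELLO.", checks))   # 1.0
print(pif("hello.", checks))   # 0.5
print(pif_n_k([["HELLO.", "hi"], ["NO", "no"]], checks, k=1))  # 0.5
```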

[NLP-48] AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking

【Quick Read】: This paper addresses the difficulty of assessing how well a large language model policy covers the unbounded variety of real-world situations it must contend with. The key contribution is Policy Projector, an AI policy design process inspired by mapmaking: policy designers survey a map of model input-output pairs, define custom regions (e.g., "violence"), and navigate those regions with rules applied to LLM outputs (e.g., rewriting outputs containing both "violence" and "graphic details" to remove the "graphic details"). Policy Projector supports interactive policy authoring via LLM classification and steering, plus a map visualization that reflects the designer's work, helping policy designers address problematic model behaviors beyond an existing comprehensive harm taxonomy.

Link: https://arxiv.org/abs/2409.18203
Authors: Michelle S. Lam,Fred Hohman,Dominik Moritz,Jeffrey P. Bigham,Kenneth Holstein,Mary Beth Kery
Keywords: large language model, implicit reward model, large language, explicit constitution, implicit reward
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Whether a large language model policy is an explicit constitution or an implicit reward model, it is challenging to assess coverage over the unbounded set of real-world situations that a policy must contend with. We introduce an AI policy design process inspired by mapmaking, which has developed tactics for visualizing and iterating on maps even when full coverage is not possible. With Policy Projector, policy designers can survey the landscape of model input-output pairs, define custom regions (e.g., “violence”), and navigate these regions with rules that can be applied to LLM outputs (e.g., if output contains “violence” and “graphic details,” then rewrite without “graphic details”). Policy Projector supports interactive policy authoring using LLM classification and steering and a map visualization reflecting the policy designer’s work. In an evaluation with 12 AI safety experts, our system helps policy designers to address problematic model behaviors extending beyond an existing, comprehensive harm taxonomy.

[NLP-49] LangSAMP: Language-Script Aware Multilingual Pretraining

【Quick Read】: This paper addresses the problem that multilingual pretrained language models (mPLMs), by avoiding language embeddings, push all language-specific information onto token embeddings, which can hinder language-neutral representations. The key contribution is Language-Script Aware Multilingual Pretraining (LangSAMP), which integrates both language and script embeddings into the output of the transformer blocks, enhancing representation learning while keeping the architecture simple. The approach not only improves multilingual performance but also improves the selection of source languages for crosslingual transfer.

Link: https://arxiv.org/abs/2409.18199
Authors: Yihong Liu,Haotian Ye,Chunlan Ma,Mingyang Wang,Hinrich Schütze
Keywords: Recent multilingual pretrained, learnable vectors assigned, Recent multilingual, learnable vectors, vectors assigned
Subjects: Computation and Language (cs.CL)
Comments: preprint

Click to view abstract

Abstract:Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model’s ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer. We make our code and models publicly available at this https URL.
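The core mechanism stated in the abstract, adding a language vector and a script vector to each token's transformer-block output before the LM head, can be sketched in a few lines. Dimensions, language/script codes, and the plain-Python representation are illustrative only; the real model modifies XLM-R.

```python
# Toy sketch of LangSAMP's additive language/script embeddings. Each token's
# last-block hidden vector gets the language vector and script vector added
# before the language-modeling head sees it.
D = 4  # toy hidden size
lang_emb = {"deu": [1.0] * D, "hin": [0.5] * D}       # learned per-language
script_emb = {"Latn": [2.0] * D, "Deva": [0.25] * D}  # learned per-script

def langsamp_output(hidden, lang, script):
    """hidden: list of per-token vectors from the last transformer block."""
    le, se = lang_emb[lang], script_emb[script]
    return [[h + l + s for h, l, s in zip(tok, le, se)] for tok in hidden]

# A 2-token German/Latin-script sequence of zero vectors, for illustration.
out = langsamp_output([[0.0] * D, [0.0] * D], "deu", "Latn")
print(out)  # [[3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0]]
```

Because the addition happens only at the block output, the model still needs no language ID at the tokenization stage, matching the "universal text encoder" constraint the abstract describes.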

[NLP-50] LowREm: A Repository of Word Embeddings for 87 Low-Resource Languages Enhanced with Multilingual Graph Knowledge

【Quick Read】: This paper addresses the poor coverage of low-resource languages in large language models (LLMs), where data scarcity and high compute costs make training LLMs impractical. The key contribution is LowREm, a centralized repository of static embeddings for 87 low-resource languages, together with a novel method that enhances GloVe-based embeddings by integrating multilingual graph knowledge; the enhanced embeddings outperform contextualized embeddings extracted from XLM-R on sentiment analysis.

Link: https://arxiv.org/abs/2409.18193
Authors: Daniil Gurgurov,Rishu Kumar,Simon Ostermann
Keywords: large language models, lower resourced languages, based on large, limited for lower, lower resourced
Subjects: Computation and Language (cs.CL)
Comments: Short paper, preview

Click to view abstract

Abstract:Contextualized embeddings based on large language models (LLMs) are available for various languages, but their coverage is often limited for lower resourced languages. Training LLMs for such languages is often difficult due to insufficient data and high computational cost. Especially for very low resource languages, static word embeddings thus still offer a viable alternative. There is, however, a notable lack of comprehensive repositories with such embeddings for diverse languages. To address this, we present LowREm, a centralized repository of static embeddings for 87 low-resource languages. We also propose a novel method to enhance GloVe-based embeddings by integrating multilingual graph knowledge, utilizing another source of knowledge. We demonstrate the superior performance of our enhanced embeddings as compared to contextualized embeddings extracted from XLM-R on sentiment analysis. Our code and data are publicly available under this https URL.
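One standard mechanism for injecting graph knowledge into static embeddings is retrofitting, which pulls each word vector toward its graph neighbors while anchoring it to its original position. Treat the concrete update rule below as an illustrative assumption, not LowREm's exact method.

```python
# Retrofitting-style sketch: iteratively move each vector toward the average
# of its graph neighbors, weighted against its original (distributional)
# vector. alpha anchors to the original; beta weights each neighbor.
def retrofit(vectors, graph, iters=10, alpha=1.0, beta=1.0):
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iters):
        for w, nbrs in graph.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            for i in range(len(new[w])):
                nbr_sum = sum(new[n][i] for n in nbrs)
                new[w][i] = (alpha * vectors[w][i] + beta * nbr_sum) / (
                    alpha + beta * len(nbrs))
    return new

vecs = {"cat": [1.0, 0.0], "feline": [0.0, 1.0]}
graph = {"cat": ["feline"]}
out = retrofit(vecs, graph)
# "cat" moves toward its graph neighbor while staying anchored to its
# original vector.
print(out["cat"])  # [0.5, 0.5]
```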

[NLP-51] Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review

【Quick Read】: This paper addresses the reliability of model evaluation for clinical natural language generation, where the high-stakes nature of medical text processing demands trustworthy assessment. The key contribution is a set of proposed future evaluation directions that address the resource constraints of expert human evaluation, so that clinical summarization tasks can be evaluated both reliably and efficiently.

Link: https://arxiv.org/abs/2409.18170
Authors: Emma Croxford,Yanjun Gao,Nicholas Pellegrino,Karen K. Wong,Graham Wills,Elliot First,Frank J. Liao,Cherodeep Goswami,Brian Patterson,Majid Afshar
Keywords: Large Language Models, Natural Language Generation, clinical Natural Language, Large Language, Language Generation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models have advanced clinical Natural Language Generation, creating opportunities to manage the volume of medical text. However, the high-stakes nature of medicine requires reliable evaluation, which remains a challenge. In this narrative review, we assess the current evaluation state for clinical summarization tasks and propose future directions to address the resource constraints of expert human evaluation.

[NLP-52] Data-Prep-Kit: getting your data ready for LLM application development

【Quick Read】: This paper addresses data preparation for large language model (LLM) development, specifically how to prepare natural language and code data efficiently and flexibly. The key contribution is Data Prep Kit (DPK), an open-source toolkit that is easy to use, extensible, and scale-flexible: users can prepare data on a local machine or scale out to clusters ranging from a handful of CPUs to thousands. DPK ships a set of highly scalable yet extensible modules for transforming natural language and code data; users can easily develop new transforms, and modules can run independently or be pipelined into a sequence of operations, improving LLM performance or supporting model fine-tuning with Retrieval-Augmented Generation (RAG).

Link: https://arxiv.org/abs/2409.18164
Authors: David Wood,Boris Lublinsky,Alexy Roytman,Shivdeep Singh,Abdulhamid Adebayo,Revital Eres,Mohammad Nassar,Hima Patel,Yousaf Shah,Constantin Adam,Petros Zerfos,Nirmit Desai,Daiki Tsuzuku,Takuya Goto,Michele Dolfi,Saptha Surendran,Paramesvaran Selvam,Sungeun An,Yuan Chi Chang,Dhiraj Joshi,Hajar Emami-Gohari,Xuan-Hong Dang,Yan Koyfman,Shahrokh Daijavad
Keywords: Data Prep Kit, DPK, Data preparation, Prep Kit, Data
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures

Click to view abstract

Abstract:Data preparation is the first and a very important step towards any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU Cores. DPK comes with a highly scalable, yet extensible set of modules that transform natural language and code data. If the user needs additional transforms, they can be easily developed using extensive DPK support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLM models or to fine-tune models with Retrieval-Augmented Generation (RAG).
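The "modules used independently or pipelined" idea reads naturally as record-to-record transforms composed in sequence. The sketch below is an illustration of that shape only; the function names are not DPK's API, and DPK's real modules additionally scale out via distributed runtimes.

```python
# Illustrative data-prep pipeline: each transform maps a list of records to
# a list of records, and a pipeline simply composes transforms in order.
def dedupe(records):
    """Drop records whose text was already seen (exact duplicates)."""
    seen, out = set(), []
    for r in records:
        if r["text"] not in seen:
            seen.add(r["text"])
            out.append(r)
    return out

def filter_short(records, min_len=5):
    """Drop records with very short text."""
    return [r for r in records if len(r["text"]) >= min_len]

def pipeline(records, transforms):
    for t in transforms:
        records = t(records)
    return records

docs = [{"text": "hello world"}, {"text": "hello world"}, {"text": "hi"}]
clean = pipeline(docs, [dedupe, filter_short])
print(clean)  # [{'text': 'hello world'}]
```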

Artificial Intelligence

[AI-0] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation ECCV2024

Link: https://arxiv.org/abs/2409.18964
Authors: Shaowei Liu,Zhongzheng Ren,Saurabh Gupta,Shenlong Wang
Keywords: temporally consistent video, input condition, force and torque, method that converts, converts a single
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ECCV 2024. Project page: this https URL

Click to view abstract

Abstract:We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., force and torque applied to an object in the image) to produce a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are even precisely controllable, showcasing superior results over existing data-driven image-to-video generation works through quantitative comparison and comprehensive user study. PhysGen’s resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Project page: this https URL

[AI-1] Exploring Token Pruning in Vision State Space Models NEURIPS’24

Link: https://arxiv.org/abs/2409.18962
Authors: Zheng Zhan,Zhenglun Kong,Yifan Gong,Yushu Wu,Zichong Meng,Hangyu Zheng,Xuan Shen,Stratis Ioannidis,Wei Niu,Pu Zhao,Yanzhi Wang
Keywords: State Space Models, powerful vision foundation, keeping linear computational, linear computational complexity, computational complexity compared
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS'24

Click to view abstract

Abstract:State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observations that the final prediction in vision transformers (ViTs) is only based on a subset of most informative tokens, we take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning. However, direct applications of existing token pruning techniques designed for ViTs fail to deliver good performance, even with extensive fine-tuning. To address this issue, we revisit the unique computational characteristics of SSMs and discover that naive application disrupts the sequential token positions. This insight motivates us to design a novel and general token pruning method specifically for SSM-based vision models. We first introduce a pruning-aware hidden state alignment method to stabilize the neighborhood of remaining tokens for performance enhancement. Besides, based on our detailed analysis, we propose a token importance evaluation method adapted for SSM models, to guide the token pruning. With efficient implementation and practical acceleration methods, our method brings actual speedup. Extensive experiments demonstrate that our approach can achieve significant computation reduction with minimal impact on performance across different tasks. Notably, we achieve 81.7% accuracy on ImageNet with a 41.6% reduction in the FLOPs for pruned PlainMamba-L3. Furthermore, our work provides deeper insights into understanding the behavior of SSM-based vision models for future research.
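The paper's observation that naive pruning "disrupts the sequential token positions" suggests the minimal shape below: rank tokens by importance, but emit the kept tokens in their original sequence order. The importance scores here are a stand-in for the paper's SSM-adapted evaluation, and the alignment step is reduced to order preservation.

```python
# Importance-based token pruning that preserves sequence order, which SSMs
# depend on. Scores are a placeholder for an SSM-adapted importance measure.
def prune_tokens(tokens, scores, keep: int):
    """Keep the `keep` highest-scoring tokens, in their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i],
                    reverse=True)[:keep]
    kept = sorted(ranked)  # restore sequential positions, crucial for SSMs
    return [tokens[i] for i in kept]

tokens = ["t0", "t1", "t2", "t3"]
scores = [0.9, 0.1, 0.7, 0.3]
print(prune_tokens(tokens, scores, keep=2))  # ['t0', 't2']
```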

[AI-2] ProMerge: Prompt and Merge for Unsupervised Instance Segmentation ECCV2024

Link: https://arxiv.org/abs/2409.18961
Authors: Dylan Li,Gyungin Shin
Keywords:
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ECCV2024 camera-ready

Click to view abstract

[AI-3] O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions

Link: https://arxiv.org/abs/2409.18959
Authors: Gen Li,Yuling Yan
Keywords: Score-based diffusion models, Score-based diffusion, achieved remarkable success, diffusion models, generative tasks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Score-based diffusion models, which generate new data by learning to reverse a diffusion process that perturbs data from the target distribution into noise, have achieved remarkable success across various generative tasks. Despite their superior empirical performance, existing theoretical guarantees are often constrained by stringent assumptions or suboptimal convergence rates. In this paper, we establish a fast convergence theory for a popular SDE-based sampler under minimal assumptions. Our analysis shows that, provided \ell_2 -accurate estimates of the score functions, the total variation distance between the target and generated distributions is upper bounded by O(d/T) (ignoring logarithmic factors), where d is the data dimensionality and T is the number of steps. This result holds for any target distribution with finite first-order moment. To our knowledge, this improves upon existing convergence theory for both the SDE-based sampler and another ODE-based sampler, while imposing minimal assumptions on the target data distribution and score estimates. This is achieved through a novel set of analytical tools that provides a fine-grained characterization of how the error propagates at each step of the reverse process.
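Restated in symbols, the headline guarantee above (suppressing logarithmic factors, and assuming only \ell_2-accurate score estimates and a target distribution with finite first-order moment) reads:

```latex
\mathrm{TV}\bigl(p_{\mathrm{target}},\, p_{\mathrm{generated}}\bigr)
  \;\le\; O\!\left(\frac{d}{T}\right),
\qquad d = \text{data dimensionality},\quad T = \text{number of reverse steps}.
```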

[AI-4] LML: Language Model Learning a Dataset for Data-Augmented Prediction

Link: https://arxiv.org/abs/2409.18957
Authors: Praneeth Vadlapati
Keywords: Large Language Models, Language Model Learning, Large Language, Machine Learning Model, Explainable Machine Learning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: First version

Click to view abstract

Abstract:This paper introduces a new approach to using Large Language Models (LLMs) for classification tasks, which are typically handled using Machine Learning (ML) models. Unlike ML models that rely heavily on data cleaning and feature engineering, this method streamlines the process using LLMs. This paper proposes a new concept called “Language Model Learning (LML)” powered by a new method called “Data-Augmented Prediction (DAP)”. The classification is performed by LLMs using a method similar to humans manually exploring and understanding the data and deciding classifications using data as a reference. Training data is summarized and evaluated to determine the features that lead to the classification of each label the most. In the process of DAP, the system uses the data summary to automatically create a query, which is used to retrieve relevant rows from the dataset. A classification is generated by the LLM using data summary and relevant rows, ensuring satisfactory accuracy even with complex data. Usage of data summary and similar data in DAP ensures context-aware decision-making. The proposed method uses the words “Act as an Explainable Machine Learning Model” in the prompt to enhance the interpretability of the predictions by allowing users to review the logic behind each prediction. In some test cases, the system scored an accuracy above 90%, proving the effectiveness of the system and its potential to outperform conventional ML models in various scenarios. The code is available at this https URL
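The DAP loop the abstract describes (summarize the labeled data, retrieve rows similar to the input, and hand summary plus rows to an LLM with an "Act as an Explainable Machine Learning Model" prompt) can be made concrete as below. `call_llm` is a placeholder, and the summary/retrieval heuristics are ours, not the paper's.

```python
# Schematic of Data-Augmented Prediction: data summary + similar rows go
# into an LLM prompt; call_llm is a stub standing in for a real model call.
from collections import Counter

def summarize(rows):
    """Per-label feature counts: a crude stand-in for the data summary."""
    labels = {label for _, label in rows}
    return {label: Counter(f for feats, l in rows if l == label for f in feats)
            for label in labels}

def retrieve(rows, query, k=2):
    """Rows sharing the most features with the query."""
    return sorted(rows, key=lambda rl: -len(set(rl[0]) & set(query)))[:k]

def classify(query, rows, call_llm):
    prompt = ("Act as an Explainable Machine Learning Model.\n"
              f"Data summary: {summarize(rows)}\n"
              f"Relevant rows: {retrieve(rows, query)}\n"
              f"Classify and explain: {query}")
    return call_llm(prompt)

rows = [(["red", "round"], "apple"), (["yellow", "long"], "banana")]
# A stub "LLM" that echoes the final prompt line, for demonstration.
print(classify(["red", "small"], rows, call_llm=lambda p: p.splitlines()[-1]))
# Classify and explain: ['red', 'small']
```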

[AI-5] Building Trust Through Voice: How Vocal Tone Impacts User Perception of Attractiveness of Voice Assistants

Link: https://arxiv.org/abs/2409.18941
Authors: Sabid Bin Habib Pias,Alicia Freel,Ran Huang,Donald Williamson,Minjeong Kim,Apu Kapadia
Keywords: Voice Assistants, online shopping, popular for simple, activities like online, simple tasks
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Extended Abstract

Click to view abstract

Abstract:Voice Assistants (VAs) are popular for simple tasks, but users are often hesitant to use them for complex activities like online shopping. We explored whether the vocal characteristics like the VA’s vocal tone, can make VAs perceived as more attractive and trustworthy to users for complex tasks. Our findings show that the tone of the VA voice significantly impacts its perceived attractiveness and trustworthiness. Participants in our experiment were more likely to be attracted to VAs with positive or neutral tones and ultimately trusted the VAs they found more attractive. We conclude that VA’s perceived trustworthiness can be enhanced through thoughtful voice design, incorporating a variety of vocal tones.

[AI-6] From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

Link: https://arxiv.org/abs/2409.18938
Authors: Heqing Zou,Tianze Luo,Guiyang Xie,Victor(Xiao Jie)Zhang,Fengmao Lv,Guangcong Wang,Juanyang Chen,Zhuochen Wang,Hansheng Zhang,Huaijian Zhang
Keywords: Large Language Models, Large Language, MultiModal Large Language, Language Models, recently shown promising
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages

Click to view abstract

Abstract:The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance in visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual reasoning. Given the diverse nature of visual data, MultiModal Large Language Models (MM-LLMs) exhibit variations in model designing and training for understanding images, short videos, and long videos. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding. Unlike static images, short videos encompass sequential frames with both spatial and within-event temporal information, while long videos consist of multiple events with between-event and long-term temporal information. In this survey, we aim to trace and summarize the advancements of MM-LLMs from image understanding to long video understanding. We review the differences among various visual understanding tasks and highlight the challenges in long video understanding, including more fine-grained spatiotemporal details, dynamic events, and long-term dependencies. We then provide a detailed summary of the advancements in MM-LLMs in terms of model design and training methodologies for understanding long videos. Finally, we compare the performance of existing MM-LLMs on video understanding benchmarks of various lengths and discuss potential future directions for MM-LLMs in long video understanding.

[AI-7] AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow

Link: https://arxiv.org/abs/2409.18924
Authors: Huizi Yu,Jiayan Zhou,Lingyao Li,Shan Chen,Jack Gallifant,Anye Shi,Xiang Li,Wenyue Hua,Mingyu Jin,Guang Chen,Yang Zhou,Zhao Li,Trisha Gupte,Ming-Li Chen,Zahra Azizi,Yongfeng Zhang,Themistocles L. Assimes,Xin Ma,Danielle S. Bitterman,Lin Lu,Lizhou Fan
Keywords: integrative learning environments, clinical decision-making simulations, enabling clinical decision-making, Simulated patient systems, Simulated patient
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 42 pages, 6 figures, 7 tables

Click to view abstract

Abstract:Simulated patient systems play a crucial role in modern medical education and research, providing safe, integrative learning environments and enabling clinical decision-making simulations. Large Language Models (LLM) could advance simulated patient systems by replicating medical conditions and patient-doctor interactions with high fidelity and low cost. However, ensuring the effectiveness and trustworthiness of these systems remains a challenge, as they require a large, diverse, and precise patient knowledgebase, along with a robust and stable knowledge diffusion to users. Here, we developed AIPatient, an advanced simulated patient system with AIPatient Knowledge Graph (AIPatient KG) as the input and the Reasoning Retrieval-Augmented Generation (Reasoning RAG) agentic workflow as the generation backbone. AIPatient KG samples data from Electronic Health Records (EHRs) in the Medical Information Mart for Intensive Care (MIMIC)-III database, producing a clinically diverse and relevant cohort of 1,495 patients with high knowledgebase validity (F1 0.89). Reasoning RAG leverages six LLM powered agents spanning tasks including retrieval, KG query generation, abstraction, checker, rewrite, and summarization. This agentic framework reaches an overall accuracy of 94.15% in EHR-based medical Question Answering (QA), outperforming benchmarks that use either no agent or only partial agent integration. Our system also presents high readability (median Flesch Reading Ease 77.23; median Flesch Kincaid Grade 5.6), robustness (ANOVA F-value 0.6126, p > 0.1), and stability (ANOVA F-value 0.782, p > 0.1). The promising performance of the AIPatient system highlights its potential to support a wide range of applications, including medical education, model evaluation, and system integration.

[AI-8] Soft Measures for Extracting Causal Collective Intelligence EMNLP2024

链接: https://arxiv.org/abs/2409.18911
作者: Maryam Berijanian,Spencer Dork,Kuldeep Singh,Michael Riley Millikan,Ashlin Riggs,Aadarsh Swaminathan,Sarah L. Gibbs,Scott E. Friedman,Nathan Brugnone
关键词-EN: addressing complex social, complex social systems, essential for addressing, addressing complex, complex social
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注: Camera-ready version accepted for publication in the EMNLP 2024 Workshop NLP4Science

点击查看摘要

Abstract:Understanding and modeling collective intelligence is essential for addressing complex social systems. Directed graphs called fuzzy cognitive maps (FCMs) offer a powerful tool for encoding causal mental models, but extracting high-integrity FCMs from text is challenging. This study presents an approach using large language models (LLMs) to automate FCM extraction. We introduce novel graph-based similarity measures and evaluate them by correlating their outputs with human judgments through the Elo rating system. Results show positive correlations with human evaluations, but even the best-performing measure exhibits limitations in capturing FCM nuances. Fine-tuning LLMs improves performance, but existing measures still fall short. This study highlights the need for soft similarity measures tailored to FCM extraction, advancing collective intelligence modeling with NLP.
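The Elo rating system mentioned above turns pairwise human preferences between candidate outputs into scalar scores that can be correlated with each similarity measure. A minimal sketch of the standard Elo update, assuming a conventional K-factor of 32 (the paper's exact configuration is not specified here):

```python
def elo_expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A is preferred, 0.5 for a tie, 0.0 if B is preferred.
    The total rating mass is conserved by construction.
    """
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

Repeatedly applying `elo_update` over many human comparisons yields a ranking whose agreement with each automated measure can then be assessed.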

[AI-9] Improving Visual Object Tracking through Visual Prompting

链接: https://arxiv.org/abs/2409.18901
作者: Shih-Fang Chen,Jun-Cheng Chen,I-Hong Jhuo,Yen-Yu Lin
关键词-EN: visual object tracking, generic visual object, visual prompt, visual, object tracking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注: Accepted and to appear in IEEE Transactions on Multimedia

点击查看摘要

Abstract:Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pretrained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method can suppress distracting objects and enhance the tracker.

[AI-10] Multi-Source Hard and Soft Information Fusion Approach for Accurate Cryptocurrency Price Movement Prediction

链接: https://arxiv.org/abs/2409.18895
作者: Saeed Mohammadi Dashtaki,Mehdi Hosseini Chagahi,Behzad Moshiri,Md. Jalil Piran
关键词-EN: cryptocurrency price trends, field is accurately, cryptocurrency price, information, price
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the most important challenges in the financial and cryptocurrency field is accurately predicting cryptocurrency price trends. Leveraging artificial intelligence (AI) is beneficial in addressing this challenge. Cryptocurrency markets, marked by substantial growth and volatility, attract investors and scholars keen on deciphering and forecasting cryptocurrency price movements. The vast and diverse array of data available for such predictions increases the complexity of the task. In our study, we introduce a novel approach termed hard and soft information fusion (HSIF) to enhance the accuracy of cryptocurrency price movement forecasts. The hard information component of our approach encompasses historical price records alongside technical indicators. Complementing this, the soft data component extracts from X (formerly Twitter), encompassing news headlines and tweets about the cryptocurrency. To use this data, we use the Bidirectional Encoder Representations from Transformers (BERT)-based sentiment analysis method, financial BERT (FinBERT), which performs best. Finally, our model feeds on the information set including processed hard and soft data. We employ the bidirectional long short-term memory (BiLSTM) model because processing information in both forward and backward directions can capture long-term dependencies in sequential information. Our empirical findings emphasize the superiority of the HSIF approach over models dependent on single-source data by testing on Bitcoin-related data. By fusing hard and soft information on Bitcoin dataset, our model has about 96.8% accuracy in predicting price movement. Incorporating information enables our model to grasp the influence of social sentiment on price fluctuations, thereby supplementing the technical analysis-based predictions derived from hard information.

[AI-11] Suicide Phenotyping from Clinical Notes in Safety-Net Psychiatric Hospital Using Multi-Label Classification with Pre-Trained Language Models

链接: https://arxiv.org/abs/2409.18878
作者: Zehan Li,Yan Hu,Scott Lane,Salih Selek,Lokesh Shahani,Rodrigo Machado-Vieira,Jair Soares,Hua Xu,Hongfang Liu,Ming Huang
关键词-EN: reducing operational burden, improving care quality, Accurate identification, high-acuity psychiatric settings, reducing operational
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: submitted to AMIA Informatics Summit 2025 as a conference paper

点击查看摘要

Abstract:Accurate identification and categorization of suicidal events can yield better suicide precautions, reducing operational burden, and improving care quality in high-acuity psychiatric settings. Pre-trained language models offer promise for identifying suicidality from unstructured clinical narratives. We evaluated the performance of four BERT-based models using two fine-tuning strategies (multiple single-label and single multi-label) for detecting coexisting suicidal events from 500 annotated psychiatric evaluation notes. The notes were labeled for suicidal ideation (SI), suicide attempts (SA), exposure to suicide (ES), and non-suicidal self-injury (NSSI). RoBERTa outperformed other models using binary relevance (acc=0.86, F1=0.78). MentalBERT (F1=0.74) also exceeded BioClinicalBERT (F1=0.72). RoBERTa fine-tuned with a single multi-label classifier further improved performance (acc=0.88, F1=0.81), highlighting that models pre-trained on domain-relevant data and the single multi-label classification strategy enhance efficiency and performance. Keywords: EHR-based Phenotyping; Natural Language Processing; Secondary Use of EHR Data; Suicide Classification; BERT-based Model; Psychiatry; Mental Health
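In the single multi-label setup described above, each note is encoded as a binary vector over the four labels (SI, SA, ES, NSSI) and predictions are scored with an F1 measure. A minimal sketch of a micro-averaged F1, assuming the standard definition rather than the paper's exact evaluation code:

```python
# Hypothetical label order for illustration, matching the note-level annotations.
LABELS = ["SI", "SA", "ES", "NSSI"]

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary label vectors (one vector per note)."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for ti, pi in zip(t, p):
            tp += ti * pi            # correct positive
            fp += (1 - ti) * pi      # spurious positive
            fn += ti * (1 - pi)      # missed positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Micro-averaging pools counts across all labels before computing precision and recall, which is the usual choice when labels such as ES are much rarer than SI.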

[AI-12] UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

链接: https://arxiv.org/abs/2409.18877
作者: Chuang Chen,Xiao Sun,Zhi Liu
关键词-EN: Visual emotion analysis, analysis holds significant, emotion analysis holds, holds significant research, emotion analysis
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to TIP

点击查看摘要

Abstract:Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at this https URL.

[AI-13] CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting

链接: https://arxiv.org/abs/2409.18874
作者: Josef Koumar,Karel Hynek,Tomáš Čejka,Pavel Šiška
关键词-EN: identifying malicious activities, Anomaly detection, malicious activities, Anomaly, crucial for maintaining
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Anomaly detection in network traffic is crucial for maintaining the security of computer networks and identifying malicious activities. Among the primary approaches to anomaly detection are forecasting-based methods. Nevertheless, extensive real-world network datasets for forecasting and anomaly detection techniques are missing, potentially causing performance overestimation of anomaly detection algorithms. This manuscript addresses this gap by introducing a dataset comprising time series data of network entities’ behavior, collected from the CESNET3 network. The dataset was created from 40 weeks of network traffic of 275 thousand active IP addresses. The ISP origin of the presented data ensures a high level of variability among network entities, which forms a unique and authentic challenge for forecasting and anomaly detection models. It provides valuable insights into the practical deployment of forecast-based anomaly detection approaches.
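Forecast-based anomaly detection, the family of methods this dataset targets, flags a point when it deviates too far from what a model predicted for it. A minimal sketch using a moving-average forecaster and a z-score test on the forecast residuals; the window size and threshold are illustrative assumptions, not values from the paper:

```python
from statistics import mean, stdev

def detect_anomalies(series, window=5, z_thresh=3.0):
    """Flag points whose forecast residual deviates more than z_thresh
    standard deviations from the residual history. The forecast is a
    simple moving average of the previous `window` points."""
    flags = []
    residuals = []  # history of (observed - forecast) errors
    for i in range(window, len(series)):
        forecast = mean(series[i - window:i])
        residual = series[i] - forecast
        if len(residuals) >= 2:
            mu = mean(residuals)
            sigma = max(stdev(residuals), 1e-9)  # guard against zero variance
            flags.append(abs(residual - mu) > z_thresh * sigma)
        else:
            flags.append(False)  # not enough history yet
        residuals.append(residual)
    return flags
```

Production systems replace the moving average with a stronger forecaster (e.g. an ARIMA or neural model), but the residual-thresholding structure stays the same.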

[AI-14] Individuation in Neural Models with and without Visual Grounding

链接: https://arxiv.org/abs/2409.18868
作者: Alexey Tikhonov,Lisa Bylinina,Ivan P. Yamshchikov
关键词-EN: FastText and SBERT, CLIP, SBERT, CLIP embeddings, individuation information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We show differences between a language-and-vision model CLIP and two text-only models - FastText and SBERT - when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.

[AI-15] Mitigating Selection Bias with Node Pruning and Auxiliary Options

链接: https://arxiv.org/abs/2409.18857
作者: Hyeong Kyu Choi,Weijie Xu,Chi Xue,Stephanie Eckman,Chandan K. Reddy
关键词-EN: Large language models, posing significant reliability, significant reliability concerns, Large language, show unwarranted preference
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often show unwarranted preference for certain choice options when responding to multiple-choice questions, posing significant reliability concerns in LLM-automated systems. To mitigate this selection bias problem, previous solutions utilized debiasing methods to adjust the model’s input and/or output. Our work, in contrast, investigates the model’s internal representation of the selection bias. Specifically, we introduce a novel debiasing approach, Bias Node Pruning (BNP), which eliminates the linear layer parameters that contribute to the bias. Furthermore, we present Auxiliary Option Injection (AOI), a simple yet effective input modification technique for debiasing, which is compatible even with black-box LLMs. To provide a more systematic evaluation of selection bias, we review existing metrics and introduce Choice Kullback-Leibler Divergence (CKLD), which addresses the insensitivity of the commonly used metrics to label imbalance. Experiments show that our methods are robust and adaptable across various datasets when applied to three LLMs.
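The proposed CKLD metric builds on the Kullback-Leibler divergence between distributions over answer choices. The paper's exact CKLD formula is not reproduced here; the sketch below shows only the underlying KL computation, comparing a hypothetical model choice distribution against a uniform gold distribution:

```python
from math import log

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: a model that over-selects option A on a balanced
# four-option benchmark.
observed = [0.40, 0.30, 0.20, 0.10]  # model's empirical choice frequencies
gold = [0.25, 0.25, 0.25, 0.25]      # gold-label distribution
```

A divergence of zero indicates no selection bias; larger values indicate a stronger skew toward particular options.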

[AI-16] LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis

链接: https://arxiv.org/abs/2409.18812
作者: Hamed Babaei Giglou,Jennifer D’Souza,Sören Auer
关键词-EN: Large Language Models, Language Models, Large Language, capabilities of Large, generating high-quality scientific
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
*备注: 12 pages, 3 figures, Accepted to JCDL 2024 Research Track

点击查看摘要

Abstract:In response to the growing complexity and volume of scientific literature, this paper introduces the LLMs4Synthesis framework, designed to enhance the capabilities of Large Language Models (LLMs) in generating high-quality scientific syntheses. This framework addresses the need for rapid, coherent, and contextually rich integration of scientific insights, leveraging both open-source and proprietary LLMs. It also examines the effectiveness of LLMs in evaluating the integrity and reliability of these syntheses, alleviating inadequacies in current quantitative metrics. Our study contributes to this field by developing a novel methodology for processing scientific papers, defining new synthesis types, and establishing nine detailed quality criteria for evaluating syntheses. The integration of LLMs with reinforcement learning and AI feedback is proposed to optimize synthesis quality, ensuring alignment with established criteria. The LLMs4Synthesis framework and its components are made available, promising to enhance both the generation and evaluation processes in scientific research synthesis.

[AI-17] LLM With Tools: A Survey

链接: https://arxiv.org/abs/2409.18807
作者: Zhuocheng Shen
关键词-EN: large language models, language models presents, augmenting large language, complex tasks, language models
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:The integration of tools in augmenting large language models presents a novel approach toward enhancing the efficiency and accuracy of these models in handling specific, complex tasks. This paper delves into the methodology, challenges, and developments in the realm of teaching LLMs to use external tools, thereby pushing the boundaries of their capabilities beyond pre-existing knowledge bases. We introduce a standardized paradigm for tool integration guided by a series of functions that map user instructions to actionable plans and their execution, emphasizing the significance of understanding user intent, tool selection, and dynamic plan adjustment. Our exploration reveals the various challenges encountered, such as tool invocation timing, selection accuracy, and the need for robust reasoning processes. In addressing these challenges, we investigate techniques within the context of fine-tuning and in-context learning paradigms, highlighting innovative approaches to ensure diversity, augment datasets, and improve this http URL, we investigate a perspective on enabling LLMs to not only utilize but also autonomously create tools, which may redefine their role from mere tool users to tool creators. Finally, we reproduced Chameleon’s results on ScienceQA and analyzed the code structure.

[AI-18] Esports Debut as a Medal Event at 2023 Asian Games: Exploring Public Perceptions with BERTopic and GPT-4 Topic Fine-Tuning

链接: https://arxiv.org/abs/2409.18798
作者: Tyreal Yizhou Qian,Bo Yu,Weizhe Li,Chenglong Xu
关键词-EN: Asian Games, BERTopic modeling analysis, LLM-enhanced BERTopic modeling, modeling analysis, LLM-enhanced BERTopic
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study examined the public opinions of esports at the 2023 Asian Games and value co-creation during the event using an LLM-enhanced BERTopic modeling analysis. We identified five major themes representing public perceptions, as well as how major stakeholders co-created value within and beyond the esports ecosystem. Key findings highlighted the strategic use of social media marketing to influence public opinion and promote esports events and brands, emphasizing the importance of event logistics and infrastructure. Additionally, the study revealed the co-creation value contributed by stakeholders outside the traditional esports ecosystem, particularly in promoting national representation and performance. Our findings supported the ongoing efforts to legitimize esports as a sport, noting that mainstream recognition remains a challenge. The inclusion of esports as a medal event showcased broader acceptance and helped mitigate negative public perceptions. Moreover, contributions from non-traditional stakeholders underscored the value of cross-subcultural collaborations in esports.

[AI-19] Supervised Learning Model for Key Frame Identification from Cow Teat Videos

链接: https://arxiv.org/abs/2409.18797
作者: Minghao Wang,Pinxue Lin
关键词-EN: cow teat, mastitis risk assessment, proposes a method, method for improving, cow
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This paper proposes a method for improving the accuracy of mastitis risk assessment in cows using neural networks and video analysis. Mastitis, an infection of the udder tissue, is a critical health problem for cows and can be detected by examining the cow’s teat. Traditionally, veterinarians assess the health of a cow’s teat during the milking process, but this process is limited in time and can weaken the accuracy of the assessment. In commercial farms, cows are recorded by cameras when they are milked in the milking parlor. This paper uses a neural network to identify key frames in the recorded video where the cow’s udder appears intact. These key frames allow veterinarians to have more flexible time to perform health assessments on the teat, increasing their efficiency and accuracy. However, there are challenges in using cow teat video for mastitis risk assessment, such as complex environments, changing cow positions and postures, and difficulty in identifying the udder from the video. To address these challenges, a fusion distance and an ensemble model are proposed to improve the performance (F-score) of identifying key frames from cow teat videos. The results show that these two approaches improve performance compared to using a single distance measure or model.

[AI-20] Hierarchical Federated ADMM

链接: https://arxiv.org/abs/2409.18796
作者: Seyed Mohammad Azimi-Abarghouyi,Nicola Bastianello,Karl H. Johansson,Viktoria Fodor
关键词-EN: alternating direction method, descent-based hierarchical federated, hierarchical federated learning, gradient descent-based hierarchical, widely-used gradient descent-based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In this paper, we depart from the widely-used gradient descent-based hierarchical federated learning (FL) algorithms to develop a novel hierarchical FL framework based on the alternating direction method of multipliers (ADMM). Within this framework, we propose two novel FL algorithms, which both use ADMM in the top layer: one that employs ADMM in the lower layer and another that uses the conventional gradient descent-based approach. The proposed framework enhances privacy, and experiments demonstrate the superiority of the proposed algorithms compared to the conventional algorithms in terms of learning convergence and accuracy. Additionally, gradient descent on the lower layer performs well even if the number of local steps is very limited, while ADMM on both layers leads to better performance otherwise.
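For intuition on the ADMM building block, the sketch below runs single-level consensus ADMM on scalar quadratic objectives: each agent i minimises (x_i - a_i)^2 subject to the consensus constraint x_i = z, and the consensus variable converges to the data mean. This is a deliberately simplified illustration, not the paper's two-layer hierarchical algorithm:

```python
def consensus_admm(data, rho=1.0, iters=100):
    """Consensus ADMM for minimising sum_i (x_i - a_i)^2 s.t. x_i = z.

    x-update: closed-form minimiser of (x - a_i)^2 + (rho/2)(x - z + u_i)^2
    z-update: average of x_i + u_i
    u-update: dual ascent on the consensus residual
    """
    n = len(data)
    x = [0.0] * n   # local primal variables
    u = [0.0] * n   # scaled dual variables
    z = 0.0         # global consensus variable
    for _ in range(iters):
        x = [(2 * a + rho * (z - ui)) / (2 + rho) for a, ui in zip(data, u)]
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z
```

In the federated setting, each x-update would be a client's local computation and the z-update the server-side aggregation; the hierarchical variants in the paper stack such layers.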

[AI-21] A Survey on the Honesty of Large Language Models

链接: https://arxiv.org/abs/2409.18786
作者: Siheng Li,Cheng Yang,Taiqiang Wu,Chufan Shi,Yuji Zhang,Xinyu Zhu,Zesen Cheng,Deng Cai,Mo Yu,Lemao Liu,Jie Zhou,Yujiu Yang,Ngai Wong,Xixin Wu,Wai Lam
关键词-EN: large language models, aligning large language, language models, fundamental principle, principle for aligning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don’t know and be able to faithfully express their knowledge. Despite promising, current LLMs still exhibit significant dishonest behaviors, such as confidently presenting wrong answers or failing to express what they know. In addition, research on the honesty of LLMs also faces challenges, including varying definitions of honesty, difficulties in distinguishing between known and unknown knowledge, and a lack of comprehensive understanding of related research. To address these issues, we provide a survey on the honesty of LLMs, covering its clarification, evaluation approaches, and strategies for improvement. Moreover, we offer insights for future research, aiming to inspire further exploration in this important area.

[AI-22] HardCore Generation: Generating Hard UNSAT Problems for Data Augmentation

链接: https://arxiv.org/abs/2409.18778
作者: Joseph Cotnareanu,Zhanguang Zhang,Hui-Ling Zhen,Yingxue Zhang,Mark Coates
关键词-EN: boolean equation, determining the satisfiability, SAT problems, SAT, deep learning methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficiently determining the satisfiability of a boolean equation – known as the SAT problem for brevity – is crucial in various industrial problems. Recently, the advent of deep learning methods has introduced significant potential for enhancing SAT solving. However, a major barrier to the advancement of this field has been the scarcity of large, realistic datasets. The majority of current public datasets are either randomly generated or extremely limited, containing only a few examples from unrelated problem families. These datasets are inadequate for meaningful training of deep learning methods. In light of this, researchers have started exploring generative techniques to create data that more accurately reflect SAT problems encountered in practical situations. These methods have so far suffered from either the inability to produce challenging SAT problems or time-scalability obstacles. In this paper we address both by identifying and manipulating the key contributors to a problem’s “hardness”, known as cores. Although some previous work has addressed cores, the time costs are unacceptably high due to the expense of traditional heuristic core detection techniques. We introduce a fast core detection procedure that uses a graph neural network. Our empirical results demonstrate that we can efficiently generate problems that remain hard to solve and retain key attributes of the original example problems. We show via experiment that the generated synthetic SAT problems can be used in a data augmentation setting to provide improved prediction of solver runtimes.

[AI-23] State-of-the-Art Periorbital Distance Prediction and Disease Classification Using Periorbital Features

链接: https://arxiv.org/abs/2409.18769
作者: George R. Nahass,Ghasem Yazdanpanah,Madison Cheung,Alex Palacios,Jeffery Peterson,Kevin Heinze,Sasha Hubschman,Chad A. Purnell,Pete Setabutr,Ann Q. Tran,Darvin Yi
关键词-EN: lids hold valuable, hold valuable information, medical intervention, lids hold, hold valuable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Periorbital distances and features around the eyes and lids hold valuable information for disease quantification and monitoring of surgical and medical intervention. These distances are commonly measured manually, a process that is both subjective and highly time-consuming. Here, we set out to develop three deep-learning methods for segmentation and periorbital distance prediction, and also evaluate the utility of periorbital distances for disease classification. The MAE of our deep learning predicted distances was less than or very close to the error observed between trained human annotators. We compared our models to the current state-of-the-art (SOTA) method for periorbital distance prediction and found that our methods outperformed SOTA on all of our datasets on all but one periorbital measurement. We also show that robust segmentation can be achieved on diseased eyes using models trained on open-source, healthy eyes, and that periorbital distances can be used as high-quality features in downstream classification models. Leveraging segmentation networks as intermediary steps in classification has broad implications for increasing the generalizability of classification models in ophthalmic plastic and craniofacial surgery by avoiding the out-of-distribution problem observed in traditional convolutional neural networks.

[AI-24] Learning from Demonstration with Implicit Nonlinear Dynamics Models

链接: https://arxiv.org/abs/2409.18768
作者: Peter David Fagan,Subramanian Ramamoorthy
关键词-EN: involving complex motions, solve tasks involving, tasks involving complex, Learning from Demonstration, paradigm for training
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 21 pages, 9 figures

点击查看摘要

Abstract:Learning from Demonstration (LfD) is a useful paradigm for training policies that solve tasks involving complex motions. In practice, the successful application of LfD requires overcoming error accumulation during policy execution, i.e. the problem of drift due to errors compounding over time and the consequent out-of-distribution behaviours. Existing works seek to address this problem through scaling data collection, correcting policy errors with a human-in-the-loop, temporally ensembling policy predictions or through learning the parameters of a dynamical system model. In this work, we propose and validate an alternative approach to overcoming this issue. Inspired by reservoir computing, we develop a novel neural network layer that includes a fixed nonlinear dynamical system with tunable dynamical properties. We validate the efficacy of our neural network layer on the task of reproducing human handwriting motions using the LASA Human Handwriting Dataset. Through empirical experiments we demonstrate that incorporating our layer into existing neural network architectures addresses the issue of compounding errors in LfD. Furthermore, we perform a comparative evaluation against existing approaches including a temporal ensemble of policy predictions and an Echo State Networks (ESNs) implementation. We find that our approach yields greater policy precision and robustness on the handwriting task while also generalising to multiple dynamics regimes and maintaining competitive latency scores.
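The core idea borrowed from reservoir computing is a layer whose recurrent dynamics are fixed at initialisation rather than trained. A minimal Echo State Network-style sketch; the weight scaling and sizes are illustrative assumptions, and the paper's layer additionally exposes tunable dynamical properties:

```python
import math
import random

class ReservoirLayer:
    """A fixed nonlinear dynamical system: input and recurrent weights are
    drawn once at construction and never updated by training."""

    def __init__(self, n_in, n_res, scale=0.5, seed=0):
        rng = random.Random(seed)
        self.w_in = [[rng.uniform(-1, 1) for _ in range(n_in)]
                     for _ in range(n_res)]
        # Small recurrent weights keep the dynamics contractive (a crude
        # stand-in for proper spectral-radius scaling).
        self.w = [[rng.uniform(-1, 1) * scale / n_res for _ in range(n_res)]
                  for _ in range(n_res)]
        self.state = [0.0] * n_res

    def step(self, u):
        """Advance the reservoir one time step on input vector u."""
        new = []
        for i in range(len(self.state)):
            s = sum(w * x for w, x in zip(self.w_in[i], u))
            s += sum(w * x for w, x in zip(self.w[i], self.state))
            new.append(math.tanh(s))
        self.state = new
        return new
```

Only a readout trained on top of `state` would be learned; the fixed dynamics supply the temporal memory that helps counter compounding errors.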

[AI-25] OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph

链接: https://arxiv.org/abs/2409.18743
作者: Yujie Tang,Meiling Wang,Yinan Deng,Zibo Zheng,Jiagui Zhong,Yufeng Yue
关键词-EN: unfixed positions, positions and multiple, multiple instances, update scene changes, update scene
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project website: this https URL

点击查看摘要

Abstract:In everyday life, frequently used objects like cups often have unfixed positions and multiple instances within the same category, and their carriers frequently change as well. As a result, it becomes challenging for a robot to efficiently navigate to a specific instance. To tackle this challenge, the robot must capture and update scene changes and plans continuously. However, current object navigation approaches primarily focus on semantic-level and lack the ability to dynamically update scene representation. This paper captures the relationships between frequently used objects and their static carriers. It constructs an open-vocabulary Carrier-Relationship Scene Graph (CRSG) and updates the carrying status during robot navigation to reflect the dynamic changes of the scene. Based on the CRSG, we further propose an instance navigation strategy that models the navigation process as a Markov Decision Process. At each step, decisions are informed by Large Language Model’s commonsense knowledge and visual-language feature similarity. We designed a series of long-sequence navigation tasks for frequently used everyday items in the Habitat simulator. The results demonstrate that by updating the CRSG, the robot can efficiently navigate to moved targets. Additionally, we deployed our algorithm on a real robot and validated its practical effectiveness.

[AI-26] MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

链接: https://arxiv.org/abs/2409.18737
作者: Jingyu Song,Xudong Chen,Liupei Lu,Jie Li,Katherine A. Skinner
关键词-EN: autonomous driving systems, maps provide environmental, provide environmental information, safe planning, provide environmental
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:High-definition (HD) maps provide environmental information for autonomous driving systems and are essential for safe planning. While existing methods with single-frame input achieve impressive performance for online vectorized HD map construction, they still struggle with complex scenarios and occlusions. We propose MemFusionMap, a novel temporal fusion model with enhanced temporal reasoning capabilities for online HD map construction. Specifically, we contribute a working memory fusion module that improves the model’s memory capacity to reason across history frames. We also design a novel temporal overlap heatmap to explicitly inform the model about the temporal overlap information and vehicle trajectory in the Bird’s Eye View space. By integrating these two designs, MemFusionMap significantly outperforms existing methods while also maintaining a versatile design for scalability. We conduct extensive evaluation on open-source benchmarks and demonstrate a maximum improvement of 5.4% in mAP over state-of-the-art methods. The code for MemFusionMap will be made open-source upon publication of this paper.

[AI-27] Autoregressive Policy Optimization for Constrained Allocation Tasks NEURIPS2024

链接: https://arxiv.org/abs/2409.18735
作者: David Winkel,Niklas Strauß,Maximilian Bernhard,Zongyue Li,Thomas Seidl,Matthias Schubert
关键词-EN: Allocation tasks represent, Allocation tasks, represent a class, class of problems, limited amount
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Allocation tasks represent a class of problems where a limited amount of resources must be allocated to a set of entities at each time step. Prominent examples of this task include portfolio optimization or distributing computational workloads across servers. Allocation tasks are typically bound by linear constraints describing practical requirements that have to be strictly fulfilled at all times. In portfolio optimization, for example, investors may be obligated to allocate less than 30% of the funds into a certain industrial sector in any investment period. Such constraints restrict the action space of allowed allocations in intricate ways, which makes learning a policy that avoids constraint violations difficult. In this paper, we propose a new method for constrained allocation tasks based on an autoregressive process to sequentially sample allocations for each entity. In addition, we introduce a novel de-biasing mechanism to counter the initial bias caused by sequential sampling. We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark. Our code is available at: this https URL
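The sequential sampling idea in the abstract can be illustrated with a toy sketch: each entity's share is drawn from an interval tightened so that the remaining entities can still absorb the leftover budget without breaking the cap. The uniform proposal and all names below are illustrative, not the paper's learned autoregressive policy.

```python
import random

def sample_allocation(n_entities, cap, rng=None):
    """Autoregressively sample an allocation summing to 1 with each share <= cap.

    Toy sketch of feasibility-aware sequential sampling: at step i the feasible
    interval [lo, hi] guarantees the remaining entities can absorb the leftover.
    """
    assert n_entities * cap >= 1.0, "cap too tight to allocate the full budget"
    rng = rng or random.Random(0)
    alloc, remaining = [], 1.0
    for i in range(n_entities):
        left = n_entities - i - 1              # entities still to be allocated
        lo = max(0.0, remaining - cap * left)  # leave room the rest can absorb
        hi = min(cap, remaining)
        share = remaining if left == 0 else rng.uniform(lo, hi)
        alloc.append(share)
        remaining -= share
    return alloc

alloc = sample_allocation(5, 0.3)
```

Because the lower bound is tightened at every step, the constraint (e.g. the 30% sector cap from the portfolio example) holds by construction rather than by penalty.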

[AI-28] Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity

链接: https://arxiv.org/abs/2409.18708
作者: Sergey Berezin,Reza Farahbakhsh,Noel Crespi
关键词-EN: interpret ASCII art, ASCII art fonts, ASCII art, custom ASCII art, interpret ASCII
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We introduce a novel family of adversarial attacks that exploit the inability of language models to interpret ASCII art. To evaluate these attacks, we propose the ToxASCII benchmark and develop two custom ASCII art fonts: one leveraging special tokens and another using text-filled letter shapes. Our attacks achieve a perfect 1.0 Attack Success Rate across ten models, including OpenAI’s o1-preview and LLaMA 3.1. Warning: this paper contains examples of toxic language used for research purposes.
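The masking idea can be sketched with a hypothetical mini-font: once a word is rendered as shapes, a tokenizer no longer sees the word's characters at all. The two-letter font below is invented for illustration and is not one of the paper's ToxASCII fonts.

```python
# Hypothetical 3-row "font" covering just two letters, for illustration only.
FONT = {
    "H": ["#  #", "####", "#  #"],
    "I": ["###", " # ", "###"],
}

def to_ascii_art(word):
    """Render WORD as ASCII art, one glyph row per output line."""
    return "\n".join("  ".join(FONT[ch][row] for ch in word) for row in range(3))

art = to_ascii_art("HI")
```

Note that the rendered output contains none of the original letters, which is exactly what defeats token-level toxicity filters.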

[AI-29] Semantic Model Component Implementation for Model-driven Semantic Communications

链接: https://arxiv.org/abs/2409.18704
作者: Haotai Liang,Mengran Shi,Chen Dong,Xiaodong Xu,Long Liu,Hao Chen
关键词-EN: model-driven semantic communication, semantic component model, model, semantic model component, edge node
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The key feature of model-driven semantic communication is the propagation of the model. The semantic model component (SMC) is designed to drive the intelligent model to transmit in the physical channel, allowing the intelligence to flow through the networks. According to the characteristics of neural networks with common and individual model parameters, this paper designs the cross-source-domain and cross-task semantic component model. Considering that the basic model is deployed on the edge node, the large server node updates the edge node by transmitting only the semantic component model to the edge node so that the edge node can handle different sources and different tasks. In addition, this paper also discusses how channel noise affects the performance of the model and proposes methods of injection noise and regularization to improve the noise resistance of the model. Experiments show that SMCs use smaller model parameters to achieve cross-source, cross-task functionality while maintaining performance and improving the model’s tolerance to noise. Finally, a component transfer-based unmanned vehicle tracking prototype was implemented to verify the feasibility of model components in practical applications.

[AI-30] KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model

链接: https://arxiv.org/abs/2409.18695
作者: Weichen Dai,Yezeng Chen,Zijie Dai,Zhijie Huang,Yubo Liu,Yixuan Pan,Baiyang Song,Chengli Zhong,Xinhe Li,Zeyu Wang,Zhuoying Feng,Yi Zhou
关键词-EN: advance scientific research, Artificial intelligence, immense potential, intelligence is gradually, gradually demonstrating
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Artificial intelligence is gradually demonstrating its immense potential, and increasing attention is being given to how AI can be harnessed to advance scientific research. In this vision paper, we present our perspectives on how AI can better assist scientific inquiry and explore corresponding technical approach. We have proposed and open-sourced a large model of our KALE-LM model series, Llama3-KALE-LM-Chem-8B, which has achieved outstanding performance in tasks related to the field of chemistry. We hope that our work serves as a strong starting point, helping to realize more intelligent AI and promoting the advancement of human science and technology, as well as societal development.

[AI-31] Learning from Pattern Completion: Self-supervised Controllable Generation

链接: https://arxiv.org/abs/2409.18694
作者: Zhiqiang Chen,Guofan Fan,Jinying Gao,Lei Ma,Bo Lei,Tiejun Huang,Shan Yu
关键词-EN: similar visual scene, real-world visual objects, visual scene, visual objects, visual attributes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The human brain exhibits a strong ability to spontaneously associate different visual attributes of the same or similar visual scene, such as associating sketches and graffiti with real-world visual objects, usually without supervising information. In contrast, in the field of artificial intelligence, controllable generation methods like ControlNet heavily rely on annotated training datasets such as depth maps, semantic segmentation maps, and poses, which limits the method’s scalability. Inspired by the neural mechanisms that may contribute to the brain’s associative power, specifically the cortical modularization and hippocampal pattern completion, here we propose a self-supervised controllable generation (SCG) framework. Firstly, we introduce an equivariant constraint to promote inter-module independence and intra-module correlation in a modular autoencoder network, thereby achieving functional specialization. Subsequently, based on these specialized modules, we employ a self-supervised pattern completion approach for controllable generation training. Experimental results demonstrate that the proposed modular autoencoder effectively achieves functional specialization, including the modular processing of color, brightness, and edge detection, and exhibits brain-like features including orientation selectivity, color antagonism, and center-surround receptive fields. Through self-supervised training, associative generation capabilities spontaneously emerge in SCG, demonstrating excellent generalization ability to various tasks such as associative generation on painting, sketches, and ancient graffiti. Compared to the previous representative method ControlNet, our proposed approach not only demonstrates superior robustness in more challenging high-noise scenarios but also possesses more promising scalability potential due to its self-supervised manner.

[AI-32] Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models EMNLP24

链接: https://arxiv.org/abs/2409.18680
作者: Yiming Chen,Xianghu Yue,Xiaoxue Gao,Chen Zhang,Luis Fernando D’Haro,Robby T. Tan,Haizhou Li
关键词-EN: unified model, explored recently, recently for tackling, proposed MALLM, audio
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: EMNLP24 Findings

点击查看摘要

Abstract:Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggling to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.

[AI-33] Toward Universal and Interpretable World Models for Open-ended Learning Agents

链接: https://arxiv.org/abs/2409.18676
作者: Lancelot Da Costa
关键词-EN: generative world models, supports open-ended learning, introduce a generic, world models, class of generative
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neurons and Cognition (q-bio.NC)
*备注: 4 pages including appendix, 6 including appendix and references; 2 figures

点击查看摘要

Abstract:We introduce a generic, compositional and interpretable class of generative world models that supports open-ended learning agents. This is a sparse class of Bayesian networks capable of approximating a broad range of stochastic processes, which provide agents with the ability to learn world models in a manner that may be both interpretable and computationally scalable. This approach integrating Bayesian structure learning and intrinsically motivated (model-based) planning enables agents to actively develop and refine their world models, which may lead to open-ended learning and more robust, adaptive behavior.

[AI-34] Exploiting Motion Prior for Accurate Pose Estimation of Dashboard Cameras

链接: https://arxiv.org/abs/2409.18673
作者: Yipeng Lu,Yifan Zhao,Haiping Wang,Zhiwei Ruan,Yuan Liu,Zhen Dong,Bisheng Yang
关键词-EN: driving videos daily, including driving map, driving map production, valuable potential data, potential data source
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dashboard cameras (dashcams) record millions of driving videos daily, offering a valuable potential data source for various applications, including driving map production and updates. A necessary step for utilizing these dashcam data involves the estimation of camera poses. However, the low-quality images captured by dashcams, characterized by motion blurs and dynamic objects, pose challenges for existing image-matching methods in accurately estimating camera poses. In this study, we propose a precise pose estimation method for dashcam images, leveraging the inherent camera motion prior. Typically, image sequences captured by dash cameras exhibit pronounced motion priors, such as forward movement or lateral turns, which serve as essential cues for correspondence estimation. Building upon this observation, we devise a pose regression module aimed at learning the camera motion prior, subsequently integrating these priors into both the correspondence and pose estimation processes. The experiment shows that, on a real dashcam dataset, our method is 22% better than the baseline for pose estimation in AUC5°, and it can estimate poses for 19% more images with less reprojection error in Structure from Motion (SfM).

[AI-35] Not the Silver Bullet: LLM-enhanced Programming Error Messages are Ineffective in Practice

链接: https://arxiv.org/abs/2409.18661
作者: Eddie Antonio Santos,Brett A. Becker
关键词-EN: large language models, computing education community, error messages, error, language models
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: To appear in the proceedings of the 2024 UK and Ireland Computing Education Research conference (UKICER '24)

点击查看摘要

Abstract:The sudden emergence of large language models (LLMs) such as ChatGPT has had a disruptive impact throughout the computing education community. LLMs have been shown to excel at producing correct code to CS1 and CS2 problems, and can even act as friendly assistants to students learning how to code. Recent work shows that LLMs demonstrate unequivocally superior results in being able to explain and resolve compiler error messages – for decades, one of the most frustrating parts of learning how to code. However, LLM-generated error message explanations have only been assessed by expert programmers in artificial conditions. This work sought to understand how novice programmers resolve programming error messages (PEMs) in a more realistic scenario. We ran a within-subjects study with n = 106 participants in which students were tasked to fix six buggy C programs. For each program, participants were randomly assigned to fix the problem using either a stock compiler error message, an expert-handwritten error message, or an error message explanation generated by GPT-4. Despite promising evidence on synthetic benchmarks, we found that GPT-4 generated error messages outperformed conventional compiler error messages in only 1 of the 6 tasks, measured by students’ time-to-fix each problem. Handwritten explanations still outperform LLM and conventional error messages, both on objective and subjective measures.

[AI-36] When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

链接: https://arxiv.org/abs/2409.18653
作者: Yuli Zhou,Guolei Sun,Yawei Li,Luca Benini,Ender Konukoglu
关键词-EN: investigates the application, VCOS, camouflaged object segmentation, Segment, camouflaged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Technical report

点击查看摘要

Abstract:This study investigates the application and performance of the Segment Anything Model 2 (SAM2) in the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects that blend seamlessly in the surroundings for videos, due to similar colors and textures, poor light conditions, etc. Compared to the objects in normal scenes, camouflaged objects are much more difficult to detect. SAM2, a video foundation model, has shown potential in various tasks. But its effectiveness in dynamic camouflaged scenarios remains under-explored. This study presents a comprehensive study on SAM2’s ability in VCOS. First, we assess SAM2’s performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we specifically adapt SAM2 by fine-tuning it on the video camouflaged dataset. Our comprehensive experiments demonstrate that SAM2 has excellent zero-shot ability of detecting camouflaged objects in videos. We also show that this ability could be further improved by specifically adjusting SAM2’s parameters for VCOS. The code will be available at this https URL

[AI-37] Enhanced Convolution Neural Network with Optimized Pooling and Hyperparameter Tuning for Network Intrusion Detection

链接: https://arxiv.org/abs/2409.18642
作者: Ayush Kumar Sharma,Sourav Patel,Supriya Bharat Wakchaure,Abirami S
关键词-EN: Denial of Service, Intrusion Detection Systems, protecting computer networks, Detection Systems, Network Intrusion Detection
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 Pages , 2 figures , 4 Tables , Conference paper

点击查看摘要

Abstract:Network Intrusion Detection Systems (NIDS) are essential for protecting computer networks from malicious activities, including Denial of Service (DoS), Probing, User-to-Root (U2R), and Remote-to-Local (R2L) attacks. Without effective NIDS, networks are vulnerable to significant security breaches and data loss. Machine learning techniques provide a promising approach to enhance NIDS by automating threat detection and improving accuracy. In this research, we propose an Enhanced Convolutional Neural Network (EnCNN) for NIDS and evaluate its performance using the KDDCUP’99 dataset. Our methodology includes comprehensive data preprocessing, exploratory data analysis (EDA), and feature engineering. We compare EnCNN with various machine learning algorithms, including Logistic Regression, Decision Trees, Support Vector Machines (SVM), and ensemble methods like Random Forest, AdaBoost, and Voting Ensemble. The results show that EnCNN significantly improves detection accuracy, with a notable 10% increase over state-of-the-art approaches. This demonstrates the effectiveness of EnCNN in real-time network intrusion detection, offering a robust solution for identifying and mitigating security threats, and enhancing overall network resilience.

[AI-38] Reducing Diversity to Generate Hierarchical Archetypes

链接: https://arxiv.org/abs/2409.18633
作者: Alfredo Ibias,Hector Antona,Guillem Ramirez-Miranda,Enric Guinovart,Eduard Alarcon
关键词-EN: Artificial Intelligence field, Intelligence field seldom, fundamental building piece, Artificial Intelligence, field seldom address
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Artificial Intelligence field seldom addresses the development of a fundamental building piece: a framework, methodology or algorithm to automatically build hierarchies of abstractions. This is a key requirement in order to build intelligent behaviour, as recent neuroscience studies clearly expose. In this paper we present a primitive-based framework to automatically generate hierarchies of constructive archetypes, as a theory of how to generate hierarchies of abstractions. We assume the existence of a primitive with very specific characteristics, and we develop our framework over it. We prove the effectiveness of our framework through mathematical definitions and proofs. Finally, we give a few insights about potential uses of our framework and the expected results.

[AI-39] Entropy concentration and learning: a statistical mechanics primer

链接: https://arxiv.org/abs/2409.18630
作者: Akshay Balsubramani
关键词-EN:
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

[AI-40] Refutation of Spectral Graph Theory Conjectures with Search Algorithms

链接: https://arxiv.org/abs/2409.18626
作者: Milo Roucairol,Tristan Cazenave
关键词-EN: automatic refutation, spectral graph theory, deep reinforcement learning, graph theory conjectures, graph theory
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We are interested in the automatic refutation of spectral graph theory conjectures. Most existing works address this problem either with the exhaustive generation of graphs with a limited size or with deep reinforcement learning. Exhaustive generation is limited by the size of the generated graphs and deep reinforcement learning takes hours or days to refute a conjecture. We propose to use search algorithms to address these shortcomings to find potentially large counter-examples to spectral graph theory conjectures in seconds. We apply a wide range of search algorithms to a selection of conjectures from Graffiti. Out of 13 already refuted conjectures from Graffiti, our algorithms are able to refute 12 in seconds. We also refute conjecture 197 from Graffiti which was open until now.

[AI-41] Unsupervised Cognition

链接: https://arxiv.org/abs/2409.18624
作者: Alfredo Ibias,Hector Antona,Guillem Ramirez-Miranda,Enric Guinovart,Eduard Alarcon
关键词-EN: Unsupervised learning methods, Unsupervised learning, soft inspiration, learning methods, Unsupervised
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised learning methods have a soft inspiration in cognition models. To this day, the most successful unsupervised learning methods revolve around clustering samples in a mathematical space. In this paper we propose a state-of-the-art primitive-based unsupervised learning approach for decision-making inspired by novel cognition models. This representation-centric approach models the input space constructively as a distributed hierarchical structure in an input-agnostic way. We compared our approach with current state-of-the-art in unsupervised learning classification, and with current state-of-the-art in cancer type classification. We show how our proposal outperforms previous state-of-the-art. We also evaluate some cognition-like properties of our proposal where it not only outperforms the compared algorithms (even supervised learning ones), but it also shows a different, more cognition-like, behaviour.

[AI-42] Model-based Preference Optimization in Abstractive Summarization without Human Feedback EMNLP2024

链接: https://arxiv.org/abs/2409.18618
作者: Jaepill Choi,Kyubyung Chae,Jiwoo Song,Yohan Jo,Taesup Kim
关键词-EN: accurate summaries arises, Large Language Models, challenge of producing, producing concise, concise and accurate
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted by EMNLP 2024

点击查看摘要

Abstract:In abstractive summarization, the challenge of producing concise and accurate summaries arises from the vast amount of information contained in the source document. Consequently, although Large Language Models (LLMs) can generate fluent text, they often introduce inaccuracies by hallucinating content not found in the original source. While supervised fine-tuning methods that maximize likelihood contribute to this issue, they do not consistently enhance the faithfulness of the summaries. Preference-based optimization methods, such as Direct Preference Optimization (DPO), can further refine the model to align with human preferences. However, these methods still heavily depend on costly human feedback. In this work, we introduce a novel and straightforward approach called Model-based Preference Optimization (MPO) to fine-tune LLMs for improved summarization abilities without any human feedback. By leveraging the model’s inherent summarization capabilities, we create a preference dataset that is fully generated by the model using different decoding strategies. Our experiments on standard summarization datasets and various metrics demonstrate that our proposed MPO significantly enhances the quality of generated summaries without relying on human feedback.
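The core of MPO, building preference pairs without human labels, can be sketched as follows. The length-based scorer below is a hypothetical stand-in for the model-side ranking the paper derives from different decoding strategies; it only illustrates the pairing mechanism.

```python
def build_preference_pairs(candidates, score):
    """Rank model-generated candidates with a model-side score (no human
    feedback) and pair the best candidate against each weaker one."""
    ranked = sorted(candidates, key=score, reverse=True)
    return [(ranked[0], worse) for worse in ranked[1:]]

# Hypothetical stand-in scorer: prefer the shortest (most concise) candidate.
summaries = [
    "a long rambling restatement of the entire article",
    "a concise summary of the article",
    "concise summary",
]
pairs = build_preference_pairs(summaries, score=lambda s: -len(s))
```

The resulting (chosen, rejected) pairs can then feed a DPO-style objective, with the preference signal coming entirely from the model itself.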

[AI-43] TemporalPaD: a reinforcement-learning framework for temporal feature representation and dimension reduction

链接: https://arxiv.org/abs/2409.18597
作者: Xuechen Mu,Zhenyu Huang,Kewei Li,Haotian Zhang,Xiuli Wang,Yusi Fan,Kai Zhang,Fengfeng Zhou
关键词-EN: Recent advancements, Representation Module, predictive modeling, Policy Module, Module
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Recent advancements in feature representation and dimension reduction have highlighted their crucial role in enhancing the efficacy of predictive modeling. This work introduces TemporalPaD, a novel end-to-end deep learning framework designed for temporal pattern datasets. TemporalPaD integrates reinforcement learning (RL) with neural networks to achieve concurrent feature representation and feature reduction. The framework consists of three cooperative modules: a Policy Module, a Representation Module, and a Classification Module, structured based on the Actor-Critic (AC) framework. The Policy Module, responsible for dimensionality reduction through RL, functions as the actor, while the Representation Module for feature extraction and the Classification Module collectively serve as the critic. We comprehensively evaluate TemporalPaD using 29 UCI datasets, a well-known benchmark for validating feature reduction algorithms, through 10 independent tests and 10-fold cross-validation. Additionally, given that TemporalPaD is specifically designed for time series data, we apply it to a real-world DNA classification problem involving enhancer category and enhancer strength. The results demonstrate that TemporalPaD is an efficient and effective framework for achieving feature reduction, applicable to both structured data and sequence datasets. The source code of the proposed TemporalPaD is freely available as supplementary material to this article and at this http URL.

[AI-44] ASAG2024: A Combined Benchmark for Short Answer Grading

链接: https://arxiv.org/abs/2409.18596
作者: Gérôme Meyer,Philip Breuer,Jonathan Fürst
关键词-EN: Open-ended questions test, Open-ended questions, preferred assessment method, understanding than closed-ended, preferred assessment
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at SIGCSE-Virtual 2024

点击查看摘要

Abstract:Open-ended questions test a more thorough understanding than closed-ended questions and are often a preferred assessment method. However, open-ended questions are tedious to grade and subject to personal bias. Therefore, there have been efforts to speed up the grading process through automation. Short Answer Grading (SAG) systems aim to automatically score students’ answers. Despite growth in SAG methods and capabilities, there exists no comprehensive short-answer grading benchmark across different subjects, grading scales, and distributions. Thus, it is hard to assess the capabilities of current automated grading methods in terms of their generalizability. In this preliminary work, we introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems. Combining seven commonly used short-answer grading datasets in a common structure and grading scale. For our benchmark, we evaluate a set of recent SAG methods, revealing that while LLM-based approaches reach new high scores, they still are far from reaching human performance. This opens up avenues for future research on human-machine SAG systems.
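Putting seven datasets on a common grading scale, as the benchmark does, amounts to a rescaling along these lines. This is a simple linear mapping for illustration; the benchmark's exact normalization may differ.

```python
def to_common_scale(score, min_score, max_score):
    """Map a grade from its native scale onto [0, 1] so datasets with
    different grading scales become directly comparable."""
    if max_score <= min_score:
        raise ValueError("degenerate grading scale")
    return (score - min_score) / (max_score - min_score)
```

For example, a 3 on a 0-5 rubric and a 15 on a 10-20 rubric land at 0.6 and 0.5 on the common scale, so graders trained on one dataset can be evaluated on another.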

[AI-45] “Oh LLM, I’m Asking Thee, Please Give Me a Decision Tree”: Zero-Shot Decision Tree Induction and Embedding with Large Language Models

链接: https://arxiv.org/abs/2409.18594
作者: Ricardo Knauer,Mario Koddenbrock,Raphael Wallsberger,Nicholas M. Brisson,Georg N. Duda,Deborah Falla,David W. Evans,Erik Rodner
关键词-EN: Large language models, Large language, leverage prior knowledge, provide powerful, leverage prior
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) provide powerful means to leverage prior knowledge for predictive modeling when data is limited. In this work, we demonstrate how LLMs can use their compressed world knowledge to generate intrinsically interpretable machine learning models, i.e., decision trees, without any training data. We find that these zero-shot decision trees can surpass data-driven trees on some small-sized tabular datasets and that embeddings derived from these trees perform on par with data-driven tree-based embeddings on average. Our knowledge-driven decision tree induction and embedding approaches therefore serve as strong new baselines for data-driven machine learning methods in the low-data regime.
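A zero-shot tree of the kind described is just an ordinary decision tree whose splits come from prior knowledge rather than training data. A minimal sketch of such interpretable inference follows; the features and thresholds are invented for illustration, not produced by any LLM.

```python
# Hypothetical knowledge-derived tree (feature names and thresholds invented).
TREE = {
    "feature": "age", "threshold": 50,
    "left": {"leaf": "low risk"},
    "right": {
        "feature": "bmi", "threshold": 30,
        "left": {"leaf": "medium risk"},
        "right": {"leaf": "high risk"},
    },
}

def predict(tree, sample):
    """Walk the tree until a leaf is reached: fully interpretable inference,
    since every decision is a readable feature/threshold comparison."""
    while "leaf" not in tree:
        side = "left" if sample[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]
```

Because the structure is plain data, the same tree can also be serialized back into a prompt or turned into an embedding, as the paper's embedding approach suggests.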

[AI-46] Analysis of Truncated Singular Value Decomposition for Koopman Operator-Based Lane Change Model

链接: https://arxiv.org/abs/2409.18586
作者: Chinnawut Nantabut
关键词-EN: enhancing vehicle performance, modeling complex dynamic, Understanding and modeling, Dynamic Mode Decomposition, Extended Dynamic Mode
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Submitted to the 21st International Conference on Informatics in Control, Automation and Robotics (ICINCO 2024)

点击查看摘要

Abstract:Understanding and modeling complex dynamic systems is crucial for enhancing vehicle performance and safety, especially in the context of autonomous driving. Recently, popular methods such as Koopman operators and their approximators, known as Extended Dynamic Mode Decomposition (EDMD), have emerged for their effectiveness in transforming strongly nonlinear system behavior into linear representations. This allows them to be integrated with conventional linear controllers. To achieve this, Singular Value Decomposition (SVD), specifically truncated SVD, is employed to approximate Koopman operators from extensive datasets efficiently. This study evaluates different basis functions used in EDMD and ranks for truncated SVD for representing lane change behavior models, aiming to balance computational efficiency with information loss. The findings, however, suggest that the technique of truncated SVD does not necessarily achieve substantial reductions in computational training time and results in significant information loss.
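EDMD's first step, lifting states through a dictionary of basis functions before fitting a linear (Koopman) operator on the lifted snapshots via truncated SVD, can be sketched as below. The monomial basis is one common illustrative choice, not necessarily the basis the paper evaluates.

```python
def lift(state):
    """Lift a 2-D state into a monomial dictionary, as EDMD does before
    approximating the Koopman operator by least squares on lifted snapshot
    pairs (typically solved with a truncated SVD for efficiency)."""
    x, y = state
    return [1.0, x, y, x * x, x * y, y * y]

lifted = lift((2.0, 3.0))
```

The truncation rank then trades computational cost against the information loss the paper measures for lane-change models.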

[AI-47] An Enhanced Federated Prototype Learning Method under Domain Shift

链接: https://arxiv.org/abs/2409.18578
作者: Liang Kuang,Kuangpu Guo,Jian Liang,Jianguo Zhang
关键词-EN: machine learning training, collaborative machine learning, sharing private data, Federated Learning, Federated Prototype Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Federated Learning (FL) allows collaborative machine learning training without sharing private data. Numerous studies have shown that one significant factor affecting the performance of federated learning models is the heterogeneity of data across different clients, especially when the data is sampled from various domains. A recent paper introduces variance-aware dual-level prototype clustering and uses a novel α-sparsity prototype loss, which increases intra-class similarity and reduces inter-class similarity. To ensure that the features converge within specific clusters, we introduce an improved algorithm, Federated Prototype Learning with Convergent Clusters, abbreviated as FedPLCC. To increase inter-class distances, we weight each prototype with the size of the cluster it represents. To reduce intra-class distances, considering that prototypes with larger distances might come from different domains, we select only a certain proportion of prototypes for the loss function calculation. Evaluations on the Digit-5, Office-10, and DomainNet datasets show that our method performs better than existing approaches.
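The size-weighting of prototypes described above can be sketched as a weighted average; this is a toy illustration of that single idea, not the full FedPLCC loss or clustering procedure.

```python
def weighted_global_prototype(prototypes, cluster_sizes):
    """Aggregate prototypes, weighting each by the size of the cluster it
    represents, so prototypes backed by more samples pull the result harder."""
    total = sum(cluster_sizes)
    dim = len(prototypes[0])
    agg = [0.0] * dim
    for proto, size in zip(prototypes, cluster_sizes):
        for d in range(dim):
            agg[d] += (size / total) * proto[d]
    return agg
```

With clusters of sizes 1 and 3, the larger cluster's prototype contributes three quarters of the aggregate, which is exactly the inter-class-distance lever the paper describes.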

[AI-48] Experimental Evaluation of Machine Learning Models for Goal-oriented Customer Service Chatbot with Pipeline Architecture

链接: https://arxiv.org/abs/2409.18568
作者: Nurul Ain Nabilah Mohd Isa,Siti Nuraishah Agos Jawaddi,Azlan Ismail
关键词-EN: Integrating machine learning, Integrating machine, ultimately improving service, customer service chatbots, service chatbots enhances
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Integrating machine learning (ML) into customer service chatbots enhances their ability to understand and respond to user queries, ultimately improving service performance. However, they may appear artificial to some users, affecting the customer experience. Hence, meticulous evaluation of ML models for each pipeline component is crucial for optimizing performance, though differences in functionalities can lead to unfair comparisons. In this paper, we present a tailored experimental evaluation approach for goal-oriented customer service chatbots with pipeline architecture, focusing on three key components: Natural Language Understanding (NLU), dialogue management (DM), and Natural Language Generation (NLG). Our methodology emphasizes individual assessment to determine optimal ML models. Specifically, we focus on optimizing hyperparameters and evaluating candidate models for NLU (utilizing BERT and LSTM), DM (employing DQN and DDQN), and NLG (leveraging GPT-2 and DialoGPT). The results show that for the NLU component, BERT excelled in intent detection whereas LSTM was superior for slot filling. For the DM component, the DDQN model outperformed DQN by achieving fewer turns, higher rewards, as well as greater success rates. For NLG, the large language model GPT-2 surpassed DialoGPT in BLEU, METEOR, and ROUGE metrics. These findings aim to provide a benchmark for future research in developing and optimizing customer service chatbots, offering valuable insights into model performance and optimal hyperparameters.
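The three-stage pipeline under evaluation composes as in the sketch below. The toy stage implementations are stand-ins for the BERT/LSTM, DQN/DDQN, and GPT-2/DialoGPT candidates; the point is only that each stage has a separable interface and can be swapped and assessed individually.

```python
def nlu(utterance):
    # Toy keyword intent detector standing in for the BERT/LSTM NLU candidates.
    return ("order_status" if "order" in utterance else "greeting"), {}

def dm(state, intent, slots):
    # Toy dialogue policy standing in for the DQN/DDQN managers.
    return state, ("ask_order_id" if intent == "order_status" else "say_hello")

def nlg(action):
    # Toy template generator standing in for GPT-2/DialoGPT.
    templates = {
        "ask_order_id": "Could you share your order number?",
        "say_hello": "Hello! How can I help?",
    }
    return templates[action]

def run_turn(utterance, state=None):
    """One dialogue turn through the NLU -> DM -> NLG pipeline."""
    intent, slots = nlu(utterance)
    state, action = dm(state, intent, slots)
    return nlg(action)
```

Evaluating each stage in isolation, as the paper advocates, is straightforward precisely because the interfaces between stages are this narrow.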

[AI-49] Efficient Noise Mitigation for Enhancing Inference Accuracy in DNNs on Mixed-Signal Accelerators

链接: https://arxiv.org/abs/2409.18553
作者: Seyedarmin Azizi,Mohammad Erfan Sadeghi,Mehdi Kamal,Massoud Pedram
关键词-EN: analog computing components, analog neural networks, analog computing, framework to enhance, mitigating the effects
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a framework to enhance the robustness of the neural models by mitigating the effects of process-induced and aging-related variations of analog computing components on the accuracy of the analog neural networks. We model these variations as the noise affecting the precision of the activations and introduce a denoising block inserted between selected layers of a pre-trained model. We demonstrate that training the denoising block significantly increases the model’s robustness against various noise levels. To minimize the overhead associated with adding these blocks, we present an exploration algorithm to identify optimal insertion points for the denoising blocks. Additionally, we propose a specialized architecture to efficiently execute the denoising blocks, which can be integrated into mixed-signal accelerators. We evaluate the effectiveness of our approach using Deep Neural Network (DNN) models trained on the ImageNet and CIFAR-10 datasets. The results show that on average, by accepting 2.03% parameter count overhead, the accuracy drop due to the variations reduces from 31.7% to 1.15%.
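论文中的去噪块是插入在网络层间训练得到的神经模块;下面仅以一个玩具级的标量线性类比说明其原理(把带噪激活映射回干净值)。数据与形状均为虚构,真实实现是逐层的神经去噪块而非线性拟合:

```python
import random

def fit_denoiser(pairs):
    """Fit y = w*x + b by ordinary least squares on (noisy, clean) pairs --
    a linear stand-in for a learned denoising block."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    var = sum((x - mx) ** 2 for x, _ in pairs) / n
    w = cov / var
    return w, my - w * mx

random.seed(0)
clean = [random.uniform(-1, 1) for _ in range(2000)]      # clean activations
pairs = [(c + random.gauss(0, 0.5), c) for c in clean]    # noisy observations
w, b = fit_denoiser(pairs)
```

拟合出的增益 w 会收缩到 var(x)/(var(x)+σ²) ≈ 0.57 附近,即经典的 Wiener 收缩;论文中可学习的去噪块在预训练网络内部逐层扮演同样的角色。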

[AI-50] Research on Predicting Public Opinion Event Heat Levels Based on Large Language Models

链接: https://arxiv.org/abs/2409.18548
作者: Yi Ren,Tianyi Zhang,Weibin Li,DuoMu Zhou,Chenhao Qin,FangCheng Dong
关键词-EN: demonstrated extraordinary capabilities, surpassing human performance, event heat level, heat level prediction, heat level
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: conference

点击查看摘要

Abstract:In recent years, with the rapid development of large language models, several models such as GPT-4o have demonstrated extraordinary capabilities, surpassing human performance in various language tasks. As a result, many researchers have begun exploring their potential applications in the field of public opinion analysis. This study proposes a novel large-language-models-based method for public opinion event heat level prediction. First, we preprocessed and classified 62,836 Chinese hot event data collected between July 2022 and December 2023. Then, based on each event’s online dissemination heat index, we used the MiniBatchKMeans algorithm to automatically cluster the events and categorize them into four heat levels (ranging from low heat to very high heat). Next, we randomly selected 250 events from each heat level, totalling 1,000 events, to build the evaluation dataset. During the evaluation process, we employed various large language models to assess their accuracy in predicting event heat levels in two scenarios: without reference cases and with similar case references. The results showed that GPT-4o and DeepseekV2 performed the best in the latter case, achieving prediction accuracies of 41.4% and 41.5%, respectively. Although the overall prediction accuracy remains relatively low, it is worth noting that for low-heat (Level 1) events, the prediction accuracies of these two models reached 73.6% and 70.4%, respectively. Additionally, the prediction accuracy showed a downward trend from Level 1 to Level 4, which correlates with the uneven distribution of data across the heat levels in the actual dataset. This suggests that with a more robust dataset, public opinion event heat level prediction based on large language models will have significant research potential for the future.
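摘要中的聚类步骤(对标量热度指数做 MiniBatchKMeans、分为四档)可以用一个极简的一维 Lloyd k-means 示意;下面的热度数据为合成数据,真实流程应使用 scikit-learn 的 MiniBatchKMeans:

```python
import random

def kmeans_1d(values, k=4, iters=50):
    """Tiny 1-D Lloyd's k-means; centers start at evenly spaced quantiles
    for stability, then alternate assign / re-center steps."""
    vals = sorted(values)
    centers = [vals[int((2 * j + 1) * len(vals) / (2 * k))] for j in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vals:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            buckets[nearest].append(v)
        centers = [sum(b) / len(b) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return sorted(centers)

random.seed(0)
# synthetic dissemination-heat indices in four rough bands (low ... very high)
heat = [random.uniform(lo, hi)
        for lo, hi in [(0, 10), (20, 30), (50, 60), (90, 100)]
        for _ in range(25)]
levels = kmeans_1d(heat, k=4)  # one center per heat level, sorted low to high
```

每个事件随后可按最近中心归入 Level 1–4,对应摘要中"低热度到超高热度"的四档划分。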

[AI-51] An Epistemic Human-Aware Task Planner which Anticipates Human Beliefs and Decisions

链接: https://arxiv.org/abs/2409.18545
作者: Shashank Shekhar,Anthony Favier,Rachid Alami
关键词-EN: Human-Aware Task Planning, intermittent shared execution, significant belief divergence, uncontrollable human behaviors, tailored for scenarios
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 15 pages, 4 figures, 1 table

点击查看摘要

Abstract:We present a substantial extension of our Human-Aware Task Planning framework, tailored for scenarios with intermittent shared execution experiences and significant belief divergence between humans and robots, particularly due to the uncontrollable nature of humans. Our objective is to build a robot policy that accounts for uncontrollable human behaviors, thus enabling the anticipation of possible advancements achieved by the robot when the execution is not shared, e.g. when humans are briefly absent from the shared environment to complete a subtask. But, this anticipation is considered from the perspective of humans who have access to an estimated model for the robot. To this end, we propose a novel planning framework and build a solver based on AND-OR search, which integrates knowledge reasoning, including situation assessment by perspective taking. Our approach dynamically models and manages the expansion and contraction of potential advances while precisely keeping track of when (and when not) agents share the task execution experience. The planner systematically assesses the situation and ignores worlds that it has reason to think are impossible for humans. Overall, our new solver can estimate the distinct beliefs of the human and the robot along potential courses of action, enabling the synthesis of plans where the robot selects the right moment for communication, i.e. informing, or replying to an inquiry, or defers ontic actions until the execution experiences can be shared. Preliminary experiments in two domains, one novel and one adapted, demonstrate the effectiveness of the framework.

[AI-52] Align2LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

链接: https://arxiv.org/abs/2409.18541
作者: Hongzhe Huang,Zhewen Yu,Jiang Liu,Li Cai,Dian Jiao,Wenqiao Zhang,Siliang Tang,Juncheng Li,Hao Jiang,Haoyuan Li,Yueting Zhuang
关键词-EN: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Recent advances
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions to a compact and high-quality form: (i) For human preference alignment, we have collected a machine-generated multimodal instruction dataset and established a comprehensive set of both subjective and objective criteria to guide the data quality assessment critically from human experts. By doing so, a reward model was trained on the annotated dataset to internalize the nuanced human understanding of instruction alignment. (ii) For LLM preference alignment, given the instruction selected by the reward model, we propose leveraging the inner LLM used in MLLM to align the writing style of visual instructions with that of the inner LLM itself, resulting in LLM-aligned instruction improvement. Extensive experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%. Impressively, by aggressively reducing the total training sample size from 158k to 14k (9× smaller), our model consistently outperforms its full-size dataset counterpart across various MLLM benchmarks. Our project is available at this https URL.

[AI-53] EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis

链接: https://arxiv.org/abs/2409.18512
作者: Haoyu Wang,Chunyu Qiang,Tianrui Wang,Cheng Gong,Qiuyu Liu,Yu Jiang,Xiaobao Wang,Chenyang Wang,Chen Zhang
关键词-EN: remarkable zero-shot capabilities, demonstrated remarkable zero-shot, Recent advancements, trained on extensive, extensive datasets
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recent advancements in speech synthesis models, trained on extensive datasets, have demonstrated remarkable zero-shot capabilities. These models can control content, timbre, and emotion in generated speech based on prompt inputs. Despite these advancements, the choice of prompts significantly impacts the output quality, yet most existing selection schemes do not adequately address the control of emotional intensity. To address this question, this paper proposes a two-stage prompt selection strategy EmoPro, which is specifically designed for emotionally controllable speech synthesis. This strategy focuses on selecting highly expressive and high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected using the proposed method result in more emotionally expressive and engaging synthesized speech compared to those obtained through the baseline. Audio samples and codes will be available at this https URL.
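摘要提到的四个评估维度可以被想象为对候选 prompt 的加权打分。下面的权重与指标数值完全是假设性的示意(论文的实际两阶段打分更复杂,并非这种简单加权和):

```python
def prompt_score(metrics, weights):
    """Hypothetical score: weighted sum over the four perspectives the
    abstract names (all weights/values invented for illustration)."""
    keys = ("expression", "quality", "consistency", "generation")
    return sum(weights[k] * metrics[k] for k in keys)

candidates = {
    "p1": {"expression": 0.9, "quality": 0.8, "consistency": 0.7, "generation": 0.9},
    "p2": {"expression": 0.4, "quality": 0.9, "consistency": 0.6, "generation": 0.5},
}
w = {"expression": 0.4, "quality": 0.2, "consistency": 0.2, "generation": 0.2}
best = max(candidates, key=lambda p: prompt_score(candidates[p], w))  # -> "p1"
```

示例中情感表达力权重最高,因此表达力强的 p1 胜出,对应论文"优先选择高表达力、高质量 prompt"的思路。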

[AI-54] Fairness-aware Multiobjective Evolutionary Learning

链接: https://arxiv.org/abs/2409.18499
作者: Qingquan Zhang,Jialin Liu,Xin Yao
关键词-EN: Multiobjective evolutionary learning, fairer machine learning, Multiobjective evolutionary, machine learning models, training fairer machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:Multiobjective evolutionary learning (MOEL) has demonstrated its advantages of training fairer machine learning models considering a predefined set of conflicting objectives, including accuracy and different fairness measures. Recent works propose to construct a representative subset of fairness measures as optimisation objectives of MOEL throughout model training. However, the determination of a representative measure set relies on dataset, prior knowledge and requires substantial computational costs. What’s more, those representative measures may differ across different model training processes. Instead of using a static predefined set determined before model training, this paper proposes to dynamically and adaptively determine a representative measure set online during model training. The dynamically determined representative set is then used as optimising objectives of the MOEL framework and can vary with time. Extensive experimental results on 12 well-known benchmark datasets demonstrate that our proposed framework achieves outstanding performance compared to state-of-the-art approaches for mitigating unfairness in terms of accuracy as well as 25 fairness measures although only a few of them were dynamically selected and used as optimisation objectives. The results indicate the importance of setting optimisation objectives dynamically during training.

[AI-55] Data Analysis in the Era of Generative AI

链接: https://arxiv.org/abs/2409.18475
作者: Jeevana Priya Inala,Chenglong Wang,Steven Drucker,Gonzalo Ramos,Victor Dibia,Nathalie Riche,Dave Brown,Dan Marshall,Jianfeng Gao
关键词-EN: reshape data analysis, data analysis workflow, potential of AI-powered, AI-powered tools, tools to reshape
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow by translating high-level user intentions into executable code, charts, and insights. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps. Finally, we discuss the research challenges that impede the development of these AI-based systems such as enhancing model capabilities, evaluating and benchmarking, and understanding end-user needs.

[AI-56] Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration NEURIPS2024

链接: https://arxiv.org/abs/2409.18461
作者: Mahdi Morafah,Vyacheslav Kungurtsev,Hojin Chang,Chen Chen,Bill Lin
关键词-EN: user data privacy, preserving user data, collaborative machine learning, Federated Learning, data privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Federated Learning has emerged as a promising paradigm for collaborative machine learning, while preserving user data privacy. Despite its potential, standard FL lacks support for diverse heterogeneous device prototypes, which vary significantly in model and dataset sizes – from small IoT devices to large workstations. This limitation is only partially addressed by existing knowledge distillation techniques, which often fail to transfer knowledge effectively across a broad spectrum of device prototypes with varied capabilities. This failure primarily stems from two issues: the dilution of informative logits from more capable devices by those from less capable ones, and the use of a single integrated logits as the distillation target across all devices, which neglects their individual learning capacities and the unique contributions of each. To address these challenges, we introduce TAKFL, a novel KD-based framework that treats the knowledge transfer from each device prototype’s ensemble as a separate task, independently distilling each to preserve its unique contributions and avoid dilution. TAKFL also incorporates a KD-based self-regularization technique to mitigate the issues related to the noisy and unsupervised ensemble distillation process. To integrate the separately distilled knowledge, we introduce an adaptive task arithmetic knowledge integration process, allowing each student model to customize the knowledge integration for optimal performance. Additionally, we present theoretical results demonstrating the effectiveness of task arithmetic in transferring knowledge across heterogeneous devices with varying capacities. Comprehensive evaluations of our method across both CV and NLP tasks demonstrate that TAKFL achieves SOTA results in a variety of datasets and settings, significantly outperforming existing KD-based methods. Code is released at this https URL
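摘要中的"task arithmetic 知识整合"可以在展平的参数向量上示意:每个设备原型蒸馏出的知识构成一个相对基础权重的"任务向量",再按各自权重混合回学生模型。以下只是该思想的示意,并非论文的实际实现:

```python
def task_arithmetic_merge(base, task_models, weights):
    """merged = base + sum_i w_i * (model_i - base): each distilled model
    contributes a weighted task vector relative to the base weights."""
    merged = list(base)
    for model, w in zip(task_models, weights):
        for j, (p, b) in enumerate(zip(model, base)):
            merged[j] += w * (p - b)
    return merged

base = [1.0, 1.0]
distilled = [[2.0, 1.0],   # e.g. knowledge from a small-device ensemble
             [1.0, 3.0]]   # e.g. knowledge from a workstation ensemble
merged = task_arithmetic_merge(base, distilled, weights=[0.5, 0.5])  # -> [1.5, 2.0]
```

论文中的"自适应"之处在于每个学生模型可以自行调整这组混合权重,而非使用固定的 0.5/0.5。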

[AI-57] Review of Digital Asset Development with Graph Neural Network Unlearning

链接: https://arxiv.org/abs/2409.18455
作者: Zara Lisbon
关键词-EN: rapidly evolving landscape, Graph Neural Networks, robust data privacy, rapidly evolving, evolving landscape
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of digital assets, the imperative for robust data privacy and compliance with regulatory frameworks has intensified. This paper investigates the critical role of Graph Neural Networks (GNNs) in the management of digital assets and introduces innovative unlearning techniques specifically tailored to GNN architectures. We categorize unlearning strategies into two primary classes: data-driven approximation, which manipulates the graph structure to isolate and remove the influence of specific nodes, and model-driven approximation, which modifies the internal parameters and architecture of the GNN itself. By examining recent advancements in these unlearning methodologies, we highlight their applicability in various use cases, including fraud detection, risk assessment, token relationship prediction, and decentralized governance. We discuss the challenges inherent in balancing model performance with the requirements for data unlearning, particularly in the context of real-time financial applications. Furthermore, we propose a hybrid approach that combines the strengths of both unlearning strategies to enhance the efficiency and effectiveness of GNNs in digital asset ecosystems. Ultimately, this paper aims to provide a comprehensive framework for understanding and implementing GNN unlearning techniques, paving the way for secure and compliant deployment of machine learning in the digital asset domain.

[AI-58] Leveraging Long-Context Large Language Models for Multi-Document Understanding and Summarization in Enterprise Applications

链接: https://arxiv.org/abs/2409.18454
作者: Aditi Godbole,Jabin Geevarghese George,Smita Shandilya
关键词-EN: made multi-document comprehension, critical task, Long-context Large Language, rapid increase, increase in unstructured
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid increase in unstructured data across various fields has made multi-document comprehension and summarization a critical task. Traditional approaches often fail to capture relevant context, maintain logical consistency, and extract essential information from lengthy documents. This paper explores the use of Long-context Large Language Models (LLMs) for multi-document summarization, demonstrating their exceptional capacity to grasp extensive connections, provide cohesive summaries, and adapt to various industry domains and integration with enterprise applications/systems. The paper discusses the workflow of multi-document summarization for effectively deploying long-context LLMs, supported by case studies in legal applications, enterprise functions such as HR, finance, and sourcing, as well as in the medical and news domains. These case studies show notable enhancements in both efficiency and accuracy. Technical obstacles, such as dataset diversity, model scalability, and ethical considerations like bias mitigation and factual accuracy, are carefully analyzed. Prospective research avenues are suggested to augment the functionalities and applications of long-context LLMs, establishing them as pivotal tools for transforming information processing across diverse sectors and enterprise applications.

[AI-59] Cost-Aware Dynamic Cloud Workflow Scheduling using Self-Attention and Evolutionary Reinforcement Learning

链接: https://arxiv.org/abs/2409.18444
作者: Ya Shen,Gang Chen,Hui Ma,Mengjie Zhang
关键词-EN: Service Level Agreement, Cost-aware Dynamic Multi-Workflow, violating Service Level, Dynamic Multi-Workflow Scheduling, Level Agreement
类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by ICSOC (International Conference on Service-Oriented Computing) 2024

点击查看摘要

Abstract:The Cost-aware Dynamic Multi-Workflow Scheduling (CDMWS) in the cloud is a kind of cloud workflow management problem, which aims to assign virtual machine (VM) instances to execute tasks in workflows so as to minimize the total costs, including both the penalties for violating Service Level Agreement (SLA) and the VM rental fees. Powered by deep neural networks, Reinforcement Learning (RL) methods can construct effective scheduling policies for solving CDMWS problems. Traditional policy networks in RL often use basic feedforward architectures to separately determine the suitability of assigning any VM instances, without considering all VMs simultaneously to learn their global information. This paper proposes a novel self-attention policy network for cloud workflow scheduling (SPN-CWS) that captures global information from all VMs. We also develop an Evolution Strategy-based RL (ERL) system to train SPN-CWS reliably and effectively. The trained SPN-CWS can effectively process all candidate VM instances simultaneously to identify the most suitable VM instance to execute every workflow task. Comprehensive experiments show that our method can noticeably outperform several state-of-the-art algorithms on multiple benchmark CDMWS problems.
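摘要的核心直觉——同时观察所有候选 VM 再为每个 VM 打分——可以用缩放点积注意力在玩具特征向量上示意;下面的特征与维度均为虚构:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def vm_attention(task, vms):
    """Scaled dot-product scores: each VM's suitability for a task is
    normalised against every other VM, not judged in isolation."""
    d = len(task)
    logits = [sum(q * k for q, k in zip(task, vm)) / math.sqrt(d) for vm in vms]
    return softmax(logits)

task = [1.0, 0.2]                            # toy task features (e.g. urgency, size)
vms = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # toy per-VM features
scores = vm_attention(task, vms)             # probability over candidate VMs
```

与逐一独立评估每个 VM 的前馈策略网络不同,softmax 归一化使每个 VM 的得分天然依赖全体候选者。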

[AI-60] State-free Reinforcement Learning

链接: https://arxiv.org/abs/2409.18439
作者: Mingyu Chen,Aldo Pacchiano,Xuezhou Zhang
关键词-EN: reachable state set, states information, state space, information before interacting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we study the \textit{state-free RL} problem, where the algorithm does not have the state information before interacting with the environment. Specifically, denoting the reachable state set by S^\Pi := \{ s \mid \max_{\pi \in \Pi} q^{P,\pi}(s) > 0 \}, we design an algorithm which requires no information on the state space S while having a regret that is completely independent of S and only depends on S^\Pi. We view this as a concrete first step towards \textit{parameter-free RL}, with the goal of designing RL algorithms that require no hyper-parameter tuning.

[AI-61] Physics Augmented Tuple Transformer for Autism Severity Level Detection

链接: https://arxiv.org/abs/2409.18438
作者: Chinthaka Ranasingha,Harshala Gammulle,Tharindu Fernando,Sridha Sridharan,Clinton Fookes
关键词-EN: Autism Spectrum Disorder, Spectrum Disorder, Autism Spectrum, ASD diagnosis, ASD
类目: Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Early diagnosis of Autism Spectrum Disorder (ASD) is an effective and favorable step towards enhancing the health and well-being of children with ASD. Manual ASD diagnosis testing is labor-intensive, complex, and prone to human error due to several factors contaminating the results. This paper proposes a novel framework that exploits the laws of physics for ASD severity recognition. The proposed physics-informed neural network architecture encodes the behaviour of the subject extracted by observing a part of the skeleton-based motion trajectory in a higher dimensional latent space. Two decoders, namely physics-based and non-physics-based decoder, use this latent embedding and predict the future motion patterns. The physics branch leverages the laws of physics that apply to a skeleton sequence in the prediction process while the non-physics-based branch is optimised to minimise the difference between the predicted and actual motion of the subject. A classifier also leverages the same latent space embeddings to recognise the ASD severity. This dual generative objective explicitly forces the network to compare the actual behaviour of the subject with the general normal behaviour of children that are governed by the laws of physics, aiding the ASD recognition task. The proposed method attains state-of-the-art performance on multiple ASD diagnosis benchmarks. To illustrate the utility of the proposed framework beyond the task of ASD diagnosis, we conduct a third experiment using a publicly available benchmark for the task of fall prediction and demonstrate the superiority of our model.

[AI-62] Multi-agent Reinforcement Learning for Dynamic Dispatching in Material Handling Systems

链接: https://arxiv.org/abs/2409.18435
作者: Xian Yeow Lee,Haiyan Wang,Daisuke Katsumata,Takaharu Matsui,Chetan Gupta
关键词-EN: multi-agent reinforcement learning, material handling systems, diverse industries, material handling, multi-agent reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:This paper proposes a multi-agent reinforcement learning (MARL) approach to learn dynamic dispatching strategies, which is crucial for optimizing throughput in material handling systems across diverse industries. To benchmark our method, we developed a material handling environment that reflects the complexities of an actual system, such as various activities at different locations, physical constraints, and inherent uncertainties. To enhance exploration during learning, we propose a method to integrate domain knowledge in the form of existing dynamic dispatching heuristics. Our experimental results show that our method can outperform heuristics by up to 7.4 percent in terms of median throughput. Additionally, we analyze the effect of different architectures on MARL performance when training multiple agents with different functions. We also demonstrate that the MARL agents’ performance can be further improved by using the first iteration of MARL agents as heuristics to train a second iteration of MARL agents. This work demonstrates the potential of applying MARL to learn effective dynamic dispatching strategies that may be deployed in real-world systems to improve business outcomes.

[AI-63] Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization NEURIPS2024

链接: https://arxiv.org/abs/2409.18433
作者: Mucong Ding,Chenghao Deng,Jocelyn Choo,Zichu Wu,Aakriti Agrawal,Avi Schwarzschild,Tianyi Zhou,Tom Goldstein,John Langford,Anima Anandkumar,Furong Huang
关键词-EN: profile language models, fine-grained difficulty annotations, tasks from easy, easy to hard, hard is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:While generalization over tasks from easy to hard is crucial to profile language models (LLMs), the datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with numerical difficulty scores. To systematically estimate problem difficulties, we collect abundant performance data on attempts at each problem by humans in the real world or LLMs on the prominent leaderboard. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at this https URL.
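为直观说明"通过表现数据估计难度",下面给出最简单的 IRT 风格估计:将答题者能力固定为零后,1PL(Rasch)模型的最大似然难度退化为"失败的对数几率"。基准本身使用完整的 IRT 与 Glicko-2 拟合,以下数字均为虚构示例:

```python
import math

def rasch_difficulty(correct, total):
    """1PL difficulty with abilities fixed at 0:
    p(correct) = sigmoid(-b)  =>  b = log((1 - p) / p)."""
    p = min(max(correct / total, 1e-6), 1 - 1e-6)  # clamp all-right/all-wrong items
    return math.log((1 - p) / p)

# made-up attempt counts: (correct, total) per problem
attempts = {"warmup_sum": (90, 100), "chess_puzzle": (20, 100)}
difficulty = {name: rasch_difficulty(c, n) for name, (c, n) in attempts.items()}
```

正确率 90% 的题目难度约为 log(1/9) ≈ -2.2(偏易),正确率 20% 的题目约为 log(4) ≈ +1.4(偏难),从而得到一条连续的数值难度轴。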

[AI-64] A3: Active Adversarial Alignment for Source-Free Domain Adaptation ICML

链接: https://arxiv.org/abs/2409.18418
作者: Chrisantus Eze,Christopher Crick
关键词-EN: Unsupervised domain adaptation, aims to transfer, source-free UDA, Unsupervised domain, transfer knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICMLA 2024

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent works have focused on source-free UDA, where only target data is available. This is challenging as models rely on noisy pseudo-labels and struggle with distribution shifts. We propose Active Adversarial Alignment (A3), a novel framework combining self-supervised learning, adversarial training, and active learning for robust source-free UDA. A3 actively samples informative and diverse data using an acquisition function for training. It adapts models via adversarial losses and consistency regularization, aligning distributions without source data access. A3 advances source-free UDA through its synergistic integration of active and adversarial learning for effective domain alignment and noise reduction.

[AI-65] VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback

链接: https://arxiv.org/abs/2409.18417
作者: Guoxi Zhang,Jiuding Duan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
*备注: 16 pages, 5 figures

点击查看摘要

[AI-66] SciDFM: A Large Language Model with Mixture-of-Experts for Science

链接: https://arxiv.org/abs/2409.18412
作者: Liangtai Sun,Danyu Luo,Da Ma,Zihan Zhao,Baocai Chen,Zhennan Shen,Su Zhu,Lu Chen,Xin Chen,Kai Yu
关键词-EN: leveraging large language, assist scientific discovery, amino acid sequences, large language models, significant upsurge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 12 pages, 1 figure, 9 tables. Technical Report, Under Review

点击查看摘要

Abstract:Recently, there has been a significant upsurge of interest in leveraging large language models (LLMs) to assist scientific discovery. However, most LLMs only focus on general science, while they lack domain-specific knowledge, such as chemical molecules and amino acid sequences. To bridge these gaps, we introduce SciDFM, a mixture-of-experts LLM, which is trained from scratch and is able to conduct college-level scientific reasoning and understand molecules and amino acid sequences. We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines as well as data from domain-specific databases. We further fine-tune the pre-trained model on lots of instruction data to improve performances on downstream benchmarks. From experiment results, we show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches a SOTA performance on domain-specific benchmarks among models of similar size. We further analyze the expert layers and show that the results of expert selection vary with data from different disciplines. To benefit the broader research community, we open-source SciDFM at this https URL.

[AI-67] BoT-Drive: Hierarchical Behavior and Trajectory Planning for Autonomous Driving using POMDPs

链接: https://arxiv.org/abs/2409.18411
作者: Xuanjin Jin,Chendong Zeng,Shengfa Zhu,Chunxiao Liu,Panpan Cai
关键词-EN: dynamic road environments, road environments pose, Markov Decision Process, Partially Observable Markov, Observable Markov Decision
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Uncertainties in dynamic road environments pose significant challenges for behavior and trajectory planning in autonomous driving. This paper introduces BoT-Drive, a planning algorithm that addresses uncertainties at both behavior and trajectory levels within a Partially Observable Markov Decision Process (POMDP) framework. BoT-Drive employs driver models to characterize unknown behavioral intentions and utilizes their model parameters to infer hidden driving styles. By also treating driver models as decision-making actions for the autonomous vehicle, BoT-Drive effectively tackles the exponential complexity inherent in POMDPs. To enhance safety and robustness, the planner further applies importance sampling to refine the driving trajectory conditioned on the planned high-level behavior. Evaluation on real-world data shows that BoT-Drive consistently outperforms both existing planning methods and learning-based methods in regular and complex urban driving scenes, demonstrating significant improvements in driving safety and reliability.

[AI-68] GenesisTex2: Stable Consistent and High-Quality Text-to-Texture Generation

链接: https://arxiv.org/abs/2409.18401
作者: Jiawei Lu,Yingpeng Zhang,Zengjun Zhao,He Wang,Kun Zhou,Tianjia Shao
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-69] Multimodal Trajectory Prediction for Autonomous Driving on Unstructured Roads using Deep Convolutional Network

链接: https://arxiv.org/abs/2409.18399
作者: Lei Li,Zhifa Chen,Jian Wang,Bin Zhou,Guizhen Yu,Xiaoxuan Chen
关键词-EN: efficient mineral transportation, garnered increasing attention, mineral transportation, garnered increasing, increasing attention
类目: Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Recently, the application of autonomous driving in open-pit mining has garnered increasing attention for achieving safe and efficient mineral transportation. Compared to urban structured roads, unstructured roads in mining sites have uneven boundaries and lack clearly defined lane markings. This leads to a lack of sufficient constraint information for predicting the trajectories of other human-driven vehicles, resulting in higher uncertainty in trajectory prediction problems. A method is proposed to predict multiple possible trajectories and their probabilities of the target vehicle. The surrounding environment and historical trajectories of the target vehicle are encoded as a rasterized image, which is used as input to our deep convolutional network to predict the target vehicle’s multiple possible trajectories. The method underwent offline testing on a dataset specifically designed for autonomous driving scenarios in open-pit mining and was compared and evaluated against physics-based method. The open-source code and data are available at this https URL

[AI-70] Code Vulnerability Repair with Large Language Model using Context-Aware Prompt Tuning

链接: https://arxiv.org/abs/2409.18395
作者: Arshiya Khan,Guannan Liu,Xing Gao
关键词-EN: Large Language Models, Large Language, Language Models, involving multiple aspects, shown significant challenges
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant challenges in detecting and repairing vulnerable code, particularly when dealing with vulnerabilities involving multiple aspects, such as variables, code flows, and code structures. In this study, we utilize GitHub Copilot as the LLM and focus on buffer overflow vulnerabilities. Our experiments reveal a notable gap in Copilot’s abilities when dealing with buffer overflow vulnerabilities, with a 76% vulnerability detection rate but only a 15% vulnerability repair rate. To address this issue, we propose context-aware prompt tuning techniques designed to enhance LLM performance in repairing buffer overflow. By injecting a sequence of domain knowledge about the vulnerability, including various security and code contexts, we demonstrate that Copilot’s successful repair rate increases to 63%, representing more than four times the improvement compared to repairs without domain knowledge.
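
摘要中描述的“上下文感知提示调优”可以用一个极简的提示组装草图来示意:把漏洞的安全上下文与代码上下文逐层注入修复请求。论文并未给出与 Copilot 交互的具体提示词,下面的措辞、字段名均为假设性示例:

```python
# Illustrative sketch of "context-aware prompt tuning": prepend layered
# domain knowledge (security context, code context) to a repair request.
# The exact prompt wording used with Copilot is not given in the summary,
# so everything below is a hypothetical construction.
def build_repair_prompt(code, security_context, code_context):
    parts = [
        "You are repairing a buffer overflow vulnerability.",
        f"Security context: {security_context}",
        f"Code context: {code_context}",
        "Vulnerable code:",
        code,
        "Produce a fixed version that bounds all writes.",
    ]
    return "\n".join(parts)

prompt = build_repair_prompt(
    code="strcpy(buf, user_input);",
    security_context="CWE-121: stack-based buffer overflow via unbounded copy.",
    code_context="buf is a fixed-size 64-byte stack array.",
)
print(prompt)
```

论文的关键发现正是:提示中注入的领域知识越充分(从 15% 提升到 63% 的修复率),模型表现越好;上面的分层结构只示意这一注入顺序。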

[AI-71] Speech to Reality: On-Demand Production using Natural Language 3D Generative AI and Discrete Robotic Assembly DATE

链接: https://arxiv.org/abs/2409.18390
作者: Alexander Htet Kyaw,Se Hwan Jeon,Miana Smith,Neil Gershenfeld
关键词-EN: generative Artificial Intelligence, Artificial Intelligence, generative Artificial, Artificial, Intelligence
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. An updated version will replace this version

点击查看摘要

Abstract:We present a system that transforms speech into physical objects by combining 3D generative Artificial Intelligence with robotic assembly. The system leverages natural language input to make design and manufacturing more accessible, enabling individuals without expertise in 3D modeling or robotic programming to create physical objects. We propose utilizing discrete robotic assembly of lattice-based voxel components to address the challenges of using generative AI outputs in physical production, such as design variability, fabrication speed, structural integrity, and material waste. The system interprets speech to generate 3D objects, discretizes them into voxel components, computes an optimized assembly sequence, and generates a robotic toolpath. The results are demonstrated through the assembly of various objects, ranging from chairs to shelves, which are prompted via speech and realized within 5 minutes using a 6-axis robotic arm.

[AI-72] Robo-CSK-Organizer: Commonsense Knowledge to Organize Detected Objects for Multipurpose Robots

链接: https://arxiv.org/abs/2409.18385
作者: Rafael Hidalgo,Jesse Parron,Aparna S. Varde,Weitian Wang
关键词-EN: infuses commonsense knowledge, classical knowledge based, context recognition capabilities, task-relevant manner, commonsense knowledge
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a system called Robo-CSK-Organizer that infuses commonsense knowledge from a classical knowledge based to enhance the context recognition capabilities of robots so as to facilitate the organization of detected objects by classifying them in a task-relevant manner. It is particularly useful in multipurpose robotics. Unlike systems relying solely on deep learning tools such as ChatGPT, the Robo-CSK-Organizer system stands out in multiple avenues as follows. It resolves ambiguities well, and maintains consistency in object placement. Moreover, it adapts to diverse task-based classifications. Furthermore, it contributes to explainable AI, hence helping to improve trust and human-robot collaboration. Controlled experiments performed in our work, simulating domestic robotics settings, make Robo-CSK-Organizer demonstrate superior performance while placing objects in contextually relevant locations. This work highlights the capacity of an AI-based system to conduct commonsense-guided decision-making in robotics closer to the thresholds of human cognition. Hence, Robo-CSK-Organizer makes positive impacts on AI and robotics.

[AI-73] Multi-hypotheses Conditioned Point Cloud Diffusion for 3D Human Reconstruction from Occluded Images NEURIPS2024

链接: https://arxiv.org/abs/2409.18364
作者: Donghwan Kim,Tae-Kyun Kim
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures, accepted NeurIPS 2024

点击查看摘要

[AI-74] Tracking Software Security Topics

链接: https://arxiv.org/abs/2409.18351
作者: Phong Minh Vu,Tung Thanh Nguyen
关键词-EN: incidents occur everyday, Software security, Software security incidents, security incidents occur, software security reports
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Software security incidents occur everyday and thousands of software security reports are announced each month. Thus, it is difficult for software security researchers, engineers, and other stakeholders to follow software security topics of their interests in real-time. In this paper, we propose, SOSK, a novel tool for this problem. SOSK allows a user to import a collection of software security reports. It pre-processes and extracts the most important keywords from the textual description of the reports. Based on the similarity of embedding vectors of keywords, SOSK can expand and/or refine a keyword set from a much smaller set of user-provided keywords. Thus, SOSK allows users to define any topic of their interests and retrieve security reports relevant to that topic effectively. Our preliminary evaluation shows that SOSK can expand keywords and retrieve reports relevant to user requests.
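
SOSK 的核心步骤——基于关键词嵌入向量相似度扩展用户给定的关键词集合——可以用一个玩具草图示意。这里的词表、嵌入向量与阈值都是假设的示例,论文使用的真实嵌入模型并未在摘要中给出:

```python
import math

# Toy sketch of SOSK-style keyword expansion: grow a small user-provided
# seed set with vocabulary keywords whose embeddings are close to a seed.
# Embedding vectors and the threshold below are made-up illustrations.
EMBEDDINGS = {
    "overflow":  [0.9, 0.1, 0.0],
    "buffer":    [0.8, 0.2, 0.1],
    "injection": [0.1, 0.9, 0.2],
    "sql":       [0.0, 0.8, 0.3],
    "phishing":  [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def expand_keywords(seeds, threshold=0.9):
    """Return seeds plus vocabulary keywords similar to any seed."""
    expanded = set(seeds)
    for seed in seeds:
        for word, vec in EMBEDDINGS.items():
            if word not in expanded and cosine(EMBEDDINGS[seed], vec) >= threshold:
                expanded.add(word)
    return expanded

print(expand_keywords({"overflow"}))  # "buffer" joins; "phishing" does not
```

扩展后的关键词集合即可用于检索与该主题相关的安全报告。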

[AI-75] A Generalized LLM-Augmented BIM Framework: Application to a Speech-to-BIM system

链接: https://arxiv.org/abs/2409.18345
作者: Ghang Lee,Suhyung Jang,Seokho Hyun
关键词-EN: Performing building information, building information modeling, steep learning curve, heavy cognitive load, cognitive load due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Performing building information modeling (BIM) tasks is a complex process that imposes a steep learning curve and a heavy cognitive load due to the necessity of remembering sequences of numerous commands. With the rapid advancement of large language models (LLMs), it is foreseeable that BIM tasks, including querying and managing BIM data, 4D and 5D BIM, design compliance checking, or authoring a design, using written or spoken natural language (i.e., text-to-BIM or speech-to-BIM), will soon supplant traditional graphical user interfaces. This paper proposes a generalized LLM-augmented BIM framework to expedite the development of LLM-enhanced BIM applications by providing a step-by-step development process. The proposed framework consists of six steps: interpret-fill-match-structure-execute-check. The paper demonstrates the applicability of the proposed framework through implementing a speech-to-BIM application, NADIA-S (Natural-language-based Architectural Detailing through Interaction with Artificial Intelligence via Speech), using exterior wall detailing as an example.

[AI-76] Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

链接: https://arxiv.org/abs/2409.18343
作者: Zhenghao Peng,Wenjie Luo,Yiren Lu,Tianyi Shen,Cole Gulino,Ari Seff,Justin Fu
关键词-EN: critical applications including, applications including constructing, including constructing realistic, traffic agents motion, forecasting traffic agents
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A major challenge in autonomous vehicle research is modeling agent behaviors, which has critical applications including constructing realistic and reliable simulations for off-board evaluation and forecasting traffic agents motion for onboard planning. While supervised learning has shown success in modeling agents across various domains, these models can suffer from distribution shift when deployed at test-time. In this work, we improve the reliability of agent behaviors by closed-loop fine-tuning of behavior models with reinforcement learning. Our method demonstrates improved overall performance, as well as improved targeted metrics such as collision rate, on the Waymo Open Sim Agents challenge. Additionally, we present a novel policy evaluation benchmark to directly assess the ability of simulated agents to measure the quality of autonomous vehicle planners and demonstrate the effectiveness of our approach on this new benchmark.

[AI-77] AER-LLM: Ambiguity-aware Emotion Recognition Leveraging Large Language Models

链接: https://arxiv.org/abs/2409.18339
作者: Xin Hong,Yuan Gong,Vidhyasaharan Sethu,Ting Dang
关键词-EN: Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated great success in many Natural Language Processing (NLP) tasks. In addition to their cognitive intelligence, exploring their capabilities in emotional intelligence is also crucial, as it enables more natural and empathetic conversational AI. Recent studies have shown LLMs’ capability in recognizing emotions, but they often focus on single emotion labels and overlook the complex and ambiguous nature of human emotions. This study is the first to address this gap by exploring the potential of LLMs in recognizing ambiguous emotions, leveraging their strong generalization capabilities and in-context learning. We design zero-shot and few-shot prompting and incorporate past dialogue as context information for ambiguous emotion recognition. Experiments conducted using three datasets indicate significant potential for LLMs in recognizing ambiguous emotions, and highlight the substantial benefits of including context information. Furthermore, our findings indicate that LLMs demonstrate a high degree of effectiveness in recognizing less ambiguous emotions and exhibit potential for identifying more ambiguous emotions, paralleling human perceptual capabilities.

[AI-78] A Fairness-Driven Method for Learning Human-Compatible Negotiation Strategies EMNLP

链接: https://arxiv.org/abs/2409.18335
作者: Ryan Shea,Zhou Yu
关键词-EN: recent advancements, remains a difficult, difficult domain, NLP, negotiation
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP Findings 2024

点击查看摘要

Abstract:Despite recent advancements in AI and NLP, negotiation remains a difficult domain for AI agents. Traditional game theoretic approaches that have worked well for two-player zero-sum games struggle in the context of negotiation due to their inability to learn human-compatible strategies. On the other hand, approaches that only use human data tend to be domain-specific and lack the theoretical guarantees provided by strategies grounded in game theory. Motivated by the notion of fairness as a criterion for optimality in general sum games, we propose a negotiation framework called FDHC which incorporates fairness into both the reward design and search to learn human-compatible negotiation strategies. Our method includes a novel, RL+search technique called LGM-Zero which leverages a pre-trained language model to retrieve human-compatible offers from large action spaces. Our results show that our method is able to achieve more egalitarian negotiation outcomes and improve negotiation quality.
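
摘要把“公平性(平等主义)”作为一般和博弈中的最优性准则。一个最小草图:在候选报价中选择使“较差一方收益”最大的报价,而不是单方收益最大的报价。报价与收益数值纯属示例:

```python
# Egalitarian offer selection sketch: among candidate offers, pick the one
# maximizing the worse-off party's payoff, versus a selfish baseline that
# maximizes only the agent's payoff. Payoff numbers are illustrative.
offers = {
    "offer_a": (9, 1),   # (agent payoff, counterpart payoff)
    "offer_b": (6, 5),
    "offer_c": (4, 7),
}

egalitarian = max(offers, key=lambda o: min(offers[o]))
selfish = max(offers, key=lambda o: offers[o][0])
print(egalitarian, selfish)
```

这只示意奖励设计中的公平性准则;论文的 FDHC 框架还包括基于预训练语言模型的 RL+search(LGM-Zero)检索兼容人类的报价。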

[AI-79] Input-Dependent Power Usage in GPUs

链接: https://arxiv.org/abs/2409.18324
作者: Theo Gregersen,Pratyush Patel,Esha Choukse
关键词-EN: artificial intelligence, upcoming datacenters, high power demands, boom in artificial, major contributors
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:GPUs are known to be power-hungry, and due to the boom in artificial intelligence, they are currently the major contributors to the high power demands of upcoming datacenters. Most GPU usage in these popular workloads consist of large general matrix-matrix multiplications (GEMMs), which have therefore been optimized to achieve high utilization of hardware resources. In this work, we show that modifying the input data to GEMMs, while maintaining the matrix shapes and sizes can notably change the power consumption of these kernels. We experiment with four kinds of input variations: value distribution, bit similarity, placement, and sparsity, across different data types. Our findings indicate that these variations can change the GPU power usage during GEMM by almost 40%. We hypothesize that input-dependent power usage variations occur due to changes in the number of bit flips in the GPUs. We propose leveraging this property through compiler and scheduler optimizations to manage power and reduce energy consumption.
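
论文假设功耗差异源于数据通路上比特翻转次数的变化。下面是一个说明性代理(并非论文的测量方法):统计一串 float32 数值相邻位模式之间的比特翻转数,对比“取值变化的输入”“常数输入”与“全零稀疏输入”:

```python
import struct

# Illustrative proxy for the paper's hypothesis: different GEMM inputs
# cause different numbers of bit flips. We count bit flips between the
# float32 bit patterns of consecutive values in a vector. This is NOT the
# paper's power-measurement methodology, only a sketch of the intuition.
def float_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bit_flips(values):
    total = 0
    prev = float_bits(values[0])
    for v in values[1:]:
        cur = float_bits(v)
        total += bin(prev ^ cur).count("1")
        prev = cur
    return total

uniform = [0.1 * i for i in range(1, 100)]   # varied value distribution
constant = [1.0] * 99                         # identical values
sparse = [0.0] * 99                           # all zeros

print(bit_flips(uniform), bit_flips(constant), bit_flips(sparse))
```

常数与全零输入不产生比特翻转,而取值变化的输入产生大量翻转——这对应论文观察到的高达约 40% 的功耗差异来源假设。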

[AI-80] Cross-Institutional Structured Radiology Reporting for Lung Cancer Screening Using a Dynamic Template-Constrained Large Language Model

链接: https://arxiv.org/abs/2409.18319
作者: Chuang Niu,Parisa Kaviani,Qing Lyu,Mannudeep K. Kalra,Christopher T. Whitlow,Ge Wang
关键词-EN: optimizing clinical workflows, Structured radiology reporting, patient outcomes, advantageous for optimizing, optimizing clinical
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Structured radiology reporting is advantageous for optimizing clinical workflows and patient outcomes. Current LLMs in creating structured reports face the challenges of formatting errors, content hallucinations, and privacy leakage concerns when uploaded to external servers. We aim to develop an enhanced open-source LLM for creating structured and standardized LCS reports from free-text descriptions. After institutional IRB approvals, 5,442 de-identified LCS reports from two institutions were retrospectively analyzed. 500 reports were randomly selected from the two institutions evenly and then manually labeled for evaluation. Two radiologists from the two institutions developed a standardized template including 29 features for lung nodule reporting. We proposed template-constrained decoding to enhance state-of-the-art open-source LLMs, including LLAMA, Qwen, and Mistral. The LLM performance was extensively evaluated in terms of F1 score, confidence interval, McNemar test, and z-test. Based on the structured reports created from the large-scale dataset, a nodule-level retrieval system was prototyped and an automatic statistical analysis was performed. Our software, vLLM-structure, is publicly available for local deployment with enhanced LLMs. Our template-constrained decoding approach consistently enhanced the LLM performance on multi-institutional datasets, with neither formatting errors nor content hallucinations. Our method improved the best open-source LLAMA-3.1 405B by up to 10.42%, and outperformed GPT-4o by 17.19%. A novel nodule retrieval system was successfully prototyped and demonstrated on a large-scale multimodal database using our enhanced LLM technologies. The automatically derived statistical distributions were closely consistent with the prior findings in terms of nodule type, location, size, status, and Lung-RADS.
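
“模板约束解码”的思想可以用一个极简草图示意:生成器只能向固定模板的字段填值,因而格式错误在构造上就不可能发生。论文的模板含 29 个特征且解码机制远比这里复杂;下面的字段名与示例文本均为假设:

```python
# Sketch of template-constrained generation: output can only be a dict
# over fixed template fields, so formatting errors are impossible by
# construction. Field names and values below are hypothetical.
TEMPLATE_FIELDS = ["nodule_type", "location", "size_mm", "lung_rads"]

def constrained_report(fill_value):
    """fill_value(field) -> string; only template fields can appear."""
    return {field: fill_value(field) for field in TEMPLATE_FIELDS}

# A stand-in for the LLM: look each field up in a free-text description.
description = {"nodule_type": "solid", "location": "right upper lobe",
               "size_mm": "6", "lung_rads": "3"}
report = constrained_report(lambda f: description.get(f, "not stated"))
print(report)
```

无论填值函数(真实系统中是 LLM)输出什么,报告的键集合始终与模板一致。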

[AI-81] Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation

链接: https://arxiv.org/abs/2409.18313
作者: Quanting Xie,So Yeon Min,Tianyi Zhang,Aarav Bajaj,Ruslan Salakhutdinov,Matthew Johnson-Roberson,Yonatan Bisk
关键词-EN: searchable and actionable, robot might explore, retrieval augmented generation, cs.RO, knowledge
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Web: this https URL

点击查看摘要

Abstract:There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval augmented generation (RAG) has become the workhouse of large-scale non-parametric knowledge, however existing techniques do not directly transfer to the embodied domain, which is multimodal, data is highly correlated, and perception requires abstraction. To address these challenges, we introduce Embodied-RAG, a framework that enhances the foundational model of an embodied agent with a non-parametric memory system capable of autonomously constructing hierarchical knowledge for both navigation and language generation. Embodied-RAG handles a full range of spatial and semantic resolutions across diverse environments and query types, whether for a specific object or a holistic description of ambiance. At its core, Embodied-RAG’s memory is structured as a semantic forest, storing language descriptions at varying levels of detail. This hierarchical organization allows the system to efficiently generate context-sensitive outputs across different robotic platforms. We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 200 explanation and navigation queries across 19 environments, highlighting its promise for general-purpose non-parametric system for embodied agents.

[AI-82] Harnessing Wavelet Transformations for Generalizable Deepfake Forgery Detection

链接: https://arxiv.org/abs/2409.18301
作者: Lalith Bharadwaj Baru,Shilhora Akshay Patel,Rohit Boddeda
关键词-EN: digital image manipulation, significantly challenges existing, deep generative models, challenges existing deepfake, significantly challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The evolution of digital image manipulation, particularly with the advancement of deep generative models, significantly challenges existing deepfake detection methods, especially when the origin of the deepfake is obscure. To tackle the increasing complexity of these forgeries, we propose Wavelet-CLIP, a deepfake detection framework that integrates wavelet transforms with features derived from the ViT-L/14 architecture, pre-trained in the CLIP fashion. Wavelet-CLIP utilizes Wavelet Transforms to deeply analyze both spatial and frequency features from images, thus enhancing the model’s capability to detect sophisticated deepfakes. To verify the effectiveness of our approach, we conducted extensive evaluations against existing state-of-the-art methods for cross-dataset generalization and detection of unseen images generated by standard diffusion models. Our method showcases outstanding performance, achieving an average AUC of 0.749 for cross-data generalization and 0.893 for robustness against unseen deepfakes, outperforming all compared methods. The code can be reproduced from the repo: this https URL
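
Wavelet-CLIP 依赖的“空间/频率分解”可以用一维 Haar 小波的一层变换示意(论文在图像上做二维变换并结合 CLIP ViT-L/14 特征,这里的一维纯 Python 版本只为展示分解思想):

```python
import math

# One level of a 1-D Haar wavelet transform: split a signal into a
# low-pass band (pairwise averages) and a high-pass band (pairwise
# differences). Wavelet-CLIP applies 2-D transforms to images; this
# minimal 1-D version only illustrates the decomposition idea.
def haar_step(signal):
    lo = [(signal[i] + signal[i + 1]) / math.sqrt(2)
          for i in range(0, len(signal), 2)]
    hi = [(signal[i] - signal[i + 1]) / math.sqrt(2)
          for i in range(0, len(signal), 2)]
    return lo, hi

lo, hi = haar_step([4.0, 4.0, 6.0, 2.0])
print(lo, hi)  # smooth pair -> zero detail; varying pair -> nonzero detail
```

高频(detail)子带对生成模型留下的频域伪影较敏感,这正是把小波特征用于深伪检测的动机。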

[AI-83] SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

链接: https://arxiv.org/abs/2409.18300
作者: Ruiqi Xian,Xiyang Wu,Tianrui Guan,Xijun Wang,Boqing Gong,Dinesh Manocha
关键词-EN: Unmanned Aerial Vehicles, aerial footage captured, Aerial Vehicles, Unmanned Aerial, captured by Unmanned
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We introduce SOAR, a novel Self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone, outperforms best UAV action recognition models, recording a 9.7% and 21.4% boost in top-1 accuracy on the NEC-Drone and UAV-Human datasets, while delivering an inference speed of 18.7ms per video, making it 2x to 5x faster. Additionally, SOAR obtains comparable accuracy to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage

[AI-84] FlatnFold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation

链接: https://arxiv.org/abs/2409.18297
作者: Lipeng Zhuang,Shiyu Fan,Yingdong Ru,Florent Audonnet,Paul Henderson,Gerardo Aragon-Camarasa
关键词-EN: addresses critical gaps, addresses critical, critical gaps, large-scale dataset, dataset
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present Flat’n’Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. Comprising 1,212 human and 887 robot demonstrations of flattening and folding 44 unique garments across 8 categories, Flat’n’Fold surpasses prior datasets in size, scope, and diversity. Our dataset uniquely captures the entire manipulation process from crumpled to folded states, providing synchronized multi-view RGB-D images, point clouds, and action data, including hand or gripper positions and rotations. We quantify the dataset’s diversity and complexity compared to existing benchmarks and show that our dataset features natural and diverse manipulations of real-world demonstrations of human and robot demonstrations in terms of visual and action information. To showcase Flat’n’Fold’s utility, we establish new benchmarks for grasping point prediction and subtask decomposition. Our evaluation of state-of-the-art models on these tasks reveals significant room for improvement. This underscores Flat’n’Fold’s potential to drive advances in robotic perception and manipulation of deformable objects. Our dataset can be downloaded at this https URL

[AI-85] Enhancing Lossy Compression Through Cross-Field Information for Scientific Applications

链接: https://arxiv.org/abs/2409.18295
作者: Youyuan Liu,Wenqi Jia,Taolue Yang,Miao Yin,Sian Jin
关键词-EN: effective methods, methods for reducing, reducing the size, data, multiple data fields
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 9 pages, 9 figures, accepted by DRBSD-10

点击查看摘要

Abstract:Lossy compression is one of the most effective methods for reducing the size of scientific data containing multiple data fields. It reduces information density through prediction or transformation techniques to compress the data. Previous approaches use local information from a single target field when predicting target data points, limiting their potential to achieve higher compression ratios. In this paper, we identified significant cross-field correlations within scientific datasets. We propose a novel hybrid prediction model that utilizes CNN to extract cross-field information and combine it with existing local field information. Our solution enhances the prediction accuracy of lossy compressors, leading to improved compression ratios without compromising data quality. We evaluate our solution on three scientific datasets, demonstrating its ability to improve compression ratios by up to 25% under specific error bounds. Additionally, our solution preserves more data details and reduces artifacts compared to baseline approaches.
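
基于预测的误差受控有损压缩、以及“跨字段信息参与预测”这两点,可以用一个极简草图示意:用目标字段的前一个值与关联字段的同位置值做线性混合预测,只存储量化后的残差。论文用 CNN 提取跨字段信息,这里的线性混合纯属示意:

```python
# Minimal sketch of prediction-based lossy compression with cross-field
# information: predict each point from the previous reconstructed value of
# the target field blended with the co-located value of a correlated
# auxiliary field, then store quantized residuals within an error bound.
# The paper uses a CNN for the cross-field part; this linear blend (alpha)
# is purely illustrative.
def compress(target, aux, error_bound, alpha=0.5):
    quanta, prev = [], 0.0
    for t, a in zip(target, aux):
        pred = alpha * prev + (1 - alpha) * a        # hybrid prediction
        q = round((t - pred) / (2 * error_bound))    # quantized residual
        quanta.append(q)
        prev = pred + q * 2 * error_bound            # reconstructed value

    return quanta

def decompress(quanta, aux, error_bound, alpha=0.5):
    out, prev = [], 0.0
    for q, a in zip(quanta, aux):
        pred = alpha * prev + (1 - alpha) * a
        val = pred + q * 2 * error_bound
        out.append(val)
        prev = val
    return out

target = [1.0, 1.2, 1.4, 1.6]
aux = [1.1, 1.3, 1.5, 1.7]        # a correlated field
eb = 0.05
rec = decompress(compress(target, aux, eb), aux, eb)
print(max(abs(t - r) for t, r in zip(target, rec)))
```

预测越准,残差越小、熵越低,压缩比越高——跨字段相关性正是在这一步提升预测精度。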

[AI-86] Retrospective Comparative Analysis of Prostate Cancer In-Basket Messages: Responses from Closed-Domain LLM vs. Clinical Teams

链接: https://arxiv.org/abs/2409.18290
作者: Yuexing Hao,Jason M. Holmes,Jared Hobson,Alexandra Bennett,Daniel K. Ebner,David M. Routman,Satomi Shiraishi,Samir H. Patel,Nathan Y. Yu,Chris L. Hallemeier,Brooke E. Ball,Mark R. Waddle,Wei Liu
关键词-EN: patient care journey, physician-patient communication, Large Language Model, play a crucial, crucial role
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:In-basket message interactions play a crucial role in physician-patient communication, occurring during all phases (pre-, during, and post) of a patient’s care journey. However, responding to these patients’ inquiries has become a significant burden on healthcare workflows, consuming considerable time for clinical care teams. To address this, we introduce RadOnc-GPT, a specialized Large Language Model (LLM) powered by GPT-4 that has been designed with a focus on radiotherapeutic treatment of prostate cancer with advanced prompt engineering, and specifically designed to assist in generating responses. We integrated RadOnc-GPT with patient electronic health records (EHR) from both the hospital-wide EHR database and an internal, radiation-oncology-specific database. RadOnc-GPT was evaluated on 158 previously recorded in-basket message interactions. Quantitative natural language processing (NLP) analysis and two grading studies with clinicians and nurses were used to assess RadOnc-GPT’s responses. Our findings indicate that RadOnc-GPT slightly outperformed the clinical care team in “Clarity” and “Empathy,” while achieving comparable scores in “Completeness” and “Correctness.” RadOnc-GPT is estimated to save 5.2 minutes per message for nurses and 2.4 minutes for clinicians, from reading the inquiry to sending the response. Employing RadOnc-GPT for in-basket message draft generation has the potential to alleviate the workload of clinical care teams and reduce healthcare costs by producing high-quality, timely responses.

[AI-87] Criticality and Safety Margins for Reinforcement Learning

链接: https://arxiv.org/abs/2409.18289
作者: Alexander Grushin,Walt Woods,Alvaro Velasquez,Simon Khan
关键词-EN: art reinforcement learning, reinforcement learning methods, encounter unsafe situations, art reinforcement, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 17 pages, 10 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:State of the art reinforcement learning methods sometimes encounter unsafe situations. Identifying when these situations occur is of interest both for post-hoc analysis and during deployment, where it might be advantageous to call out to a human overseer for help. Efforts to gauge the criticality of different points in time have been developed, but their accuracy is not well established due to a lack of ground truth, and they are not designed to be easily interpretable by end users. Therefore, we seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality. Safety margins make these interpretable, when defined as the number of random actions for which performance loss will not exceed some tolerance with high confidence. We demonstrate this approach in several environment-agent combinations; for an A3C agent in an Atari Beamrider environment, the lowest 5% of safety margins contain 47% of agent losses; i.e., supervising only 5% of decisions could potentially prevent roughly half of an agent’s errors. This criticality framework measures the potential impacts of bad decisions, even before those decisions are made, allowing for more effective debugging and oversight of autonomous agents.
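
摘要给出的“真实关键度”定义——智能体偏离策略执行 n 个连续随机动作时的期望回报下降——可以直接用蒙特卡洛估计。下面用一个刻意简化的确定性环境与策略做草图(环境、策略均为示例,非论文的 Atari 设置):

```python
import random

# Monte-Carlo estimate of the paper's "true criticality": the expected
# drop in return when the agent deviates from its policy for n consecutive
# random actions. The toy environment and policy are illustrative.
def rollout(env_step, policy, state, horizon, deviate_for=0, rng=None):
    total = 0.0
    for t in range(horizon):
        action = rng.choice([-1, +1]) if t < deviate_for else policy(state)
        state, reward = env_step(state, action)
        total += reward
    return total

def true_criticality(env_step, policy, state, horizon, n, trials=2000, seed=0):
    rng = random.Random(seed)
    base = rollout(env_step, policy, state, horizon)
    drops = [base - rollout(env_step, policy, state, horizon, n, rng)
             for _ in range(trials)]
    return sum(drops) / len(drops)

# Trivial environment: moving right (+1) earns reward 1, left earns 0.
def step(state, action):
    return state + action, 1.0 if action == +1 else 0.0

policy = lambda s: +1   # the trained policy always moves right

# One random action loses reward 1 half the time, so criticality ~ 0.5.
print(true_criticality(step, policy, 0, horizon=10, n=1))
```

论文在此之上再定义低开销的代理关键度与“安全边际”(在高置信度下性能损失不超过容忍值的随机动作步数),用于决定何时呼叫人工监督。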

[AI-88] Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing

链接: https://arxiv.org/abs/2409.18286
作者: Huthaifa I. Ashqar,Ahmed Jaber,Taqwa I. Alhadidi,Mohammed Elhenawy
关键词-EN: Large Vision Models, large language models, multimodal large language, Vision Models, Large Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This study aims to comprehensively review and empirically evaluate the application of multimodal large language models (MLLMs) and Large Vision Models (VLMs) in object detection for transportation systems. In the first fold, we provide a background about the potential benefits of MLLMs in transportation applications and conduct a comprehensive review of current MLLM technologies in previous studies. We highlight their effectiveness and limitations in object detection within various transportation scenarios. The second fold involves providing an overview of the taxonomy of end-to-end object detection in transportation applications and future directions. Building on this, we proposed empirical analysis for testing MLLMs on three real-world transportation problems that include object detection tasks namely, road safety attributes extraction, safety-critical event detection, and visual reasoning of thermal images. Our findings provide a detailed assessment of MLLM performance, uncovering both strengths and areas for improvement. Finally, we discuss practical limitations and challenges of MLLMs in enhancing object detection in transportation, thereby offering a roadmap for future research and development in this critical area.

[AI-89] Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation ECCV2024

链接: https://arxiv.org/abs/2409.18261
作者: Mengchen Zhang,Tong Wu,Tai Wang,Tengfei Wang,Ziwei Liu,Dahua Lin
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV 2024 (poster). Github page: this https URL

点击查看摘要

[AI-90] PCEvE: Part Contribution Evaluation Based Model Explanation for Human Figure Drawing Assessment and Beyond

链接: https://arxiv.org/abs/2409.18260
作者: Jongseo Lee,Geo Ahn,Jinwoo Choi,Seongtae Kim
关键词-EN: human figure drawing, autism spectrum disorder, automatic human figure, diagnosing autism spectrum, figure drawing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:For automatic human figure drawing (HFD) assessment tasks, such as diagnosing autism spectrum disorder (ASD) using HFD images, the clarity and explainability of a model decision are crucial. Existing pixel-level attribution-based explainable AI (XAI) approaches demand considerable effort from users to interpret the semantic information of a region in an image, which can be often time-consuming and impractical. To overcome this challenge, we propose a part contribution evaluation based model explanation (PCEvE) framework. On top of the part detection, we measure the Shapley Value of each individual part to evaluate the contribution to a model decision. Unlike existing attribution-based XAI approaches, the PCEvE provides a straightforward explanation of a model decision, i.e., a part contribution histogram. Furthermore, the PCEvE expands the scope of explanations beyond the conventional sample-level to include class-level and task-level insights, offering a richer, more comprehensive understanding of model behavior. We rigorously validate the PCEvE via extensive experiments on multiple HFD assessment datasets. Also, we sanity-check the proposed method with a set of controlled experiments. Additionally, we demonstrate the versatility and applicability of our method to other domains by applying it to a photo-realistic dataset, the Stanford Cars.
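The part-contribution idea above is the classical Shapley value applied to a set of detected parts. A minimal sketch of the exact Shapley computation, using a hypothetical two-part drawing and made-up model scores (the paper evaluates a real drawing-assessment model over many parts):

```python
from itertools import combinations
from math import factorial

def shapley_values(parts, value_fn):
    """Exact Shapley value of each part: its marginal contribution to
    value_fn, averaged over all coalitions with the standard weights."""
    n = len(parts)
    phi = {}
    for p in parts:
        others = [q for q in parts if q != p]
        total = 0.0
        for r in range(n):
            for coalition in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                with_p = value_fn(frozenset(coalition) | {p})
                without_p = value_fn(frozenset(coalition))
                total += weight * (with_p - without_p)
        phi[p] = total
    return phi

# Hypothetical model scores when only the listed parts are visible;
# in PCEvE, value_fn would query the actual HFD assessment model.
score = {
    frozenset(): 0.0,
    frozenset({"head"}): 0.5,
    frozenset({"arms"}): 0.2,
    frozenset({"head", "arms"}): 0.9,
}
phi = shapley_values(["head", "arms"], lambda s: score[frozenset(s)])
```

The per-part values in `phi` are exactly the bars of the "part contribution histogram" described in the abstract, and they satisfy the efficiency property (they sum to the score of the full drawing).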

[AI-91] Trustworthy AI: Securing Sensitive Data in Large Language Models

链接: https://arxiv.org/abs/2409.18222
作者: Georgios Feretzakis,Vassilios S. Verykios
关键词-EN: Large Language Models, natural language processing, transformed natural language, Large Language, Language Models
类目: Artificial Intelligence (cs.AI)
*备注: 40 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed natural language processing (NLP) by enabling robust text generation and understanding. However, their deployment in sensitive domains like healthcare, finance, and legal services raises critical concerns about privacy and data security. This paper proposes a comprehensive framework for embedding trust mechanisms into LLMs to dynamically control the disclosure of sensitive information. The framework integrates three core components: User Trust Profiling, Information Sensitivity Detection, and Adaptive Output Control. By leveraging techniques such as Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC), Named Entity Recognition (NER), contextual analysis, and privacy-preserving methods like differential privacy, the system ensures that sensitive information is disclosed appropriately based on the user’s trust level. By focusing on balancing data utility and privacy, the proposed solution offers a novel approach to securely deploying LLMs in high-risk environments. Future work will focus on testing this framework across various domains to evaluate its effectiveness in managing sensitive data while maintaining system efficiency.
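The Adaptive Output Control component can be pictured as: detect sensitive entities, compare their sensitivity tier against the user's trust level, and redact anything above it. A toy sketch with hypothetical regex-based detectors standing in for the paper's NER and trust-profiling machinery (sensitivity tiers and trust levels below are invented for illustration):

```python
import re

# Hypothetical entity detectors with sensitivity tiers (higher = more
# sensitive). The framework in the paper uses NER and contextual
# analysis rather than fixed regexes.
PATTERNS = {
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 3),
    "email": (re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"), 2),
}

def adaptive_redact(text, user_trust_level):
    """Redact any detected entity whose sensitivity tier exceeds the
    user's trust level, mirroring the Adaptive Output Control idea."""
    for label, (pattern, sensitivity) in PATTERNS.items():
        if sensitivity > user_trust_level:
            text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text
```

A trust-level-2 user would see emails but not SSNs; a trust-level-3 user would see both, which is the "dynamic disclosure" behaviour the abstract describes.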

[AI-92] MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark

链接: https://arxiv.org/abs/2409.18216
作者: Elliot L. Epstein,Kaisheng Yao,Jing Li,Xinyi Bai,Hamid Palangi
关键词-EN: Evaluating instruction, instructions, operatorname, PIF, capabilities for multimodal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 24 pages, 16 figures

点击查看摘要

Abstract:Evaluating instruction following capabilities for multimodal, multi-turn dialogue is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters and we show LLM based judges are biased towards answers from the same model. We propose MMMT-IF, an image based multi-turn Q&A evaluation set with added global instructions between questions, constraining the answer format. This challenges models to retrieve instructions dispersed across long dialogues and reason under instruction constraints. All instructions are objectively verifiable through code execution. We introduce the Programmatic Instruction Following (PIF) metric to measure the fraction of the instructions that are correctly followed while performing a reasoning task. The PIF-N-K set of metrics further evaluates robustness by measuring the fraction of samples in a corpus where, for each sample, at least K out of N generated model responses achieve a PIF score of one. The PIF metric aligns with human instruction following ratings, showing 60 percent correlation. Experiments show Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet have a PIF metric that drops from 0.81 on average at turn 1 across the models, to 0.64 at turn 20. Across all turns, when each response is repeated 4 times (PIF-4-4), GPT-4o and Gemini successfully follow all instructions only 11% of the time. When all the instructions are also appended to the end of the model input context, the PIF metric improves by 22.3 points on average, showing that the challenge with the task lies not only in following the instructions, but also in retrieving the instructions spread out in the model context. We plan to open source the MMMT-IF dataset and metric computation code.
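As defined in the abstract, PIF and PIF-N-K reduce to simple counting; a minimal sketch with toy numbers (not the authors' released code):

```python
from typing import List

def pif(followed: List[bool]) -> float:
    """PIF: fraction of instructions correctly followed in one response."""
    return sum(followed) / len(followed) if followed else 0.0

def pif_n_k(per_sample_pif: List[List[float]], k: int) -> float:
    """PIF-N-K: fraction of samples for which at least K of the N
    generated responses achieve a PIF score of exactly one."""
    hits = sum(
        1 for scores in per_sample_pif
        if sum(1 for s in scores if s == 1.0) >= k
    )
    return hits / len(per_sample_pif)

# Toy corpus: 3 samples, N=4 responses each (numbers are illustrative).
corpus = [
    [1.0, 1.0, 1.0, 1.0],  # every response follows all instructions
    [1.0, 0.5, 1.0, 1.0],  # one response misses half the instructions
    [1.0, 1.0, 1.0, 1.0],
]
```

With these toy numbers, PIF-4-4 (all four responses perfect) is satisfied by two of the three samples, illustrating why the strict all-of-N criterion is so much harder than per-response PIF.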

[AI-93] AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking

链接: https://arxiv.org/abs/2409.18203
作者: Michelle S. Lam,Fred Hohman,Dominik Moritz,Jeffrey P. Bigham,Kenneth Holstein,Mary Beth Kery
关键词-EN: large language model, implicit reward model, large language, explicit constitution, implicit reward
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Whether a large language model policy is an explicit constitution or an implicit reward model, it is challenging to assess coverage over the unbounded set of real-world situations that a policy must contend with. We introduce an AI policy design process inspired by mapmaking, which has developed tactics for visualizing and iterating on maps even when full coverage is not possible. With Policy Projector, policy designers can survey the landscape of model input-output pairs, define custom regions (e.g., “violence”), and navigate these regions with rules that can be applied to LLM outputs (e.g., if output contains “violence” and “graphic details,” then rewrite without “graphic details”). Policy Projector supports interactive policy authoring using LLM classification and steering and a map visualization reflecting the policy designer’s work. In an evaluation with 12 AI safety experts, our system helps policy designers to address problematic model behaviors extending beyond an existing, comprehensive harm taxonomy.
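The rule form in the abstract ("if output contains 'violence' and 'graphic details,' then rewrite without 'graphic details'") can be sketched as label sets paired with rewrite actions. The classifier below is a trivial substring check standing in for the system's LLM-based classification; the rule and its action are illustrative, not from the paper:

```python
def apply_policy(output, rules, classify):
    """Apply each region rule: when the classifier fires on every label
    in the rule's label set, run the rule's rewrite action."""
    for labels, action in rules:
        if all(classify(output, label) for label in labels):
            output = action(output)
    return output

# Toy rule mirroring the abstract's example: if an output contains both
# "violence" and "graphic details", rewrite it without the graphic details.
rules = [
    ({"violence", "graphic details"},
     lambda text: text.replace("graphic details", "[details removed]")),
]
classify = lambda text, label: label in text
```

In Policy Projector the classification and rewriting are both LLM-driven and the regions are authored interactively; this sketch only shows the rule-evaluation shape.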

[AI-94] Autonomous Network Defence using Reinforcement Learning

链接: https://arxiv.org/abs/2409.18197
作者: Myles Foley,Chris Hicks,Kate Highnam,Vasilios Mavroudis
关键词-EN: security arms race, network security arms, arms race, security arms, defender is significantly
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the network security arms race, the defender is significantly disadvantaged as they need to successfully detect and counter every malicious attack. In contrast, the attacker needs to succeed only once. To level the playing field, we investigate the effectiveness of autonomous agents in a realistic network defence scenario. We first outline the problem, provide the background on reinforcement learning and detail our proposed agent design. Using a network environment simulation, with 13 hosts spanning 3 subnets, we train a novel reinforcement learning agent and show that it can reliably defend continual attacks by two advanced persistent threat (APT) red agents: one with complete knowledge of the network layout and another which must discover resources through exploration but is more general.

[AI-95] Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review

链接: https://arxiv.org/abs/2409.18170
作者: Emma Croxford,Yanjun Gao,Nicholas Pellegrino,Karen K. Wong,Graham Wills,Elliot First,Frank J. Liao,Cherodeep Goswami,Brian Patterson,Majid Afshar
关键词-EN: Large Language Models, Natural Language Generation, clinical Natural Language, Large Language, Language Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models have advanced clinical Natural Language Generation, creating opportunities to manage the volume of medical text. However, the high-stakes nature of medicine requires reliable evaluation, which remains a challenge. In this narrative review, we assess the current evaluation state for clinical summarization tasks and propose future directions to address the resource constraints of expert human evaluation.

[AI-96] Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

链接: https://arxiv.org/abs/2409.18169
作者: Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Ling Liu
关键词-EN: Recent research demonstrates, harmful data uploaded, business model exposes, Recent research, harmful fine-tuning attack
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns – fine-tuning over a few harmful data uploaded by the users can compromise the safety alignment of the model. The attack, known as harmful fine-tuning, has raised a broad research interest among the community. However, as the attack is still new, we observe from our miserable submission experience that there are general misunderstandings within the research community. We in this paper aim to clear some common concerns for the attack setting, and formally establish the research problem. Specifically, we first present the threat model of the problem, and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks/defenses/mechanical analysis of the problem. Finally, we outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which might be useful to refer to when reviewers in the peer review process question the realism of the experiment/attack/defense setting. A curated list of relevant papers is maintained and made accessible at: this https URL.

[AI-97] Data-Prep-Kit: getting your data ready for LLM application development

链接: https://arxiv.org/abs/2409.18164
作者: David Wood,Boris Lublinsky,Alexy Roytman,Shivdeep Singh,Abdulhamid Adebayo,Revital Eres,Mohammad Nassar,Hima Patel,Yousaf Shah,Constantin Adam,Petros Zerfos,Nirmit Desai,Daiki Tsuzuku,Takuya Goto,Michele Dolfi,Saptha Surendran,Paramesvaran Selvam,Sungeun An,Yuan Chi Chang,Dhiraj Joshi,Hajar Emami-Gohari,Xuan-Hong Dang,Yan Koyfman,Shahrokh Daijavad
关键词-EN: Data Prep Kit, DPK, Data preparation, Prep Kit, Data
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Data preparation is the first and a very important step towards any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU Cores. DPK comes with a highly scalable, yet extensible set of modules that transform natural language and code data. If the user needs additional transforms, they can be easily developed using extensive DPK support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLM models or to fine-tune models with Retrieval-Augmented Generation (RAG).

[AI-98] A Survey on Neural Architecture Search Based on Reinforcement Learning

链接: https://arxiv.org/abs/2409.18163
作者: Wenzhu Shao
关键词-EN: Neural Architecture Search, Architecture Search, Neural Architecture, feature extraction, extraction of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The automation of feature extraction in machine learning has been successfully realized by the explosive development of deep learning. However, the structures and hyperparameters of deep neural network architectures also make a huge difference in performance across different tasks. The process of exploring optimal structures and hyperparameters often involves a lot of tedious human intervention. As a result, a legitimate question is to ask for the automation of searching for optimal network structures and hyperparameters. The automation of exploring optimal hyperparameters is done by Hyperparameter Optimization. Neural Architecture Search aims to automatically find the best network structure for a given task. In this paper, we first introduce the overall development of Neural Architecture Search and then focus mainly on providing an overall and understandable survey of Neural Architecture Search works that are relevant to reinforcement learning, including improvements and variants motivated by the hope of handling more complex structures and resource-insufficient environments.

[AI-99] The Nexus of AR/VR Large Language Models UI/UX and Robotics Technologies in Enhancing Learning and Social Interaction for Children: A Systematic Review

链接: https://arxiv.org/abs/2409.18162
作者: Biplov Paneru,Bishwash Paneru
关键词-EN: autism spectrum disorder, large language models, user interface, user experience, augmented reality
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: none

点击查看摘要

Abstract:The combination of large language models (LLMs), augmented reality (AR), and user interface/user experience (UI/UX) design in therapies for children, especially with disorders like autism spectrum disorder (ASD), is examined in this review study. 150 publications were found by a thorough literature search throughout PubMed, ACM, IEEE Xplore, Elsevier, and Google Scholar; 42 of them were chosen for in-depth study due to their methodological rigor and relevance. Three primary areas are covered in this review: how AR can improve social and learning results; how LLMs can help with communication; and how UI/UX design affects how effective these technologies are. Results reveal that while LLMs can provide individualized learning and communication support, AR has demonstrated promise in enhancing social skills, motivation, and attention. For children with ASD, accessible and interesting interventions depend heavily on effective UI/UX design. To optimize the benefits of these technologies in ASD therapies, the study emphasizes the need for additional research to address difficulties related to customization, accessibility, and integration.

[AI-100] A Survey on Multimodal Benchmarks: In the Era of Large AI Models

链接: https://arxiv.org/abs/2409.18142
作者: Lin Li,Guikun Chen,Hanrong Shi,Jun Xiao,Long Chen
关键词-EN: Multimodal Large Language, Large Language Models, generate multimodal content, Large Language, Multimodal Large
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Ongoing project

点击查看摘要

Abstract:The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial advancements in artificial intelligence, significantly enhancing the capability to understand and generate multimodal content. While prior studies have largely concentrated on model architectures and training methodologies, a thorough analysis of the benchmarks used for evaluating these models remains underexplored. This survey addresses this gap by systematically reviewing 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application. We provide a detailed analysis of task designs, evaluation metrics, and dataset constructions, across diverse modalities. We hope that this survey will contribute to the ongoing advancement of MLLM research by offering a comprehensive overview of benchmarking practices and identifying promising directions for future work. An associated GitHub repository collecting the latest papers is available.

[AI-101] Unconditional stability of a recurrent neural circuit implementing divisive normalization

链接: https://arxiv.org/abs/2409.18946
作者: Shivang Rawat,David J. Heeger,Stefano Martiniani
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

[AI-102] Positional Encoder Graph Quantile Neural Networks for Geographic Data

链接: https://arxiv.org/abs/2409.18865
作者: William E. R. de Amorim,Scott A. Sisson,T. Rodrigues,David J. Nott,Guilherme S. Rodrigues
关键词-EN: Positional Encoder Graph, Encoder Graph Neural, Graph Neural Networks, Graph Quantile Neural, Encoder Graph Quantile
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 17 main text pages, 4 figures

点击查看摘要

Abstract:Positional Encoder Graph Neural Networks (PE-GNNs) are a leading approach for modeling continuous spatial data. However, they often fail to produce calibrated predictive distributions, limiting their effectiveness for uncertainty quantification. We introduce the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel method that integrates PE-GNNs, Quantile Neural Networks, and recalibration techniques in a fully nonparametric framework, requiring minimal assumptions about the predictive distributions. We propose a new network architecture that, when combined with a quantile-based loss function, yields accurate and reliable probabilistic models without increasing computational complexity. Our approach provides a flexible, robust framework for conditional density estimation, applicable beyond spatial data contexts. We further introduce a structured method for incorporating a KNN predictor into the model while avoiding data leakage through the GNN layer operation. Experiments on benchmark datasets demonstrate that PE-GQNN significantly outperforms existing state-of-the-art methods in both predictive accuracy and uncertainty quantification.
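The "quantile-based loss function" mentioned above is, in its standard form, the pinball loss: asymmetric penalties that make the minimizer the tau-quantile rather than the mean. A sketch of that standard loss (the paper's contribution is the architecture it is combined with, not this loss itself):

```python
def pinball_loss(y_true: float, y_pred: float, tau: float) -> float:
    """Quantile (pinball) loss at level tau: under-prediction is
    penalized by tau, over-prediction by (1 - tau)."""
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1.0) * diff)

def mean_pinball(ys, preds, tau):
    """Average pinball loss over a batch of (truth, prediction) pairs."""
    return sum(pinball_loss(y, p, tau) for y, p in zip(ys, preds)) / len(ys)
```

At tau = 0.9, under-predicting by 1 costs 0.9 while over-predicting by 1 costs only 0.1, which is why minimizing this loss pushes predictions toward the 90th percentile of the conditional distribution.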

[AI-103] MECG-E: Mamba-based ECG Enhancer for Baseline Wander Removal

链接: https://arxiv.org/abs/2409.18828
作者: Kuo-Hsuan Hung,Kuan-Chen Wang,Kai-Chun Liu,Wei-Lun Chen,Xugang Lu,Yu Tsao,Chii-Wann Lin
关键词-EN: diagnosing cardiovascular disease, important non-invasive method, cardiovascular disease, Mamba-based ECG Enhancer, important non-invasive
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Electrocardiogram (ECG) is an important non-invasive method for diagnosing cardiovascular disease. However, ECG signals are susceptible to noise contamination, such as electrical interference or signal wandering, which reduces diagnostic accuracy. Various ECG denoising methods have been proposed, but most existing methods yield suboptimal performance under very noisy conditions or require several steps during inference, leading to latency during online processing. In this paper, we propose a novel ECG denoising model, namely Mamba-based ECG Enhancer (MECG-E), which leverages the Mamba architecture known for its fast inference and outstanding nonlinear mapping capabilities. Experimental results indicate that MECG-E surpasses several well-known existing models across multiple metrics under different noise conditions. Additionally, MECG-E requires less inference time than state-of-the-art diffusion-based ECG denoisers, demonstrating the model’s functionality and efficiency.

[AI-104] Early diagnosis of Alzheimers disease from MRI images with deep learning model

链接: https://arxiv.org/abs/2409.18814
作者: Sajjad Aghasi Javid,Mahmood Mohassel Feghhi
关键词-EN: worldwide is Alzheimer, Alzheimer disease, Minority Oversampling Technique, Alzheimer, Synthetic Minority Oversampling
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, Presented at the 20-th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP) 21-22 February, 2024, Mazandaran University of Science and Technology, Babol, Iran

点击查看摘要

Abstract:It is acknowledged that the most common cause of dementia worldwide is Alzheimer’s disease (AD). This condition progresses in severity from mild to severe and interferes with people’s everyday routines. Early diagnosis plays a critical role in patient care and clinical trials. Convolutional neural networks (CNNs) are used to create a framework for identifying specific disease features from MRI scans. Classification of dementia involves approaches such as medical history review, neuropsychological tests, and magnetic resonance imaging (MRI). However, the image dataset obtained from Kaggle faces a significant issue of class imbalance, which requires equal distribution of samples from each class to address. In this article, to address this imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is utilized. Furthermore, a pre-trained convolutional neural network has been applied to the DEMNET dementia network to extract key features from AD images. The proposed model achieved an impressive accuracy of 98.67%.
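SMOTE balances classes by interpolating synthetic minority samples between existing ones. A minimal dependency-free sketch of that interpolation idea (toy 2-D points; the paper applies SMOTE to image-derived features):

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: repeatedly pick a minority sample, pick one
    of its k nearest minority neighbours, and interpolate a synthetic
    point at a random position on the segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, y)), j)
            for j, y in enumerate(minority)
            if j != i
        )[:k]
        _, j = neighbours[rng.randrange(len(neighbours))]
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, minority[j])))
    return synthetic
```

In practice the `SMOTE` class from the imbalanced-learn package handles multi-class data and efficient neighbour search; the sketch above only illustrates why every synthetic point stays inside the minority class's local neighbourhood.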

[AI-105] Multi-modal Medical Image Fusion For Non-Small Cell Lung Cancer Classification

链接: https://arxiv.org/abs/2409.18715
作者: Salma Hassan,Hamad Al Hammadi,Ibrahim Mohammed,Muhammad Haris Khan
关键词-EN: cancer mortality worldwide, nuanced subtype classification, non-small cell lung, mortality worldwide, complex issue
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The early detection and nuanced subtype classification of non-small cell lung cancer (NSCLC), a predominant cause of cancer mortality worldwide, is a critical and complex issue. In this paper, we introduce an innovative integration of multi-modal data, synthesizing fused medical imaging (CT and PET scans) with clinical health records and genomic data. This unique fusion methodology leverages advanced machine learning models, notably MedClip and BEiT, for sophisticated image feature extraction, setting a new standard in computational oncology. Our research surpasses existing approaches, as evidenced by a substantial enhancement in NSCLC detection and classification precision. The results showcase notable improvements across key performance metrics, including accuracy, precision, recall, and F1-score. Specifically, our leading multi-modal classifier model records an impressive accuracy of 94.04%. We believe that our approach has the potential to transform NSCLC diagnostics, facilitating earlier detection and more effective treatment planning and, ultimately, leading to superior patient outcomes in lung cancer care.

[AI-106] Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds INTERSPEECH2024

链接: https://arxiv.org/abs/2409.18705
作者: Hanbin Bae,Pavel Andreev,Azat Saginbaev,Nicholas Babaev,Won-Jun Lee,Hosang Sung,Hoon-Young Cho
关键词-EN: true wireless stereo, earbuds on-device usage, wireless stereo, enhancement solution tailored, paper introduces
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
*备注: Accepted by Interspeech 2024

点击查看摘要

Abstract:This paper introduces a speech enhancement solution tailored for true wireless stereo (TWS) earbuds on-device usage. The solution was specifically designed to support conversations in noisy environments, with active noise cancellation (ANC) activated. The primary challenges for speech enhancement models in this context arise from computational complexity that limits on-device usage and latency that must be less than 3 ms to preserve a live conversation. To address these issues, we evaluated several crucial design elements, including the network architecture and domain, design of loss functions, pruning method, and hardware-specific optimization. Consequently, we demonstrated substantial improvements in speech enhancement quality compared with that in baseline models, while simultaneously reducing the computational complexity and algorithmic latency.

[AI-107] MG-Net: Learn to Customize QAOA with Circuit Depth Awareness

链接: https://arxiv.org/abs/2409.18692
作者: Yang Qian,Xinbiao Wang,Yuxuan Du,Yong Luo,Dacheng Tao
关键词-EN: Approximate Optimization Algorithm, Quantum Approximate Optimization, combinatorial optimization challenges, tackling combinatorial optimization, variants exhibit immense
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 29 pages, 16 figures

点击查看摘要

Abstract:Quantum Approximate Optimization Algorithm (QAOA) and its variants exhibit immense potential in tackling combinatorial optimization challenges. However, their practical realization confronts a dilemma: the requisite circuit depth for satisfactory performance is problem-specific and often exceeds the maximum capability of current quantum devices. To address this dilemma, here we first analyze the convergence behavior of QAOA, uncovering the origins of this dilemma and elucidating the intricate relationship between the employed mixer Hamiltonian, the specific problem at hand, and the permissible maximum circuit depth. Harnessing this understanding, we introduce the Mixer Generator Network (MG-Net), a unified deep learning framework adept at dynamically formulating optimal mixer Hamiltonians tailored to distinct tasks and circuit depths. Systematic simulations, encompassing Ising models and weighted Max-Cut instances with up to 64 qubits, substantiate our theoretical findings, highlighting MG-Net’s superior performance in terms of both approximation ratio and efficiency.

[AI-108] Effects of AI Feedback on Learning the Skill Gap and Intellectual Diversity

链接: https://arxiv.org/abs/2409.18660
作者: Christoph Riedl,Eric Bogert
关键词-EN: human decision-makers learn, feedback, seek AI feedback, human decision-makers, decision-makers learn
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Can human decision-makers learn from AI feedback? Using data on 52,000 decision-makers from a large online chess platform, we investigate how their AI use affects three interrelated long-term outcomes: Learning, skill gap, and diversity of decision strategies. First, we show that individuals are far more likely to seek AI feedback in situations in which they experienced success rather than failure. This AI feedback seeking strategy turns out to be detrimental to learning: Feedback on successes decreases future performance, while feedback on failures increases it. Second, higher-skilled decision-makers seek AI feedback more often and are far more likely to seek AI feedback after a failure, and benefit more from AI feedback than lower-skilled individuals. As a result, access to AI feedback increases, rather than decreases, the skill gap between high- and low-skilled individuals. Finally, we leverage 42 major platform updates as natural experiments to show that access to AI feedback causes a decrease in intellectual diversity of the population as individuals tend to specialize in the same areas. Together, those results indicate that learning from AI feedback is not automatic and using AI correctly seems to be a skill itself. Furthermore, despite its individual-level benefits, access to AI feedback can have significant population-level downsides including loss of intellectual diversity and an increasing skill gap.

[AI-109] Quantum Algorithms for Drone Mission Planning

链接: https://arxiv.org/abs/2409.18631
作者: Ethan Davies,Pranav Kalidindi
关键词-EN: Surveillance and Reconnaissance, allowed parameters subject, assets in order, allowed parameters, parameters subject
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 14 pages, 7 figures

点击查看摘要

Abstract:Mission planning often involves optimising the use of ISR (Intelligence, Surveillance and Reconnaissance) assets in order to achieve a set of mission objectives within allowed parameters subject to constraints. The missions of interest here, involve routing multiple UAVs visiting multiple targets, utilising sensors to capture data relating to each target. Finding such solutions is often an NP-Hard problem and cannot be solved efficiently on classical computers. Furthermore, during the mission new constraints and objectives may arise, requiring a new solution to be computed within a short time period. To achieve this we investigate near term quantum algorithms that have the potential to offer speed-ups against current classical methods. We demonstrate how a large family of these problems can be formulated as a Mixed Integer Linear Program (MILP) and then converted to a Quadratic Unconstrained Binary Optimisation (QUBO). The formulation provided is versatile and can be adapted for many different constraints with clear qubit scaling provided. We discuss the results of solving the QUBO formulation using commercial quantum annealers and compare the solutions to current edge classical solvers. We also analyse the results from solving the QUBO using Quantum Approximate Optimisation Algorithms (QAOA) and discuss their results. Finally, we also provide efficient methods to encode to the problem into the Variational Quantum Eigensolver (VQE) formalism, where we have tailored the ansatz to the problem making efficient use of the qubits available.
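The MILP-to-QUBO conversion described above replaces hard linear constraints with quadratic penalty terms. A deliberately tiny sketch, using a one-hot "pick exactly one target" constraint instead of the paper's full multi-UAV routing formulation (costs and penalty weight are invented):

```python
from itertools import product

def one_hot_qubo(costs, penalty):
    """Minimize sum_i c_i * x_i subject to exactly one x_i = 1.
    The constraint becomes the penalty P * (sum_i x_i - 1)^2; expanding
    it with x_i^2 = x_i and dropping the constant +P gives linear
    coefficients (c_i - P) and pairwise coefficients 2P."""
    n = len(costs)
    Q = {}
    for i in range(n):
        Q[(i, i)] = costs[i] - penalty
        for j in range(i + 1, n):
            Q[(i, j)] = 2 * penalty
    return Q

def energy(Q, x):
    """QUBO objective x^T Q x over the upper-triangular coefficients."""
    return sum(coef * x[i] * x[j] for (i, j), coef in Q.items())

def brute_force(Q, n):
    """Exhaustive minimizer, feasible only for toy sizes; a quantum
    annealer or QAOA would take this role at scale."""
    return min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))

best = brute_force(one_hot_qubo([3.0, 1.0, 2.0], penalty=10.0), 3)
```

With a penalty large relative to the costs, the minimum-energy bitstring both satisfies the constraint and picks the cheapest target, which is the essence of the conversion the paper performs for its routing and sensor-tasking variables.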

[AI-110] Towards Integrating Epistemic Uncertainty Estimation into the Radiotherapy Workflow

链接: https://arxiv.org/abs/2409.18628
作者: Marvin Tom Teichmann,Manasi Datar,Lisa Kratzke,Fernando Vega,Florin C. Ghesu
关键词-EN: contouring target structures, ensuring treatment efficacy, epistemic uncertainty estimation, uncertainty estimation, OOD detection
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Keywords: Epistemic Uncertainty - Out-of-Distribution Detection - CT Segmentation - OAR contouring - Radiotherapy

点击查看摘要

Abstract:The precision of contouring target structures and organs-at-risk (OAR) in radiotherapy planning is crucial for ensuring treatment efficacy and patient safety. Recent advancements in deep learning (DL) have significantly improved OAR contouring performance, yet the reliability of these models, especially in the presence of out-of-distribution (OOD) scenarios, remains a concern in clinical settings. This application study explores the integration of epistemic uncertainty estimation within the OAR contouring workflow to enable OOD detection in clinically relevant scenarios, using specifically compiled data. Furthermore, we introduce an advanced statistical method for OOD detection to enhance the methodological framework of uncertainty estimation. Our empirical evaluation demonstrates that epistemic uncertainty estimation is effective in identifying instances where model predictions are unreliable and may require an expert review. Notably, our approach achieves an AUC-ROC of 0.95 for OOD detection, with a specificity of 0.95 and a sensitivity of 0.92 for implant cases, underscoring its efficacy. This study addresses significant gaps in the current research landscape, such as the lack of ground truth for uncertainty estimation and limited empirical evaluations. Additionally, it provides a clinically relevant application of epistemic uncertainty estimation in an FDA-approved and widely used clinical solution for OAR segmentation from Varian, a Siemens Healthineers company, highlighting its practical benefits.
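The abstract does not spell out its estimator; a common way to obtain epistemic uncertainty is disagreement across an ensemble's predictions for the same voxel, sketched below. The threshold and probability values are hypothetical, and the paper additionally applies a statistical OOD test on top of such estimates:

```python
from statistics import pvariance

def epistemic_uncertainty(member_probs):
    """Variance of ensemble members' predicted probabilities for the
    same voxel; high disagreement suggests the model is uncertain for
    reasons of limited knowledge (possible out-of-distribution input)."""
    return pvariance(member_probs)

def flag_for_review(member_probs, threshold=0.01):
    """Route a case to expert review when disagreement exceeds a
    (hypothetical) threshold, mirroring the OOD-detection workflow."""
    return epistemic_uncertainty(member_probs) > threshold
```

Members agreeing closely (e.g. 0.89-0.91) pass through automatically, while strong disagreement triggers the expert-review path the study evaluates with AUC-ROC.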

[AI-111] MIMII-Gen: Generative Modeling Approach for Simulated Evaluation of Anomalous Sound Detection System

链接: https://arxiv.org/abs/2409.18542
作者: Harsh Purohit,Tomoya Nishida,Kota Dohi,Takashi Endo,Yohei Kawaguchi
关键词-EN: present significant challenges, anomalies present significant, Insufficient recordings, present significant, significant challenges
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Insufficient recordings and the scarcity of anomalies present significant challenges in developing and validating robust anomaly detection systems for machine sounds. To address these limitations, we propose a novel approach for generating diverse anomalies in machine sound using a latent diffusion-based model that integrates an encoder-decoder framework. Our method utilizes the Flan-T5 model to encode captions derived from audio file metadata, enabling conditional generation through a carefully designed U-Net architecture. This approach aids our model in generating audio signals within the EnCodec latent space, ensuring high contextual relevance and quality. We objectively evaluated the quality of our generated sounds using the Fréchet Audio Distance (FAD) score and other metrics, demonstrating that our approach surpasses existing models in generating reliable machine audio that closely resembles actual abnormal conditions. The evaluation of the anomaly detection system using our generated data revealed a strong correlation, with the area under the curve (AUC) score differing by 4.8% from the original, validating the effectiveness of our generated data. These results demonstrate the potential of our approach to enhance the evaluation and robustness of anomaly detection systems across varied and previously unseen conditions. Audio samples can be found at this https URL.
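
The Fréchet Audio Distance used for evaluation is, like FID, the Fréchet distance between two Gaussians fitted to embedding sets. A minimal sketch, assuming the audio embeddings (in practice produced by a pretrained model such as VGGish) are already extracted; the random arrays below merely stand in for them:

```python
import numpy as np

def _sqrtm_psd(a):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two (n, d) embedding sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    sqrt_a = _sqrtm_psd(cov_a)
    # Tr((cov_a cov_b)^(1/2)) computed via the symmetric product sqrt_a cov_b sqrt_a
    covmean = _sqrtm_psd(sqrt_a @ cov_b @ sqrt_a)
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))    # stand-in for real-audio embeddings
close = rng.normal(0.05, 1.0, size=(500, 8))  # similar distribution -> small FAD
far = rng.normal(3.0, 1.0, size=(500, 8))     # shifted distribution -> large FAD

assert frechet_distance(real, close) < frechet_distance(real, far)
```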

[AI-112] Adaptive Learning of the Latent Space of Wasserstein Generative Adversarial Networks

链接: https://arxiv.org/abs/2409.18374
作者: Yixuan Qiu,Qingyi Gao,Xiao Wang
关键词-EN: Generative models based, latent Wasserstein GAN, generative adversarial networks, Wasserstein GAN, intrinsic dimension
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Generative models based on latent variables, such as generative adversarial networks (GANs) and variational auto-encoders (VAEs), have gained lots of interests due to their impressive performance in many fields. However, many data such as natural images usually do not populate the ambient Euclidean space but instead reside in a lower-dimensional manifold. Thus an inappropriate choice of the latent dimension fails to uncover the structure of the data, possibly resulting in mismatch of latent representations and poor generative qualities. Towards addressing these problems, we propose a novel framework called the latent Wasserstein GAN (LWGAN) that fuses the Wasserstein auto-encoder and the Wasserstein GAN so that the intrinsic dimension of the data manifold can be adaptively learned by a modified informative latent distribution. We prove that there exist an encoder network and a generator network in such a way that the intrinsic dimension of the learned encoding distribution is equal to the dimension of the data manifold. We theoretically establish that our estimated intrinsic dimension is a consistent estimate of the true dimension of the data manifold. Meanwhile, we provide an upper bound on the generalization error of LWGAN, implying that we force the synthetic data distribution to be similar to the real data distribution from a population perspective. Comprehensive empirical experiments verify our framework and show that LWGAN is able to identify the correct intrinsic dimension under several scenarios, and simultaneously generate high-quality synthetic data by sampling from the learned latent distribution.

[AI-113] DRL-STNet: Unsupervised Domain Adaptation for Cross-modality Medical Image Segmentation via Disentangled Representation Learning MICCAI2024

链接: https://arxiv.org/abs/2409.18340
作者: Hui Lin,Florian Schiffers,Santiago López-Tapia,Neda Tavakoli,Daniel Kim,Aggelos K. Katsaggelos
关键词-EN: Unsupervised domain adaptation, cross-modality data scenarios, Unsupervised domain, medical image segmentation, cross-modality medical image
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024 Challenge, FLARE Challenge, Unsupervised domain adaptation, Organ segmentation, Feature disentanglement, Self-training

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) is essential for medical image segmentation, especially in cross-modality data scenarios. UDA aims to transfer knowledge from a labeled source domain to an unlabeled target domain, thereby reducing the dependency on extensive manual annotations. This paper presents DRL-STNet, a novel framework for cross-modality medical image segmentation that leverages generative adversarial networks (GANs), disentangled representation learning (DRL), and self-training (ST). Our method leverages DRL within a GAN to translate images from the source to the target modality. Then, the segmentation model is initially trained with these translated images and corresponding source labels and then fine-tuned iteratively using a combination of synthetic and real images with pseudo-labels and real labels. The proposed framework exhibits superior performance in abdominal organ segmentation on the FLARE challenge dataset, surpassing state-of-the-art methods by 11.4% in the Dice similarity coefficient and by 13.1% in the Normalized Surface Dice metric, achieving scores of 74.21% and 80.69%, respectively. The average running time is 41 seconds, and the area under the GPU memory-time curve is 11,292 MB. These results indicate the potential of DRL-STNet for enhancing cross-modality medical image segmentation tasks.
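
The Dice similarity coefficient reported above can be computed for a pair of binary masks as follows; this is the standard definition 2·|A∩B| / (|A| + |B|), not code from the paper, and the Normalized Surface Dice (which needs surface distances) is omitted:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks: 2*|A&B| / (|A|+|B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

pred = np.zeros((8, 8), dtype=int); pred[2:6, 2:6] = 1  # 16 predicted pixels
gt = np.zeros((8, 8), dtype=int);   gt[3:7, 3:7] = 1    # 16 true pixels, 9 overlap
print(round(dice_coefficient(pred, gt), 4))  # 0.5625 = 2*9 / (16+16)
```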

[AI-114] Decomposition of one-layer neural networks via the infinite sum of reproducing kernel Banach spaces

链接: https://arxiv.org/abs/2409.18132
作者: Seungcheol Shin,Myungjoo Kang
关键词-EN:
类目: Functional Analysis (math.FA); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

[AI-115] A shortest-path based clustering algorithm for joint human-machine analysis of complex datasets

链接: https://arxiv.org/abs/1812.11850
作者: Diego Ulisse Pizzagalli,Santiago Fernandez Gonzalez,Rolf Krause
关键词-EN: biomedical research, obtained by empirical, empirical studies, major application, application for biomedical
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering is a technique for the analysis of datasets obtained by empirical studies in several disciplines with a major application for biomedical research. Essentially, clustering algorithms are executed by machines aiming at finding groups of related points in a dataset. However, the result of grouping depends on both metrics for point-to-point similarity and rules for point-to-group association. Indeed, non-appropriate metrics and rules can lead to undesirable clustering artifacts. This is especially relevant for datasets, where groups with heterogeneous structures co-exist. In this work, we propose an algorithm that achieves clustering by exploring the paths between points. This allows both, to evaluate the properties of the path (such as gaps, density variations, etc.), and expressing the preference for certain paths. Moreover, our algorithm supports the integration of existing knowledge about admissible and non-admissible clusters by training a path classifier. We demonstrate the accuracy of the proposed method on challenging datasets including points from synthetic shapes in publicly available benchmarks and microscopy data.
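
A minimal sketch of path-based grouping: two points end up in the same cluster iff some path links them whose every step stays below a gap threshold. This captures only the "gaps along the path" criterion; the paper's richer path properties and trained path classifier are not modeled here:

```python
import numpy as np
from collections import deque

def gap_constrained_clusters(points, max_gap):
    """Two points share a cluster iff some path connects them whose every
    consecutive step is shorter than `max_gap` (BFS over that implicit graph)."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    labels = [-1] * n
    cluster = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = cluster
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in np.nonzero(dists[i] < max_gap)[0]:
                if labels[j] == -1:
                    labels[j] = cluster
                    queue.append(j)
        cluster += 1
    return labels

# Two elongated groups: points within a group are chained by short steps,
# while the inter-group gap exceeds max_gap.
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = a + np.array([10.0, 0.0])
labels = gap_constrained_clusters(np.vstack([a, b]), max_gap=1.5)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that a[0] and a[2] are 2.0 apart, beyond the gap, yet cluster together because a path through a[1] exists; that is the difference from plain distance thresholding.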

计算机视觉

[CV-0] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation ECCV2024

链接: https://arxiv.org/abs/2409.18964
作者: Shaowei Liu,Zhongzheng Ren,Saurabh Gupta,Shenlong Wang
关键词-EN: temporally consistent video, input condition, force and torque, method that converts, converts a single
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to ECCV 2024. Project page: this https URL

点击查看摘要

Abstract:We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., force and torque applied to an object in the image) to produce a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are even precisely controllable, showcasing superior results over existing data-driven image-to-video generation works through quantitative comparison and comprehensive user study. PhysGen’s resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Project page: this https URL

[CV-1] Exploring Token Pruning in Vision State Space Models NEURIPS’24

链接: https://arxiv.org/abs/2409.18962
作者: Zheng Zhan,Zhenglun Kong,Yifan Gong,Yushu Wu,Zichong Meng,Hangyu Zheng,Xuan Shen,Stratis Ioannidis,Wei Niu,Pu Zhao,Yanzhi Wang
关键词-EN: State Space Models, powerful vision foundation, keeping linear computational, linear computational complexity, computational complexity compared
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NeurIPS’24

点击查看摘要

Abstract:State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observations that the final prediction in vision transformers (ViTs) is only based on a subset of most informative tokens, we take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning. However, direct applications of existing token pruning techniques designed for ViTs fail to deliver good performance, even with extensive fine-tuning. To address this issue, we revisit the unique computational characteristics of SSMs and discover that naive application disrupts the sequential token positions. This insight motivates us to design a novel and general token pruning method specifically for SSM-based vision models. We first introduce a pruning-aware hidden state alignment method to stabilize the neighborhood of remaining tokens for performance enhancement. Besides, based on our detailed analysis, we propose a token importance evaluation method adapted for SSM models, to guide the token pruning. With efficient implementation and practical acceleration methods, our method brings actual speedup. Extensive experiments demonstrate that our approach can achieve significant computation reduction with minimal impact on performance across different tasks. Notably, we achieve 81.7% accuracy on ImageNet with a 41.6% reduction in the FLOPs for pruned PlainMamba-L3. Furthermore, our work provides deeper insights into understanding the behavior of SSM-based vision models for future research.
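
The order-sensitivity noted above can be sketched in a few lines: prune by importance score, then re-sort the surviving indices so the kept tokens retain their original sequence positions for the sequential SSM scan. The scores and shapes below are illustrative; the paper's importance measure is SSM-specific:

```python
import numpy as np

def prune_tokens_keep_order(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, then re-sort the surviving
    indices so the sequence order is preserved for the sequential SSM scan."""
    top = np.argsort(scores)[-keep:]   # most important tokens (any order)
    kept = np.sort(top)                # restore original token positions
    return tokens[kept], kept

tokens = np.arange(24, dtype=float).reshape(6, 4)   # 6 tokens, dim 4 (toy values)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7])  # illustrative importance
pruned, idx = prune_tokens_keep_order(tokens, scores, keep=3)
print(idx.tolist())  # [1, 3, 5] -- order preserved, least important dropped
```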

[CV-2] ProMerge: Prompt and Merge for Unsupervised Instance Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.18961
作者: Dylan Li,Gyungin Shin
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV2024 camera-ready

点击查看摘要

[CV-3] UniCal: Unified Neural Sensor Calibration ECCV2024

链接: https://arxiv.org/abs/2409.18953
作者: Ze Yang,George Chen,Haowei Zhang,Kevin Ta,Ioan Andrei Bârsan,Daniel Murphy,Sivabalan Manivasagam,Raquel Urtasun
关键词-EN: Self-driving vehicles, accurately for autonomy, Self-driving, require accurate calibration, calibration
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: ECCV 2024. Project page: this https URL

点击查看摘要

Abstract:Self-driving vehicles (SDVs) require accurate calibration of LiDARs and cameras to fuse sensor data accurately for autonomy. Traditional calibration methods typically leverage fiducials captured in a controlled and structured scene and compute correspondences to optimize over. These approaches are costly and require substantial infrastructure and operations, making it challenging to scale for vehicle fleets. In this work, we propose UniCal, a unified framework for effortlessly calibrating SDVs equipped with multiple LiDARs and cameras. Our approach is built upon a differentiable scene representation capable of rendering multi-view geometrically and photometrically consistent sensor observations. We jointly learn the sensor calibration and the underlying scene representation through differentiable volume rendering, utilizing outdoor sensor data without the need for specific calibration fiducials. This “drive-and-calibrate” approach significantly reduces costs and operational overhead compared to existing calibration systems, enabling efficient calibration for large SDV fleets at scale. To ensure geometric consistency across observations from different sensors, we introduce a novel surface alignment loss that combines feature-based registration with neural rendering. Comprehensive evaluations on multiple datasets demonstrate that UniCal outperforms or matches the accuracy of existing calibration approaches while being more efficient, demonstrating the value of UniCal for scalable calibration.

[CV-4] Spectral Wavelet Dropout: Regularization in the Wavelet Domain ICML

链接: https://arxiv.org/abs/2409.18951
作者: Rinor Cakaj,Jens Mehnert,Bin Yang
关键词-EN: convolutional neural networks, techniques help prevent, ability of convolutional, convolutional neural, Spectral Wavelet Dropout
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by The International Conference on Machine Learning and Applications (ICMLA) 2024

点击查看摘要

Abstract:Regularization techniques help prevent overfitting and therefore improve the ability of convolutional neural networks (CNNs) to generalize. One reason for overfitting is the complex co-adaptations among different parts of the network, which make the CNN dependent on their joint response rather than encouraging each part to learn a useful feature representation independently. Frequency domain manipulation is a powerful strategy for modifying data that has temporal and spatial coherence by utilizing frequency decomposition. This work introduces Spectral Wavelet Dropout (SWD), a novel regularization method that includes two variants: 1D-SWD and 2D-SWD. These variants improve CNN generalization by randomly dropping detailed frequency bands in the discrete wavelet decomposition of feature maps. Our approach distinguishes itself from the pre-existing Spectral “Fourier” Dropout (2D-SFD), which eliminates coefficients in the Fourier domain. Notably, SWD requires only a single hyperparameter, unlike the two required by SFD. We also extend the literature by implementing a one-dimensional version of Spectral “Fourier” Dropout (1D-SFD), setting the stage for a comprehensive comparison. Our evaluation shows that both 1D and 2D SWD variants have competitive performance on CIFAR-10/100 benchmarks relative to both 1D-SFD and 2D-SFD. Specifically, 1D-SWD has a significantly lower computational complexity compared to 1D/2D-SFD. In the Pascal VOC Object Detection benchmark, SWD variants surpass 1D-SFD and 2D-SFD in performance and demonstrate lower computational complexity during training.
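
A 1D toy version of the idea, assuming a single-level Haar transform: decompose, zero the detail (high-frequency) band with one dropout probability, and reconstruct. The real SWD acts on CNN feature maps during training; this sketch only shows the band-dropping mechanism:

```python
import numpy as np

def haar_dwt1d(x):
    """One-level Haar wavelet transform: approximation and detail bands."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def haar_idwt1d(a, d):
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def spectral_wavelet_dropout(x, drop_prob, rng):
    """Zero the detail (high-frequency) band with probability `drop_prob` --
    the method's single hyperparameter -- then reconstruct."""
    a, d = haar_dwt1d(x)
    if rng.random() < drop_prob:
        d = np.zeros_like(d)
    return haar_idwt1d(a, d)

x = np.array([1.0, 3.0, 2.0, 2.0])
out = spectral_wavelet_dropout(x, drop_prob=1.0, rng=np.random.default_rng(0))
print(out)  # [2. 2. 2. 2.] -- detail dropped, local averages survive
```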

[CV-5] From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

链接: https://arxiv.org/abs/2409.18938
作者: Heqing Zou,Tianze Luo,Guiyang Xie,Victor (Xiao Jie) Zhang,Fengmao Lv,Guangcong Wang,Juanyang Chen,Zhuochen Wang,Hansheng Zhang,Huaijian Zhang
关键词-EN: Large Language Models, Large Language, MultiModal Large Language, Language Models, recently shown promising
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance in visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual reasoning. Given the diverse nature of visual data, MultiModal Large Language Models (MM-LLMs) exhibit variations in model designing and training for understanding images, short videos, and long videos. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding. Unlike static images, short videos encompass sequential frames with both spatial and within-event temporal information, while long videos consist of multiple events with between-event and long-term temporal information. In this survey, we aim to trace and summarize the advancements of MM-LLMs from image understanding to long video understanding. We review the differences among various visual understanding tasks and highlight the challenges in long video understanding, including more fine-grained spatiotemporal details, dynamic events, and long-term dependencies. We then provide a detailed summary of the advancements in MM-LLMs in terms of model design and training methodologies for understanding long videos. Finally, we compare the performance of existing MM-LLMs on video understanding benchmarks of various lengths and discuss potential future directions for MM-LLMs in long video understanding.

[CV-6] ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions

链接: https://arxiv.org/abs/2409.18932
作者: Wenfeng Huang,Guoan Xu,Wenjing Jia,Stuart Perry,Guangwei Gao
关键词-EN: challenging environments, captured in challenging, suffer from significant, substantial loss, loss of visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Images captured in challenging environments–such as nighttime, foggy, rainy weather, and underwater–often suffer from significant degradation, resulting in a substantial loss of visual quality. Effective restoration of these degraded images is critical for the subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed “ReviveDiff”, which can address a wide range of degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserves the original structures of objects. To restore the quality of such images, we leveraged the latest advancements in diffusion models and developed ReviveDiff to restore image quality from both macro and micro levels across some key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluated ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms the state-of-the-art methods both quantitatively and visually.

[CV-7] SurfaceAI: Automated creation of cohesive road surface quality datasets based on open street-level imagery

链接: https://arxiv.org/abs/2409.18922
作者: Alexandra Kapp,Edith Hoffmann,Esther Weigmann,Helena Mihaljević
关键词-EN: generate comprehensive georeferenced, comprehensive georeferenced datasets, paper introduces SurfaceAI, paper introduces, pipeline designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages, 2 figures; accepted at 2nd ACM SIGSPATIAL International Workshop on Advances in Urban-AI

点击查看摘要

Abstract:This paper introduces SurfaceAI, a pipeline designed to generate comprehensive georeferenced datasets on road surface type and quality from openly available street-level imagery. The motivation stems from the significant impact of road unevenness on the safety and comfort of traffic participants, especially vulnerable road users, emphasizing the need for detailed road surface data in infrastructure modeling and analysis. SurfaceAI addresses this gap by leveraging crowdsourced Mapillary data to train models that predict the type and quality of road surfaces visible in street-level images, which are then aggregated to provide cohesive information on entire road segment conditions.
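
The abstract does not state the aggregation rule, so as a hedged sketch, turning per-image predictions into one cohesive label per road segment can be done with a simple majority vote:

```python
from collections import Counter, defaultdict

def aggregate_segments(predictions):
    """Turn per-image (segment_id, surface_label) predictions into one cohesive
    label per road segment by majority vote (ties resolved by first occurrence)."""
    by_segment = defaultdict(list)
    for segment_id, label in predictions:
        by_segment[segment_id].append(label)
    return {seg: Counter(labels).most_common(1)[0][0]
            for seg, labels in by_segment.items()}

preds = [("seg-1", "asphalt"), ("seg-1", "asphalt"), ("seg-1", "paving_stones"),
         ("seg-2", "gravel")]
print(aggregate_segments(preds))  # {'seg-1': 'asphalt', 'seg-2': 'gravel'}
```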

[CV-8] Improving Visual Object Tracking through Visual Prompting

链接: https://arxiv.org/abs/2409.18901
作者: Shih-Fang Chen,Jun-Cheng Chen,I-Hong Jhuo,Yen-Yu Lin
关键词-EN: visual object tracking, generic visual object, visual prompt, visual, object tracking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注: Accepted and to appear in IEEE Transactions on Multimedia

点击查看摘要

Abstract:Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pretrained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method can suppress distracting objects and enhance the tracker.
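
A bare-bones sketch of the refinement idea, with plain vectors standing in for CLIP features: candidate locations whose features are dissimilar to the reference template get their prompt weight suppressed, so distractors fade out. Everything here is illustrative, not PiVOT's actual pipeline:

```python
import numpy as np

def refine_prompt_scores(candidate_feats, template_feat, prompt_scores):
    """Down-weight prompt locations whose candidate features are dissimilar to
    the reference template (cosine similarity), so distractors fade out."""
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    t = template_feat / np.linalg.norm(template_feat)
    sims = np.clip(c @ t, 0.0, None)      # negative similarity -> zero weight
    return prompt_scores * sims

template = np.array([1.0, 0.0])                  # reference appearance
candidates = np.array([[1.0, 0.0],               # target-like candidate
                       [0.0, 1.0]])              # distractor
print(refine_prompt_scores(candidates, template, np.array([1.0, 1.0])))  # [1. 0.]
```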

[CV-9] Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors

链接: https://arxiv.org/abs/2409.18899
作者: Yunlong Lin,Zhenqi Fu,Kairun Wen,Tian Ye,Sixiang Chen,Ge Meng,Yingying Wang,Yue Huang,Xiaotong Tu,Xinghao Ding
关键词-EN: poor illumination environments, Low-light image enhancement, illumination environments, precisely and efficiently, efficiently recovering
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 13 pages, 10 figures

点击查看摘要

Abstract:Low-light image enhancement (LIE) aims at precisely and efficiently recovering an image degraded in poor illumination environments. Recent advanced LIE techniques are using deep neural networks, which require lots of low-normal light image pairs, network parameters, and computational resources. As a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on diffusion priors and lookup tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light adjustment lookup table (LLUT) and a noise suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses. It aims at predicting pixel-wise curve parameters for the dynamic range adjustment of a specific image. NLUT is designed to remove the amplified noise after the light brightens. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.
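
Why lookup tables are attractive here: once a curve is baked into a 256-entry LUT, per-pixel adjustment becomes a single gather. A generic illustration with a fixed gamma-style brightening curve (LLUT instead predicts pixel-wise curve parameters, which this sketch does not attempt):

```python
import numpy as np

def apply_lut(image_u8, lut):
    """Apply a 256-entry lookup table to an 8-bit image: one gather per pixel."""
    assert lut.shape == (256,)
    return lut[image_u8]

# A gamma-style brightening curve baked into a LUT (illustrative only).
x = np.arange(256, dtype=np.float64) / 255.0
gamma_lut = np.clip(np.round(255.0 * np.sqrt(x)), 0, 255).astype(np.uint8)

dark = np.array([[0, 16], [64, 255]], dtype=np.uint8)
bright = apply_lut(dark, gamma_lut)   # 0->0, 16->64, 64->128, 255->255
```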

[CV-10] Detecting Dataset Abuse in Fine-Tuning Stable Diffusion Models for Text-to-Image Synthesis

链接: https://arxiv.org/abs/2409.18897
作者: Songrui Wang,Yubo Zhu,Wei Tong,Sheng Zhong
关键词-EN: requiring fine-tuning generative, fine-tuning generative models, stylized images, specialized tasks, Stable Diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image synthesis has become highly popular for generating realistic and stylized images, often requiring fine-tuning generative models with domain-specific datasets for specialized tasks. However, these valuable datasets face risks of unauthorized usage and unapproved sharing, compromising the rights of the owners. In this paper, we address the issue of dataset abuse during the fine-tuning of Stable Diffusion models for text-to-image synthesis. We present a dataset watermarking framework designed to detect unauthorized usage and trace data leaks. The framework employs two key strategies across multiple watermarking schemes and is effective for large-scale dataset authorization. Extensive experiments demonstrate the framework’s effectiveness, minimal impact on the dataset (only 2% of the data required to be modified for high detection accuracy), and ability to trace data leaks. Our results also highlight the robustness and transferability of the framework, proving its practical applicability in detecting dataset abuse.
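
As a hedged illustration of the "2% of the data" figure, the sketch below marks a small random subset of images in their least significant bit. The paper's actual watermarking schemes are not described in the abstract and are certainly more robust than this toy:

```python
import numpy as np

def watermark_subset(images, fraction, bit, rng):
    """Write `bit` into the least significant bit of a random `fraction` of
    the images; the rest are left untouched."""
    n = len(images)
    k = max(1, int(round(fraction * n)))
    idx = rng.choice(n, size=k, replace=False)
    marked = images.copy()
    marked[idx] = (marked[idx] & np.uint8(0xFE)) | np.uint8(bit)
    return marked, idx

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 4, 4), dtype=np.uint8)
marked, idx = watermark_subset(images, fraction=0.02, bit=1, rng=rng)
assert len(idx) == 2                     # only 2% of the images are modified
assert ((marked[idx] & 1) == 1).all()    # the mark is present where embedded
```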

[CV-11] S2O: Static to Openable Enhancement for Articulated 3D Objects

链接: https://arxiv.org/abs/2409.18896
作者: Denys Iliash,Hanxiao Jiang,Yiming Zhang,Manolis Savva,Angel X. Chang
关键词-EN: manual effort required, progress in large, scale is limited, limited due, manual effort
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite much progress in large 3D datasets there are currently few interactive 3D object datasets, and their scale is limited due to the manual effort required in their construction. We introduce the static to openable (S2O) task which creates interactive articulated 3D objects from static counterparts through openable part detection, motion prediction, and interior geometry completion. We formulate a unified framework to tackle this task, and curate a challenging dataset of openable 3D objects that serves as a test bed for systematic evaluation. Our experiments benchmark methods from prior work and simple yet effective heuristics for the S2O task. We find that turning static 3D objects into interactively openable counterparts is possible but that all methods struggle to generalize to realistic settings of the task, and we highlight promising future work directions.

[CV-12] Explainable Artifacts for Synthetic Western Blot Source Attribution

链接: https://arxiv.org/abs/2409.18881
作者: João Phillipe Cardenuto,Sara Mandelli,Daniel Moreira,Paolo Bestagini,Edward Delp,Anderson Rocha
关键词-EN: expert scientists habituated, Recent advancements, produce synthetic scientific, enabled generative models, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in IEEE International Workshop on Information Forensics and Security - WIFS 2024, Rome, Italy

点击查看摘要

Abstract:Recent advancements in artificial intelligence have enabled generative models to produce synthetic scientific images that are indistinguishable from pristine ones, posing a challenge even for expert scientists habituated to working with such content. When exploited by organizations known as paper mills, which systematically generate fraudulent articles, these technologies can significantly contribute to the spread of misinformation about ungrounded science, potentially undermining trust in scientific research. While previous studies have explored black-box solutions, such as Convolutional Neural Networks, for identifying synthetic content, only some have addressed the challenge of generalizing across different models and providing insight into the artifacts in synthetic images that inform the detection process. This study aims to identify explainable artifacts generated by state-of-the-art generative models (e.g., Generative Adversarial Networks and Diffusion Models) and leverage them for open-set identification and source attribution (i.e., pointing to the model that created the image).

[CV-13] UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

链接: https://arxiv.org/abs/2409.18877
作者: Chuang Chen,Xiao Sun,Zhi Liu
关键词-EN: Visual emotion analysis, analysis holds significant, emotion analysis holds, holds significant research, emotion analysis
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to TIP

点击查看摘要

Abstract:Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at this https URL.

[CV-14] CemiFace: Center-based Semi-hard Synthetic Face Generation for Face Recognition NEURIPS2024

链接: https://arxiv.org/abs/2409.18876
作者: Zhonglin Sun,Siyang Song,Ioannis Patras,Georgios Tzimiropoulos
关键词-EN: Privacy issue, face recognition techniques, developing face recognition, face recognition, face
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted to NeurIPS 2024. We are preparing the camera-ready version according to the reviews

点击查看摘要

Abstract:Privacy issue is a main concern in developing face recognition techniques. Although synthetic face images can partially mitigate potential legal risks while maintaining effective face recognition (FR) performance, FR models trained by face images synthesized by existing generative approaches frequently suffer from performance degradation problems due to the insufficient discriminative quality of these synthesized samples. In this paper, we systematically investigate what contributes to solid face recognition model training, and reveal that face images with certain degree of similarities to their identity centers show great effectiveness in the performance of trained FR models. Inspired by this, we propose a novel diffusion-based approach (namely Center-based Semi-hard Synthetic Face Generation (CemiFace)) which produces facial samples with various levels of similarity to the subject center, thus allowing to generate face datasets containing effective discriminative samples for training face recognition. Experimental results show that with a modest degree of similarity, training on the generated dataset can produce competitive performance compared to previous generation methods.
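The "certain degree of similarity to the identity center" idea above can be sketched as a simple selection rule; the similarity band limits below are invented for illustration and are not values from the paper:

```python
import numpy as np

# Keep synthetic face embeddings whose cosine similarity to their identity
# center falls in a middle ("semi-hard") band: not near-duplicates of the
# center, but still recognizably the same identity. The band [lo, hi] is a
# hypothetical choice for this sketch.
def semi_hard_mask(embs, center, lo=0.3, hi=0.7):
    center = center / np.linalg.norm(center)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs @ center
    return (sims >= lo) & (sims <= hi)

center = np.array([1.0, 0.0])
embs = np.array([[1.0, 0.0],    # duplicate of the center: too easy
                 [1.0, 1.0],    # similarity ~0.71: just above the band
                 [0.5, 1.0],    # similarity ~0.45: semi-hard, kept
                 [0.0, 1.0]])   # orthogonal: too hard
print(semi_hard_mask(embs, center))
```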

[CV-15] Emu3: Next-Token Prediction is All You Need

链接: https://arxiv.org/abs/2409.18869
作者: Xinlong Wang,Xiaosong Zhang,Zhengxiong Luo,Quan Sun,Yufeng Cui,Jinsheng Wang,Fan Zhang,Yueze Wang,Zhen Li,Qiying Yu,Yingli Zhao,Yulong Ao,Xuebin Min,Tao Li,Boya Wu,Bo Zhao,Bowen Zhang,Liangdong Wang,Guang Liu,Zheqi He,Xi Yang,Jingjing Liu,Yonghua Lin,Tiejun Huang,Zhongyuan Wang
关键词-EN: CLIP combined, Stable Diffusion, next-token prediction, combined with LLMs, struggled to excel
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
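The core "everything is tokens" idea can be illustrated with a toy sketch: map text, image, and video content into one discrete vocabulary, separate the modalities with special boundary tokens, and train a single autoregressive model with plain next-token prediction. All token IDs below are invented placeholders:

```python
# Toy sketch of mixing modalities into one discrete training sequence.
BOS, BOI, EOI = 0, 1, 2  # begin-of-sequence / begin- and end-of-image

def build_sequence(text_tokens, image_tokens):
    """Flatten one (text, image) pair into a single token sequence."""
    return [BOS] + text_tokens + [BOI] + image_tokens + [EOI]

def next_token_pairs(seq):
    """Next-token prediction targets: predict seq[i+1] from seq[:i+1]."""
    return [(seq[: i + 1], seq[i + 1]) for i in range(len(seq) - 1)]

seq = build_sequence(text_tokens=[10, 11], image_tokens=[500, 501, 502])
pairs = next_token_pairs(seq)
print(seq)       # [0, 10, 11, 1, 500, 501, 502, 2]
print(pairs[0])  # ([0], 10)
```

In this view, generating an image is nothing special: the model simply continues predicting tokens after a begin-of-image marker.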

[CV-16] MCUBench: A Benchmark of Tiny Object Detectors on MCUs

链接: https://arxiv.org/abs/2409.18866
作者: Sudhakar Sah,Darshan C. Ganji,Matteo Grimaldi,Ravish Kumar,Alexander Hoffman,Honnesh Rohmetra,Ehsan Saboori
关键词-EN: VOC dataset, average precision, benchmark featuring, detection models evaluated, VOC
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code and data are available at this https URL

点击查看摘要

Abstract:We introduce MCUBench, a benchmark featuring over 100 YOLO-based object detection models evaluated on the VOC dataset across seven different MCUs. This benchmark provides detailed data on average precision, latency, RAM, and Flash usage for various input resolutions and YOLO-based one-stage detectors. By conducting a controlled comparison with a fixed training pipeline, we collect comprehensive performance metrics. Our Pareto-optimal analysis shows that integrating modern detection heads and training techniques allows various YOLO architectures, including legacy models like YOLOv3, to achieve a highly efficient tradeoff between mean Average Precision (mAP) and latency. MCUBench serves as a valuable tool for benchmarking the MCU performance of contemporary object detectors and aids in model selection based on specific constraints.
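The Pareto-optimal analysis mentioned above boils down to filtering detectors that no other detector beats on both mAP and latency at once. A minimal sketch, with model names and numbers that are illustrative placeholders rather than benchmark values:

```python
# A detector is Pareto-optimal if no other detector has both higher (or
# equal) mAP and lower (or equal) latency, with at least one strictly better.
def pareto_frontier(detectors):
    """detectors: list of (name, map_score, latency_ms) tuples."""
    frontier = []
    for name, m, lat in detectors:
        dominated = any(
            m2 >= m and lat2 <= lat and (m2 > m or lat2 < lat)
            for _, m2, lat2 in detectors
        )
        if not dominated:
            frontier.append((name, m, lat))
    return sorted(frontier, key=lambda d: d[2])  # sort by latency

# Hypothetical numbers for illustration only.
models = [("A", 0.42, 120), ("B", 0.35, 60), ("C", 0.30, 90), ("D", 0.45, 200)]
print(pareto_frontier(models))  # "C" is dominated by "B" and drops out
```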

[CV-17] LW2G: Learning Whether to Grow for Prompt-based Continual Learning NEURIPS2024

链接: https://arxiv.org/abs/2409.18860
作者: Qian Feng,Dawei Zhou,Hanbin Zhao,Chao Zhang,Hui Qian
关键词-EN: Prompt-based Continual Learning, Continual Learning, Projection Continual Learning, Recent Prompt-based Continual, Prompt-based Continual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to NeurIPS 2024

点击查看摘要

Abstract:Continual Learning (CL) aims to learn in non-stationary scenarios, progressively acquiring and maintaining knowledge from sequential tasks. Recent Prompt-based Continual Learning (PCL) has achieved remarkable performance with Pre-Trained Models (PTMs). These approaches grow a pool of prompt sets by adding a new set of prompts when learning each new task (prompt learning) and adopt a matching mechanism to select the correct set for each testing sample (prompt retrieval). Previous studies focus on the latter stage by improving the matching mechanism to enhance Prompt Retrieval Accuracy (PRA). To promote cross-task knowledge facilitation and form an effective and efficient pool of prompt sets, we propose a plug-in module in the former stage to Learn Whether to Grow (LW2G) based on the disparities between tasks. Specifically, a shared set of prompts is utilized when several tasks share certain commonalities, and a new set is added when there are significant differences between the new task and previous tasks. Inspired by Gradient Projection Continual Learning, our LW2G develops a metric called Hinder Forward Capability (HFC) to measure the hindrance imposed on learning new tasks by surgically modifying the original gradient onto the orthogonal complement of the old feature space. With HFC, an automated scheme, Dynamic Growing Approach, adaptively learns whether to grow with a dynamic threshold. Furthermore, we design a gradient-based constraint to ensure consistency between the updating prompts and pre-trained knowledge, and a prompt-weights reusing strategy to enhance forward transfer. Extensive experiments show the effectiveness of our method. The source codes are available at this https URL.
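The gradient-surgery step behind the HFC metric can be illustrated with a small numpy sketch: project the new-task gradient onto the orthogonal complement of the old feature space and see how much of it survives. The feature-space construction and the ratio reported here are simplified assumptions, not the paper's exact formulation:

```python
import numpy as np

# Project a new-task gradient onto the orthogonal complement of the old
# feature space (columns of M span that space). The norm of the surviving
# component is a rough stand-in for how little the old space hinders the
# new task.
def orthogonal_residual(g, M):
    Q, _ = np.linalg.qr(M)       # orthonormal basis of the old space
    return g - Q @ (Q.T @ g)     # remove the component inside the span

rng = np.random.default_rng(1)
M = rng.normal(size=(8, 3))      # old feature space: 3 directions in R^8
g = rng.normal(size=8)           # new-task gradient
r = orthogonal_residual(g, M)
hindrance = 1.0 - np.linalg.norm(r) / np.linalg.norm(g)  # toy HFC-like score
print(round(hindrance, 3))
```

A score near 1 means almost all of the gradient lies inside the old feature space (strong hindrance, suggesting a new prompt set), while a score near 0 means the new task can be learned largely within it.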

[CV-18] Space-time 2D Gaussian Splatting for Accurate Surface Reconstruction under Complex Dynamic Scenes

链接: https://arxiv.org/abs/2409.18852
作者: Shuo Wang,Binbin Huang,Ruoyu Wang,Shenghua Gao
关键词-EN: involving multi-person activities, lengthy training times, scenes involving multi-person, Previous surface reconstruction, low geometric accuracy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Previous surface reconstruction methods either suffer from low geometric accuracy or lengthy training times when dealing with real-world complex dynamic scenes involving multi-person activities and human-object interactions. To tackle the dynamic contents and the occlusions in complex scenes, we present a space-time 2D Gaussian Splatting approach. Specifically, to improve geometric quality in dynamic scenes, we learn canonical 2D Gaussian splats and deform these 2D Gaussian splats while enforcing the disks of the Gaussians to lie on the surface of the objects by introducing depth and normal regularizers. Further, to tackle the occlusion issues in complex scenes, we introduce a compositional opacity deformation strategy, which further reduces the surface recovery of those occluded areas. Experiments on real-world sparse-view video datasets and monocular dynamic datasets demonstrate that our reconstructions outperform state-of-the-art methods, especially for fine surface details. The project page and more visualizations can be found at: this https URL.

[CV-19] MinerU: An Open-Source Solution for Precise Document Content Extraction

链接: https://arxiv.org/abs/2409.18839
作者: Bin Wang,Chao Xu,Xiaomeng Zhao,Linke Ouyang,Fan Wu,Zhiyuan Zhao,Rui Xu,Kaiwen Liu,Yuan Qu,Fukai Shang,Bo Zhang,Liqun Wei,Zhihao Sui,Wei Li,Botian Shi,Yu Qiao,Dahua Lin,Conghui He
关键词-EN: crucial research area, computer vision, crucial research, research area, area in computer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MinerU Technical Report

点击查看摘要

Abstract:Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at this https URL.

[CV-20] Classification and regression of trajectories rendered as images via 2D Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.18832
作者: Mariaclaudia Nicolai,Raffaella Fiamma Cabini,Diego Ulisse Pizzagalli
关键词-EN: typically arising, motile objects, regarded as time-series, arising from motile, Trajectories
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Trajectories can be regarded as time-series of coordinates, typically arising from motile objects. Methods for trajectory classification are particularly important to detect different movement patterns, while regression methods are needed to compute motility metrics and for forecasting. Recent advances in computer vision have facilitated the processing of trajectories rendered as images via artificial neural networks with 2d convolutional layers (CNNs). This approach leverages the capability of CNNs to learn spatial hierarchies of features from images, necessary to recognize complex shapes. Moreover, it overcomes the limitation of other machine learning methods that require input trajectories with a fixed number of points. However, rendering trajectories as images can introduce poorly investigated artifacts such as information loss due to the plotting of coordinates on a discrete grid, and spectral changes due to line thickness and aliasing. In this study, we investigate the effectiveness of CNNs for solving classification and regression problems from synthetic trajectories that have been rendered as images using different modalities. The parameters considered in this study include line thickness, image resolution, usage of motion history (color-coding of the temporal component) and anti-aliasing. Results highlight the importance of choosing an appropriate image resolution according to model depth and motion history in applications where movement direction is critical.
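The information-loss artifact discussed above (coordinates collapsing onto a discrete grid) is easy to reproduce with a minimal rasterizer; the trajectory and resolutions below are illustrative:

```python
# Minimal sketch of rendering a trajectory (sequence of (x, y) coordinates
# in [0, 1)) onto a discrete grid, the step that loses information when the
# resolution is too low.
def rasterize(trajectory, resolution):
    grid = [[0] * resolution for _ in range(resolution)]
    for x, y in trajectory:
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        grid[row][col] = 1  # binary image; motion history would store time
    return grid

traj = [(0.1, 0.1), (0.12, 0.11), (0.5, 0.5), (0.9, 0.9)]
coarse = rasterize(traj, resolution=4)   # first two points collapse to one pixel
fine = rasterize(traj, resolution=64)    # they stay distinct
print(sum(map(sum, coarse)), sum(map(sum, fine)))  # 3 4
```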

[CV-21] YOLOv8-ResCBAM: YOLOv8 Based on An Effective Attention Module for Pediatric Wrist Fracture Detection ICONIP2024

链接: https://arxiv.org/abs/2409.18826
作者: Rui-Yang Ju,Chun-Tse Chien,Jen-Shiun Chiang
关键词-EN: fractures occur frequently, Wrist trauma, daily life, occur frequently, frequently in daily
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ICONIP 2024. arXiv admin note: substantial text overlap with arXiv:2402.09329

点击查看摘要

Abstract:Wrist trauma and even fractures occur frequently in daily life, particularly among children, who account for a significant proportion of fracture cases. Before performing surgery, surgeons often request patients to undergo X-ray imaging first, and prepare for the surgery based on the analysis of the X-ray images. With the development of neural networks, You Only Look Once (YOLO) series models have been widely used in fracture detection for Computer-Assisted Diagnosis, where the YOLOv8 model has obtained satisfactory results. Applying attention modules to neural networks is one of the effective methods to improve model performance. This paper proposes YOLOv8-ResCBAM, which incorporates a Convolutional Block Attention Module integrated with resblock (ResCBAM) into the original YOLOv8 network architecture. The experimental results on the GRAZPEDWRI-DX dataset demonstrate that the mean Average Precision calculated at an Intersection over Union threshold of 0.5 (mAP 50) of the proposed model increased from 63.6% for the original YOLOv8 model to 65.8%, which achieves state-of-the-art performance. The implementation code is available at this https URL.
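The two stages of a CBAM-style block (channel attention, then spatial attention) can be sketched in numpy. This is an illustration of the attention structure only: the shared MLP of the channel branch and the 7x7 convolution of the spatial branch are replaced by trivial stand-ins, so it is not a faithful re-implementation of ResCBAM:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    # Global average- and max-pooling per channel; the shared MLP of real
    # CBAM is omitted (identity) in this sketch.
    avg = x.mean(axis=(1, 2))
    mx = x.max(axis=(1, 2))
    w = sigmoid(avg + mx)                # one weight per channel, shape (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    avg = x.mean(axis=0)                 # (H, W)
    mx = x.max(axis=0)
    w = sigmoid((avg + mx) / 2.0)        # stand-in for the 7x7 conv
    return x * w[None, :, :]

x = np.random.rand(8, 4, 4)              # feature map in (C, H, W) layout
out = spatial_attention(channel_attention(x))
print(out.shape)  # (8, 4, 4)
```

The key property is that both steps only rescale the feature map, so the block can be dropped into an existing backbone without changing tensor shapes.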

[CV-22] EyeTrAES: Fine-grained Low-Latency Eye Tracking via Adaptive Event Slicing

链接: https://arxiv.org/abs/2409.18813
作者: Argha Sen,Nuwan Bandara,Ila Gokarn,Thivya Kandappu,Archan Misra
关键词-EN: recent years due, gained significant attention, RGB camera-based eye-tracking, human-computer interaction, virtual and augmented
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: 32 pages, 15 figures

点击查看摘要

Abstract:Eye-tracking technology has gained significant attention in recent years due to its wide range of applications in human-computer interaction, virtual and augmented reality, and wearable health. Traditional RGB camera-based eye-tracking systems often struggle with poor temporal resolution and computational constraints, limiting their effectiveness in capturing rapid eye movements. To address these limitations, we propose EyeTrAES, a novel approach using neuromorphic event cameras for high-fidelity tracking of natural pupillary movement that shows significant kinematic variance. One of EyeTrAES’s highlights is the use of a novel adaptive windowing/slicing algorithm that ensures just the right amount of descriptive asynchronous event data accumulation within an event frame, across a wide range of eye movement patterns. EyeTrAES then applies lightweight image processing functions over accumulated event frames from just a single eye to perform pupil segmentation and tracking. We show that these methods boost pupil tracking fidelity by 6+%, achieving IoU ≈ 92%, while incurring at least 3x lower latency than competing pure event-based eye tracking alternatives [38]. We additionally demonstrate that the microscopic pupillary motion captured by EyeTrAES exhibits distinctive variations across individuals and can thus serve as a biometric fingerprint. For robust user authentication, we train a lightweight per-user Random Forest classifier using a novel feature vector of short-term pupillary kinematics, comprising a sliding window of pupil (location, velocity, acceleration) triples. Experimental studies with two different datasets demonstrate that the EyeTrAES-based authentication technique can simultaneously achieve high authentication accuracy (≈0.82) and low processing latency (≈12 ms), and significantly outperform multiple state-of-the-art competitive baselines.
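The event-slicing idea above can be sketched with a simple count-based rule: asynchronous events are accumulated into a frame until a target count is reached, so dense bursts (fast eye movements) yield short windows and sparse periods yield long ones. The fixed-count rule below is a simplified stand-in for the paper's adaptive algorithm:

```python
# Count-based slicing of an asynchronous event stream into frames.
def slice_events(events, target_count):
    """events: list of (timestamp, x, y, polarity); returns list of frames."""
    frames, current = [], []
    for ev in events:
        current.append(ev)
        if len(current) >= target_count:
            frames.append(current)
            current = []
    if current:
        frames.append(current)  # flush the trailing partial window
    return frames

# Synthetic event stream for illustration.
events = [(t, t % 5, t % 7, 1) for t in range(10)]
frames = slice_events(events, target_count=4)
print([len(f) for f in frames])  # [4, 4, 2]
```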

[CV-23] MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation

链接: https://arxiv.org/abs/2409.18800
作者: Junyou Zhu,Yanyuan Qiao,Siqi Zhang,Xingjian He,Qi Wu,Jing Liu
关键词-EN: Embodied Artificial Intelligence, Artificial Intelligence, limited computational capabilities, Embodied Artificial, recent years
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, yet the increasing size of models conflicts with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN, and showcasing the significant potential of distillation techniques in developing lightweight models. The proposed method aims to capture fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model compared to single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12% of the teacher model’s parameter count.

[CV-24] Supervised Learning Model for Key Frame Identification from Cow Teat Videos

链接: https://arxiv.org/abs/2409.18797
作者: Minghao Wang,Pinxue Lin
关键词-EN: cow teat, mastitis risk assessment, proposes a method, method for improving, cow
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This paper proposes a method for improving the accuracy of mastitis risk assessment in cows using neural networks and video analysis. Mastitis, an infection of the udder tissue, is a critical health problem for cows and can be detected by examining the cow’s teat. Traditionally, veterinarians assess the health of a cow’s teat during the milking process, but this process is limited in time and can weaken the accuracy of the assessment. In commercial farms, cows are recorded by cameras when they are milked in the milking parlor. This paper uses a neural network to identify key frames in the recorded video where the cow’s udder appears intact. These key frames allow veterinarians to have more flexible time to perform health assessments on the teat, increasing their efficiency and accuracy. However, there are challenges in using cow teat video for mastitis risk assessment, such as complex environments, changing cow positions and postures, and difficulty in identifying the udder from the video. To address these challenges, a fusion distance and an ensemble model are proposed to improve the performance (F-score) of identifying key frames from cow teat videos. The results show that these two approaches improve performance compared to using a single distance measure or model.

[CV-25] Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs

链接: https://arxiv.org/abs/2409.18794
作者: Yanyuan Qiao,Wenqi Lyu,Hui Wang,Zixu Wang,Zerui Li,Yuan Zhang,Mingkui Tan,Qi Wu
关键词-EN: follow textual instructions, require an agent, agent to follow, follow textual, train VLN models
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN) tasks require an agent to follow textual instructions to navigate through 3D environments. Traditional approaches use supervised learning methods, relying heavily on domain-specific datasets to train VLN models. Recent methods try to utilize closed-source large language models (LLMs) like GPT-4 to solve VLN tasks in zero-shot manners, but face challenges related to expensive token costs and potential data breaches in real-world applications. In this work, we introduce Open-Nav, a novel study that explores open-source LLMs for zero-shot VLN in the continuous environment. Open-Nav employs a spatial-temporal chain-of-thought (CoT) reasoning approach to break down tasks into instruction comprehension, progress estimation, and decision-making. It enhances scene perceptions with fine-grained object and spatial knowledge to improve LLM’s reasoning in navigation. Our extensive experiments in both simulated and real-world environments demonstrate that Open-Nav achieves competitive performance compared to using closed-source LLMs.

[CV-26] Excavating in the Wild: The GOOSE-Ex Dataset for Semantic Segmentation

链接: https://arxiv.org/abs/2409.18788
作者: Raphael Hagmanns,Peter Mortimer,Miguel Granero,Thorsten Luettel,Janko Petereit
关键词-EN: deep learning-based techniques, successful deployment, autonomous systems, respective system, deep learning-based
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE for review

点击查看摘要

Abstract:The successful deployment of deep learning-based techniques for autonomous systems is highly dependent on the data availability for the respective system in its deployment environment. Especially for unstructured outdoor environments, very few datasets exist for even fewer robotic platforms and scenarios. In an earlier work, we presented the German Outdoor and Offroad Dataset (GOOSE) framework along with 10000 multimodal frames from an offroad vehicle to enhance the perception capabilities in unstructured environments. In this work, we address the generalizability of the GOOSE framework. To accomplish this, we open-source the GOOSE-Ex dataset, which contains additional 5000 labeled multimodal frames from various completely different environments, recorded on a robotic excavator and a quadruped platform. We perform a comprehensive analysis of the semantic segmentation performance on different platforms and sensor modalities in unseen environments. In addition, we demonstrate how the combined datasets can be utilized for different downstream applications or competitions such as offroad navigation, object manipulation or scene completion. The dataset, its platform documentation and pre-trained state-of-the-art models for offroad perception will be made available on this https URL.

[CV-27] Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation

链接: https://arxiv.org/abs/2409.18785
作者: Chaomin Shen,Yaomin Huang,Haokun Zhu,Jinsong Fan,Guixu Zhang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-28] Relighting from a Single Image: Datasets and Deep Intrinsic-based Architecture

链接: https://arxiv.org/abs/2409.18770
作者: Yixiong Yang,Hassan Ahmed Sial,Ramon Baldrich,Maria Vanrell
关键词-EN: Single image scene, Single image, target light condition, scene relighting aims, image scene relighting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication as a Regular paper in the IEEE Transactions on Multimedia

点击查看摘要

Abstract:Single image scene relighting aims to generate a realistic new version of an input image so that it appears to be illuminated by a new target light condition. Although existing works have explored this problem from various perspectives, generating relit images under arbitrary light conditions remains highly challenging, and related datasets are scarce. Our work addresses this problem from both the dataset and methodological perspectives. We propose two new datasets: a synthetic dataset with the ground truth of intrinsic components and a real dataset collected under laboratory conditions. These datasets alleviate the scarcity of existing datasets. To incorporate physical consistency in the relighting pipeline, we establish a two-stage network based on intrinsic decomposition, giving outputs at intermediate steps, thereby introducing physical constraints. When the training set lacks ground truth for intrinsic decomposition, we introduce an unsupervised module to ensure that the intrinsic outputs are satisfactory. Our method outperforms the state-of-the-art methods in performance, as tested on both existing datasets and our newly developed datasets. Furthermore, pretraining our method or other prior methods using our synthetic dataset can enhance their performance on other datasets. Since our method can accommodate any light conditions, it is capable of producing animated results. The dataset, method, and videos are publicly available.

[CV-29] State-of-the-Art Periorbital Distance Prediction and Disease Classification Using Periorbital Features

链接: https://arxiv.org/abs/2409.18769
作者: George R. Nahass,Ghasem Yazdanpanah,Madison Cheung,Alex Palacios,Jeffery Peterson,Kevin Heinze,Sasha Hubschman,Chad A. Purnell,Pete Setabutr,Ann Q. Tran,Darvin Yi
关键词-EN: lids hold valuable, hold valuable information, medical intervention, lids hold, hold valuable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Periorbital distances and features around the eyes and lids hold valuable information for disease quantification and monitoring of surgical and medical intervention. These distances are commonly measured manually, a process that is both subjective and highly time-consuming. Here, we developed three deep-learning methods for segmentation and periorbital distance prediction, and also evaluated the utility of periorbital distances for disease classification. The MAE of our deep-learning-predicted distances was less than or very close to the error observed between trained human annotators. We compared our models to the current state-of-the-art (SOTA) method for periorbital distance prediction and found that our methods outperformed SOTA on all of our datasets on all but one periorbital measurement. We also show that robust segmentation can be achieved on diseased eyes using models trained on open-source images of healthy eyes, and that periorbital distances can serve as high-quality features in downstream classification models. Leveraging segmentation networks as intermediary steps in classification has broad implications for increasing the generalizability of classification models in ophthalmic plastic and craniofacial surgery by avoiding the out-of-distribution problem observed in traditional convolutional neural networks.
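The headline comparison above (model error vs. inter-annotator error) reduces to two MAE computations. All numbers below are invented for illustration, not from the paper:

```python
# Mean absolute error of predicted periorbital distances against a reference
# annotator, compared with the error between two human annotators.
def mae(pred, ref):
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

annotator_a = [10.2, 8.1, 9.5, 11.0]   # distances in mm (hypothetical)
annotator_b = [10.0, 8.3, 9.4, 11.2]
model_pred  = [10.1, 8.2, 9.6, 10.9]

print(mae(model_pred, annotator_a))    # model-vs-human error
print(mae(annotator_b, annotator_a))   # human-vs-human error
```

If the first number is at or below the second, the model's predictions are within the disagreement band of trained human annotators, which is the criterion the abstract appeals to.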

[CV-30] Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

链接: https://arxiv.org/abs/2409.18764
作者: James Ford,Xingmeng Zhao,Dan Schumacher,Anthony Rios
关键词-EN: Visual Question Answering, leverages Visual Question, Question Answering, Visual Question, framework that leverages
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods often rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess data representation quality and the general communicative clarity of charts. Experiments were conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI’s GPT-3.5 Turbo and Meta’s Llama 3.1 70B-Instruct models. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures. Moreover, while our results demonstrate that few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains to be made before LLMs can fully match the precision of human-generated graphs. This underscores the importance of our work, which expedites the research process by enabling rapid iteration without the need for human annotation, thus accelerating advancements in this field.

[CV-31] Enhancing Explainability in Multimodal Large Language Models Using Ontological Context

链接: https://arxiv.org/abs/2409.18753
作者: Jihen Amara,Birgitta König-Ries,Sheeba Samuel
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, visual question answering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, there has been a growing interest in Multimodal Large Language Models (MLLMs) due to their remarkable potential in various tasks integrating different modalities, such as image and text, as well as applications such as image captioning and visual question answering. However, such models still face challenges in accurately captioning and interpreting specific visual concepts and classes, particularly in domain-specific applications. We argue that integrating domain knowledge in the form of an ontology can significantly address these issues. In this work, as a proof of concept, we propose a new framework that combines ontology with MLLMs to classify images of plant diseases. Our method uses concepts about plant diseases from an existing disease ontology to query MLLMs and extract relevant visual concepts from images. Then, we use the reasoning capabilities of the ontology to classify the disease according to the identified concepts. Ensuring that the model accurately uses the concepts describing the disease is crucial in domain-specific applications. By employing an ontology, we can assist in verifying this alignment. Additionally, using the ontology’s inference capabilities increases transparency, explainability, and trust in the decision-making process while serving as a judge by checking if the annotations of the concepts by MLLMs are aligned with those in the ontology and displaying the rationales behind their errors. Our framework offers a new direction for synergizing ontologies and MLLMs, supported by an empirical study using different well-known MLLMs.

[CV-32] MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

链接: https://arxiv.org/abs/2409.18737
作者: Jingyu Song,Xudong Chen,Liupei Lu,Jie Li,Katherine A. Skinner
关键词-EN: autonomous driving systems, maps provide environmental, provide environmental information, safe planning, provide environmental
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:High-definition (HD) maps provide environmental information for autonomous driving systems and are essential for safe planning. While existing methods with single-frame input achieve impressive performance for online vectorized HD map construction, they still struggle with complex scenarios and occlusions. We propose MemFusionMap, a novel temporal fusion model with enhanced temporal reasoning capabilities for online HD map construction. Specifically, we contribute a working memory fusion module that improves the model’s memory capacity to reason across history frames. We also design a novel temporal overlap heatmap to explicitly inform the model about the temporal overlap information and vehicle trajectory in the Bird’s Eye View space. By integrating these two designs, MemFusionMap significantly outperforms existing methods while also maintaining a versatile design for scalability. We conduct extensive evaluation on open-source benchmarks and demonstrate a maximum improvement of 5.4% in mAP over state-of-the-art methods. The code for MemFusionMap will be made open-source upon publication of this paper.

[CV-33] Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval

Link: https://arxiv.org/abs/2409.18733
Authors: Mankeerat Sidhu, Hetarth Chopra, Ansel Blume, Jeonghwan Kim, Revanth Gangi Reddy, Heng Ji
Keywords: open-vocabulary object detection, long-tail object detection, training-free, web-image retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this paper, we introduce SearchDet, a training-free long-tail object detection framework that significantly enhances open-vocabulary object detection performance. SearchDet retrieves a set of positive and negative images of an object to ground, embeds these images, and computes an input image-weighted query which is used to detect the desired concept in the image. Our proposed method is simple and training-free, yet achieves over 48.7% mAP improvement on ODinW and 59.1% mAP improvement on LVIS compared to state-of-the-art models such as GroundingDINO. We further show that our approach of basing object detection on a set of Web-retrieved exemplars is stable with respect to variations in the exemplars, suggesting a path towards eliminating costly data annotation and training procedures.
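The core step of turning Web-retrieved exemplars into a detection query can be sketched in a few lines. Everything below (the function name, the dot-product weighting, the toy embeddings) is an illustrative assumption, not SearchDet's actual implementation:

```python
import numpy as np

def build_query(img_emb, pos, neg):
    # Weight retrieved positive/negative exemplar embeddings by their
    # similarity to the input image, then take the weighted difference as a
    # unit-norm detection query (a sketch of the "image-weighted query" idea).
    wp = pos @ img_emb                       # similarity of each positive exemplar to the image
    wn = neg @ img_emb                       # same for negative exemplars
    q = (wp[:, None] * pos).sum(axis=0) - (wn[:, None] * neg).sum(axis=0)
    return q / (np.linalg.norm(q) + 1e-8)

# Toy embeddings: the image resembles the positive exemplars.
img = np.array([1.0, 0.0])
pos = np.array([[1.0, 0.0], [0.9, 0.1]])
neg = np.array([[0.0, 1.0]])
query = build_query(img, pos, neg)
```

With these toy inputs the query points almost entirely along the positive-exemplar direction, which is the behavior the weighting is meant to produce.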

[CV-34] Learning from Pattern Completion: Self-supervised Controllable Generation

Link: https://arxiv.org/abs/2409.18694
Authors: Zhiqiang Chen, Guofan Fan, Jinying Gao, Lei Ma, Bo Lei, Tiejun Huang, Shan Yu
Keywords: visual attributes, controllable generation, self-supervised learning, pattern completion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The human brain exhibits a strong ability to spontaneously associate different visual attributes of the same or similar visual scene, such as associating sketches and graffiti with real-world visual objects, usually without supervising information. In contrast, in the field of artificial intelligence, controllable generation methods like ControlNet heavily rely on annotated training datasets such as depth maps, semantic segmentation maps, and poses, which limits the method’s scalability. Inspired by the neural mechanisms that may contribute to the brain’s associative power, specifically the cortical modularization and hippocampal pattern completion, here we propose a self-supervised controllable generation (SCG) framework. Firstly, we introduce an equivariant constraint to promote inter-module independence and intra-module correlation in a modular autoencoder network, thereby achieving functional specialization. Subsequently, based on these specialized modules, we employ a self-supervised pattern completion approach for controllable generation training. Experimental results demonstrate that the proposed modular autoencoder effectively achieves functional specialization, including the modular processing of color, brightness, and edge detection, and exhibits brain-like features including orientation selectivity, color antagonism, and center-surround receptive fields. Through self-supervised training, associative generation capabilities spontaneously emerge in SCG, demonstrating excellent generalization ability to various tasks such as associative generation on painting, sketches, and ancient graffiti. Compared to the previous representative method ControlNet, our proposed approach not only demonstrates superior robustness in more challenging high-noise scenarios but also possesses more promising scalability potential due to its self-supervised manner.

[CV-35] A Novel Unified Architecture for Low-Shot Counting by Detection and Segmentation NEURIPS2024

Link: https://arxiv.org/abs/2409.18686
Authors: Jer Pelhan, Alan Lukežič, Vitjan Zavrtanik, Matej Kristan
Keywords: low-shot object counting, annotated exemplars, object detection, counting loss
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS2024

Click to view abstract

Abstract:Low-shot object counters estimate the number of objects in an image using few or no annotated exemplars. Objects are localized by matching them to prototypes, which are constructed by unsupervised image-wide object appearance aggregation. Due to potentially diverse object appearances, the existing approaches often lead to overgeneralization and false positive detections. Furthermore, the best-performing methods train object localization by a surrogate loss that predicts a unit Gaussian at each object center. This loss is sensitive to annotation errors and hyperparameters, and does not directly optimize the detection task, leading to suboptimal counts. We introduce GeCo, a novel low-shot counter that achieves accurate object detection, segmentation, and count estimation in a unified architecture. GeCo robustly generalizes the prototypes across object appearances through a novel dense object query formulation. In addition, a novel counting loss is proposed that directly optimizes the detection task and avoids the issues of the standard surrogate loss. GeCo surpasses the leading few-shot detection-based counters by ~25% in the total count MAE, achieves superior detection accuracy, and sets a new solid state-of-the-art result across all low-shot counting setups.

[CV-36] Image-guided topic modeling for interpretable privacy classification ECCV2024

Link: https://arxiv.org/abs/2409.18674
Authors: Alina Elena Baia, Andrea Cavallaro
Keywords: image privacy, content descriptors, topic modeling, interpretability
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted at the eXCV Workshop at ECCV 2024. Supplementary material included. Code available at this https URL

Click to view abstract

Abstract:Predicting and explaining the private information contained in an image in human-understandable terms is a complex and contextual task. This task is challenging even for large language models. To facilitate the understanding of privacy decisions, we propose to predict image privacy based on a set of natural language content descriptors. These content descriptors are associated with privacy scores that reflect how people perceive image content. We generate descriptors with our novel Image-guided Topic Modeling (ITM) approach. ITM leverages, via multimodality alignment, both vision information and image textual descriptions from a vision language model. We use the ITM-generated descriptors to learn a privacy predictor, Priv×ITM, whose decisions are interpretable by design. Our Priv×ITM classifier outperforms the reference interpretable method by 5 percentage points in accuracy and performs comparably to the current non-interpretable state-of-the-art model.

[CV-37] Exploiting Motion Prior for Accurate Pose Estimation of Dashboard Cameras

Link: https://arxiv.org/abs/2409.18673
Authors: Yipeng Lu, Yifan Zhao, Haiping Wang, Zhiwei Ruan, Yuan Liu, Zhen Dong, Bisheng Yang
Keywords: dashboard cameras, driving videos, driving map production, camera pose estimation, motion prior
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Dashboard cameras (dashcams) record millions of driving videos daily, offering a valuable potential data source for various applications, including driving map production and updates. A necessary step in utilizing these dashcam data involves the estimation of camera poses. However, the low-quality images captured by dashcams, characterized by motion blur and dynamic objects, pose challenges for existing image-matching methods in accurately estimating camera poses. In this study, we propose a precise pose estimation method for dashcam images that leverages the inherent camera motion prior. Typically, image sequences captured by dashcams exhibit pronounced motion priors, such as forward movement or lateral turns, which serve as essential cues for correspondence estimation. Building upon this observation, we devise a pose regression module aimed at learning the camera motion prior, subsequently integrating these priors into both the correspondence and pose estimation processes. Experiments show that, on a real dashcam dataset, our method is 22% better than the baseline for pose estimation in AUC@5°, and it can estimate poses for 19% more images with less reprojection error in Structure from Motion (SfM).

[CV-38] When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

Link: https://arxiv.org/abs/2409.18653
Authors: Yuli Zhou, Guolei Sun, Yawei Li, Luca Benini, Ender Konukoglu
Keywords: video camouflaged object segmentation (VCOS), SAM2, zero-shot segmentation, fine-tuning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report

Click to view abstract

Abstract:This study investigates the application and performance of the Segment Anything Model 2 (SAM2) in the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects that blend seamlessly in the surroundings for videos, due to similar colors and textures, poor light conditions, etc. Compared to the objects in normal scenes, camouflaged objects are much more difficult to detect. SAM2, a video foundation model, has shown potential in various tasks. But its effectiveness in dynamic camouflaged scenarios remains under-explored. This study presents a comprehensive study on SAM2’s ability in VCOS. First, we assess SAM2’s performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we specifically adapt SAM2 by fine-tuning it on the video camouflaged dataset. Our comprehensive experiments demonstrate that SAM2 has excellent zero-shot ability of detecting camouflaged objects in videos. We also show that this ability could be further improved by specifically adjusting SAM2’s parameters for VCOS. The code will be available at this https URL

[CV-39] Enhanced Convolution Neural Network with Optimized Pooling and Hyperparameter Tuning for Network Intrusion Detection

Link: https://arxiv.org/abs/2409.18642
Authors: Ayush Kumar Sharma, Sourav Patel, Supriya Bharat Wakchaure, Abirami S
Keywords: Network Intrusion Detection Systems, Denial of Service, convolutional neural networks, hyperparameter tuning
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures, 4 tables, conference paper

Click to view abstract

Abstract:Network Intrusion Detection Systems (NIDS) are essential for protecting computer networks from malicious activities, including Denial of Service (DoS), Probing, User-to-Root (U2R), and Remote-to-Local (R2L) attacks. Without effective NIDS, networks are vulnerable to significant security breaches and data loss. Machine learning techniques provide a promising approach to enhance NIDS by automating threat detection and improving accuracy. In this research, we propose an Enhanced Convolutional Neural Network (EnCNN) for NIDS and evaluate its performance using the KDDCUP’99 dataset. Our methodology includes comprehensive data preprocessing, exploratory data analysis (EDA), and feature engineering. We compare EnCNN with various machine learning algorithms, including Logistic Regression, Decision Trees, Support Vector Machines (SVM), and ensemble methods like Random Forest, AdaBoost, and Voting Ensemble. The results show that EnCNN significantly improves detection accuracy, with a notable 10% increase over state-of-the-art approaches. This demonstrates the effectiveness of EnCNN in real-time network intrusion detection, offering a robust solution for identifying and mitigating security threats, and enhancing overall network resilience.

[CV-40] Unsupervised Fingerphoto Presentation Attack Detection With Diffusion Models

Link: https://arxiv.org/abs/2409.18636
Authors: Hailin Li, Raghavendra Ramachandra, Mohamed Ragab, Soumik Mondal, Yong Kiam Tan, Khin Mi Mi Aung
Keywords: contactless fingerphoto authentication, smartphone camera technology, presentation attack detection, diffusion models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCB 2024

Click to view abstract

Abstract:Smartphone-based contactless fingerphoto authentication has become a reliable alternative to traditional contact-based fingerprint biometric systems owing to rapid advances in smartphone camera technology. Despite its convenience, fingerprint authentication through fingerphotos is more vulnerable to presentation attacks, which has motivated recent research efforts towards developing fingerphoto Presentation Attack Detection (PAD) techniques. However, prior PAD approaches utilized supervised learning methods that require labeled training data for both bona fide and attack samples. This can suffer from two key issues, namely (i) generalization: the detection of novel presentation attack instruments (PAIs) unseen in the training data, and (ii) scalability: the collection of a large dataset of attack samples using different PAIs. To address these challenges, we propose a novel unsupervised approach based on a state-of-the-art deep-learning-based diffusion model, the Denoising Diffusion Probabilistic Model (DDPM), which is trained solely on bona fide samples. The proposed approach detects Presentation Attacks (PA) by calculating the reconstruction similarity between the input and output pairs of the DDPM. We present extensive experiments across three PAI datasets to test the accuracy and generalization capability of our approach. The results show that the proposed DDPM-based PAD method achieves significantly better detection error rates on several PAI classes compared to other baseline unsupervised approaches.
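The detection principle is that a model trained only on bona fide samples reconstructs genuine inputs well and attacks poorly. A minimal sketch of that scoring step follows; the `reconstruct` callable stands in for the trained DDPM, and the cosine similarity and threshold are illustrative choices, not the paper's exact metric:

```python
import numpy as np

def pad_score(image, reconstruct, threshold=0.9):
    # `reconstruct` stands in for a DDPM trained only on bona fide samples.
    recon = reconstruct(image)
    a, b = image.ravel(), recon.ravel()
    # Cosine similarity as a simple reconstruction-similarity measure.
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sim, sim < threshold  # low similarity -> flag as presentation attack

rng = np.random.default_rng(0)
fingerphoto = rng.random((8, 8))

# A perfect model (identity) reconstructs the bona fide sample exactly,
# while a model that fails to reconstruct the input yields low similarity.
sim_bona, flag_bona = pad_score(fingerphoto, lambda x: x)
sim_attack, flag_attack = pad_score(fingerphoto, lambda x: 1.0 - x)
```

In practice the threshold would be calibrated on a held-out bona fide set rather than fixed.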

[CV-41] From One to the Power of Many: Augmentations for Invariance to Multi-LiDAR Perception from Single-Sensor Datasets

Link: https://arxiv.org/abs/2409.18592
Authors: Marc Uecker, J. Marius Zöllner
Keywords: LiDAR perception, deep neural networks, data augmentation, multi-sensor setups
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Recently, LiDAR perception methods for autonomous vehicles, powered by deep neural networks have experienced steep growth in performance on classic benchmarks, such as nuScenes and SemanticKITTI. However, there are still large gaps in performance when deploying models trained on such single-sensor setups to modern multi-sensor vehicles. In this work, we investigate if a lack of invariance may be responsible for these performance gaps, and propose some initial solutions in the form of application-specific data augmentations, which can facilitate better transfer to multi-sensor LiDAR setups. We provide experimental evidence that our proposed augmentations improve generalization across LiDAR sensor setups, and investigate how these augmentations affect the models’ invariance properties on simulations of different LiDAR sensor setups.

[CV-42] Off to new Shores: A Dataset & Benchmark for (near-)coastal Flood Inundation Forecasting NEURIPS2024

Link: https://arxiv.org/abs/2409.18591
Authors: Brandon Victor, Mathilde Letard, Peter Naylor, Karim Douch, Nicolas Longépé, Zhen He, Patrick Ebel
Keywords: natural hazards, flood inundation forecasting, dataset, benchmark
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2024 Datasets & Benchmarks

Click to view abstract

Abstract:Floods are among the most common and devastating natural hazards, imposing immense costs on our society and economy due to their disastrous consequences. Recent progress in weather prediction and spaceborne flood mapping demonstrated the feasibility of anticipating extreme events and reliably detecting their catastrophic effects afterwards. However, these efforts are rarely linked to one another and there is a critical lack of datasets and benchmarks to enable the direct forecasting of flood extent. To resolve this issue, we curate a novel dataset enabling a timely prediction of flood extent. Furthermore, we provide a representative evaluation of state-of-the-art methods, structured into two benchmark tracks for forecasting flood inundation maps (i) in general and (ii) focused on coastal regions. Altogether, our dataset and benchmark provide a comprehensive platform for evaluating flood forecasts, enabling future solutions for this critical challenge. Data, code & models are shared at this https URL under a CC0 license.

[CV-43] Cross-video Identity Correlating for Person Re-identification Pre-training NEURIPS2024

Link: https://arxiv.org/abs/2409.18569
Authors: Jialong Zuo, Ying Nie, Hanyu Zhou, Huaxin Zhang, Haoyu Wang, Tianyu Guo, Nong Sang, Changxin Gao
Keywords:
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024 Accepted Paper

Click to view abstract

[CV-44] Harmonizing knowledge Transfer in Neural Network with Unified Distillation

Link: https://arxiv.org/abs/2409.18565
Authors: Yaomin Huang, Zaomin Yan, Chaomin Shen, Faming Fang, Guixu Zhang
Keywords: knowledge distillation, intermediate features, unified distribution constraint
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Knowledge distillation (KD), known for its ability to transfer knowledge from a cumbersome network (teacher) to a lightweight one (student) without altering the architecture, has been garnering increasing attention. Two primary categories emerge within KD methods: feature-based, focusing on intermediate layers’ features, and logits-based, targeting the final layer’s logits. This paper introduces a novel perspective by leveraging diverse knowledge sources within a unified KD framework. Specifically, we aggregate features from intermediate layers into a comprehensive representation, effectively gathering semantic information from different stages and scales. Subsequently, we predict the distribution parameters from this representation. These steps transform knowledge from the intermediate layers into corresponding distributive forms, thereby allowing for knowledge distillation through a unified distribution constraint at different stages of the network, ensuring the comprehensiveness and coherence of knowledge transfer. Numerous experiments were conducted to validate the effectiveness of the proposed method.
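A toy version of the idea above (pool intermediate features into one representation, summarize it as distribution parameters, and constrain student and teacher distributions) might look as follows. The scalar Gaussian summary and KL constraint are simplifying assumptions for illustration, not the paper's actual parameterization:

```python
import numpy as np

def stage_to_distribution(feats):
    # Global-average-pool each intermediate feature map (C, H, W), concatenate
    # the stage vectors into one multi-scale representation, and summarize it
    # as scalar Gaussian parameters (mean, variance).
    pooled = [f.mean(axis=(1, 2)) for f in feats]
    rep = np.concatenate(pooled)
    return rep.mean(), rep.var() + 1e-6

def gaussian_kl(p, q):
    # KL(N(mu_p, var_p) || N(mu_q, var_q)): a unified distribution constraint
    # applied between student and teacher summaries.
    (mp, vp), (mq, vq) = p, q
    return 0.5 * (np.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)

rng = np.random.default_rng(0)
teacher = [rng.normal(size=(16, 8, 8)), rng.normal(size=(32, 4, 4))]
student = [rng.normal(size=(16, 8, 8)), rng.normal(size=(32, 4, 4))]

loss_match = gaussian_kl(stage_to_distribution(teacher), stage_to_distribution(teacher))
loss_mismatch = gaussian_kl(stage_to_distribution(student), stage_to_distribution(teacher))
```

Matching features give zero loss, mismatched features a positive one, which is the property a distillation objective needs.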

[CV-45] AL-GTD: Deep Active Learning for Gaze Target Detection

Link: https://arxiv.org/abs/2409.18561
Authors: Francesco Tonini, Nicola Dall’Asen, Lorenzo Vaquero, Cigdem Beyan, Elisa Ricci
Keywords: gaze target detection, active learning, pseudo-labeling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM Multimedia 2024

Click to view abstract

Abstract:Gaze target detection aims at determining the image location where a person is looking. While existing studies have made significant progress in this area by regressing accurate gaze heatmaps, these achievements have largely relied on access to extensive labeled datasets, which demands substantial human labor. In this paper, our goal is to reduce the reliance on the size of labeled training data for gaze target detection. To achieve this, we propose AL-GTD, an innovative approach that integrates supervised and self-supervised losses within a novel sample acquisition function to perform active learning (AL). Additionally, it utilizes pseudo-labeling to mitigate distribution shifts during the training phase. AL-GTD achieves the best of all AUC results by utilizing only 40-50% of the training data, in contrast to state-of-the-art (SOTA) gaze target detectors requiring the entire training dataset to achieve the same performance. Importantly, AL-GTD quickly reaches satisfactory performance with 10-20% of the training data, showing the effectiveness of our acquisition function, which is able to acquire the most informative samples. We provide a comprehensive experimental analysis by adapting several AL methods for the task. AL-GTD outperforms AL competitors, simultaneously exhibiting superior performance compared to SOTA gaze target detectors when all are trained within a low-data regime. Code is available at this https URL.
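The acquisition step (rank unlabeled samples by a combination of supervised and self-supervised signals, then request labels for only the most informative ones) reduces to a weighted sort. The scores and the `alpha` weighting below are hypothetical stand-ins for AL-GTD's actual acquisition function:

```python
import numpy as np

def acquire(sup_scores, ssl_scores, budget, alpha=0.5):
    # Rank unlabeled samples by a weighted mix of a supervised signal and a
    # self-supervised signal, then request labels for the top `budget` only.
    combined = alpha * sup_scores + (1.0 - alpha) * ssl_scores
    return np.argsort(combined)[::-1][:budget]

sup = np.array([0.1, 0.9, 0.4, 0.7])   # e.g. heatmap-regression uncertainty
ssl = np.array([0.2, 0.8, 0.9, 0.1])   # e.g. a self-supervised loss signal
picked = acquire(sup, ssl, budget=2)
```

Only the `budget` samples with the highest combined score are sent for annotation, which is how such methods reach full-data performance with 40-50% of the labels.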

[CV-46] CodeSCAN: ScreenCast ANalysis for Video Programming Tutorials

Link: https://arxiv.org/abs/2409.18556
Authors: Alexander Naumann, Felix Hertlein, Jacqueline Höllig, Lucas Cazzonelli, Steffen Thoma
Keywords: coding screencasts, programming tutorials, dataset, OCR
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving both novices and experienced developers. However, the video format of these tutorials presents a challenge due to the difficulty of searching for and within videos. Addressing the absence of large-scale and diverse datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual Studio Code environment during development, featuring 24 programming languages, 25 fonts, and over 90 distinct themes, in addition to diverse layout changes and realistic user interactions. Moreover, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of Integrated Development Environment (IDE) element detection, color-to-black-and-white conversion, and Optical Character Recognition (OCR). We hope that our contributions facilitate more research in coding screencast analysis, and we make the source code for creating the dataset and the benchmark publicly available on this website.

[CV-47] Efficient Noise Mitigation for Enhancing Inference Accuracy in DNNs on Mixed-Signal Accelerators

Link: https://arxiv.org/abs/2409.18553
Authors: Seyedarmin Azizi, Mohammad Erfan Sadeghi, Mehdi Kamal, Massoud Pedram
Keywords: analog computing components, analog neural networks, denoising blocks, mixed-signal accelerators
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this paper, we propose a framework to enhance the robustness of the neural models by mitigating the effects of process-induced and aging-related variations of analog computing components on the accuracy of the analog neural networks. We model these variations as the noise affecting the precision of the activations and introduce a denoising block inserted between selected layers of a pre-trained model. We demonstrate that training the denoising block significantly increases the model’s robustness against various noise levels. To minimize the overhead associated with adding these blocks, we present an exploration algorithm to identify optimal insertion points for the denoising blocks. Additionally, we propose a specialized architecture to efficiently execute the denoising blocks, which can be integrated into mixed-signal accelerators. We evaluate the effectiveness of our approach using Deep Neural Network (DNN) models trained on the ImageNet and CIFAR-10 datasets. The results show that on average, by accepting 2.03% parameter count overhead, the accuracy drop due to the variations reduces from 31.7% to 1.15%.
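As a minimal illustration of the denoising-block idea, one can fit a linear map that pulls noisy activations back toward their clean values. The paper's blocks are trained neural modules inserted between layers, so the least-squares denoiser below is only a toy stand-in under an additive-noise assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(size=(1000, 8))                  # ideal layer activations
noisy = clean + 0.3 * rng.normal(size=clean.shape)  # analog variations modeled as noise

# Least-squares linear "denoising block": map noisy activations back toward
# their clean values before passing them to the next layer.
W, _, _, _ = np.linalg.lstsq(noisy, clean, rcond=None)
denoised = noisy @ W

mse_before = float(np.mean((noisy - clean) ** 2))
mse_after = float(np.mean((denoised - clean) ** 2))
```

Even this linear block reduces the activation error, which is the same accuracy-recovery effect the trained blocks provide at larger scale.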

[CV-48] Reducing Semantic Ambiguity In Domain Adaptive Semantic Segmentation Via Probabilistic Prototypical Pixel Contrast

Link: https://arxiv.org/abs/2409.18543
Authors: Xiaoke Hao, Shiyu Liu, Chuanbo Feng, Ye Zhu
Keywords: domain adaptation, domain shift, probabilistic prototypical pixel contrast, semantic segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: revise

Click to view abstract

Abstract:Domain adaptation aims to reduce the model degradation on the target domain caused by the domain shift between the source and target domains. Although encouraging performance has been achieved by combining cognitive learning with the self-training paradigm, they suffer from ambiguous scenarios caused by scale, illumination, or overlapping when deploying deterministic embedding. To address these issues, we propose probabilistic prototypical pixel contrast (PPPC), a universal adaptation framework that models each pixel embedding as a probability via a multivariate Gaussian distribution to fully exploit the uncertainty within them, eventually improving the representation quality of the model. In addition, we derive prototypes from posterior probability estimation, which helps to push the decision boundary away from the ambiguity points. Moreover, we employ an efficient method to compute similarity between distributions, eliminating the need for sampling and reparameterization, thereby significantly reducing computational overhead. Further, we dynamically select the ambiguous crops at the image level to enlarge the number of boundary points involved in contrastive learning, which benefits the establishment of precise distributions for each category. Extensive experimentation demonstrates that PPPC not only helps to address ambiguity at the pixel level, yielding discriminative representations, but also achieves significant improvements in both synthetic-to-real and day-to-night adaptation tasks. It surpasses the previous state-of-the-art (SOTA) by +5.2% mIoU in the most challenging daytime-to-nighttime adaptation scenario, exhibiting stronger generalization on other unseen datasets. The code and models are available at this https URL.
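The sampling-free similarity between Gaussian pixel embeddings has a closed form; for diagonal Gaussians one convenient choice is the mutual-likelihood-style score below. This is an illustrative formulation under assumed diagonal covariances, not necessarily the exact measure used in PPPC:

```python
import numpy as np

def mutual_likelihood(mu1, var1, mu2, var2):
    # Closed-form log-likelihood that two diagonal-Gaussian pixel embeddings
    # describe the same point: log N(mu1 - mu2; 0, var1 + var2), summed over
    # dimensions. No sampling or reparameterization is required.
    v = var1 + var2
    return float(-0.5 * np.sum((mu1 - mu2) ** 2 / v + np.log(2.0 * np.pi * v)))

mu, var = np.zeros(4), np.ones(4)
score_same = mutual_likelihood(mu, var, mu, var)       # identical embeddings
score_far = mutual_likelihood(mu, var, mu + 3.0, var)  # distant embeddings
```

Identical embeddings score higher than distant ones, and the variance terms naturally down-weight uncertain pixels, which is the point of the probabilistic formulation.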

[CV-49] How Effective is Pre-training of Large Masked Autoencoders for Downstream Earth Observation Tasks?

Link: https://arxiv.org/abs/2409.18536
Authors: Jose Sosa, Mohamed Aloulou, Danila Rukhovich, Rim Sleimi, Boonyarit Changaival, Anis Kacem, Djamila Aouada
Keywords: self-supervised pre-training, Earth Observation, Vision Transformer, masked autoencoders
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Self-supervised pre-training has proven highly effective for many computer vision tasks, particularly when labelled data are scarce. In the context of Earth Observation (EO), foundation models and various other Vision Transformer (ViT)-based approaches have been successfully applied for transfer learning to downstream tasks. However, it remains unclear under which conditions pre-trained models offer significant advantages over training from scratch. In this study, we investigate the effectiveness of pre-training ViT-based Masked Autoencoders (MAE) for downstream EO tasks, focusing on reconstruction, segmentation, and classification. We consider two large ViT-based MAE pre-trained models: a foundation model (Prithvi) and SatMAE. We evaluate Prithvi on reconstruction and segmentation-based downstream tasks, and for SatMAE we assess its performance on a classification downstream task. Our findings suggest that pre-training is particularly beneficial when the fine-tuning task closely resembles the pre-training task, e.g. reconstruction. In contrast, for tasks such as segmentation or classification, training from scratch with specific hyperparameter adjustments proved to be equally or more effective.

[CV-50] Prompt-Driven Temporal Domain Adaptation for Nighttime UAV Tracking IROS2024

Link: https://arxiv.org/abs/2409.18533
Authors: Changhong Fu, Yiheng Wang, Liangliang Yao, Guangze Zheng, Haobo Zuo, Jia Pan
Keywords: nighttime UAV tracking, domain adaptation, temporal contexts
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IROS2024

Click to view abstract

Abstract:Nighttime UAV tracking under low-illuminated scenarios has achieved great progress by domain adaptation (DA). However, previous DA training-based works are deficient in narrowing the discrepancy of temporal contexts for UAV trackers. To address the issue, this work proposes a prompt-driven temporal domain adaptation training framework to fully utilize temporal contexts for challenging nighttime UAV tracking, i.e., TDA. Specifically, the proposed framework aligns the distribution of temporal contexts from daytime and nighttime domains by training the temporal feature generator against the discriminator. The temporal-consistent discriminator progressively extracts shared domain-specific features to generate coherent domain discrimination results in the time series. Additionally, to obtain high-quality training samples, a prompt-driven object miner is employed to precisely locate objects in unannotated nighttime videos. Moreover, a new benchmark for long-term nighttime UAV tracking is constructed. Exhaustive evaluations on both public and self-constructed nighttime benchmarks demonstrate the remarkable performance of the tracker trained in TDA framework, i.e., TDA-Track. Real-world tests at nighttime also show its practicality. The code and demo videos are available at this https URL.

[CV-51] Token Caching for Diffusion Transformer Acceleration

Link: https://arxiv.org/abs/2409.18523
Authors: Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma
Keywords: diffusion transformers, generative modeling, token caching, inference acceleration
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their high computational cost, arising from the quadratic computational complexity of attention mechanisms and multi-step inference, presents a significant bottleneck. To address this challenge, we propose TokenCache, a novel post-training acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations among tokens across inference steps. TokenCache specifically addresses three critical questions in the context of diffusion transformers: (1) which tokens should be pruned to eliminate redundancy, (2) which blocks should be targeted for efficient pruning, and (3) at which time steps caching should be applied to balance speed and quality. In response to these challenges, TokenCache introduces a Cache Predictor that assigns importance scores to tokens, enabling selective pruning without compromising model performance. Furthermore, we propose an adaptive block selection strategy to focus on blocks with minimal impact on the network’s output, along with a Two-Phase Round-Robin (TPRR) scheduling policy to optimize caching intervals throughout the denoising process. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers. Our code will be publicly available.
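The caching idea (recompute only the tokens a predictor scores as important, and reuse cached outputs for the rest) can be sketched as follows. The top-k selection, `keep_ratio`, and toy block are illustrative assumptions, not TokenCache's actual predictor or scheduling policy:

```python
import numpy as np

def cached_block(tokens, cache, scores, keep_ratio, block_fn):
    # Recompute only the top-scoring tokens through the block; the rest reuse
    # the cached output from the previous denoising step.
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    recompute = np.argsort(scores)[-k:]           # k most important tokens
    out = cache.copy()
    out[recompute] = block_fn(tokens[recompute])  # only these pay the compute cost
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
cache = np.zeros((8, 4))                          # previous step's block output
scores = np.arange(8.0)                           # pretend a Cache Predictor produced these
out = cached_block(tokens, cache, scores, keep_ratio=0.25, block_fn=lambda t: 2.0 * t)
```

With `keep_ratio=0.25`, only two of eight tokens go through the block; the compute saving scales with the fraction of tokens served from the cache.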

[CV-52] Neural Video Representation for Redundancy Reduction and Consistency Preservation

Link: https://arxiv.org/abs/2409.18497
Authors: Taiga Hayami, Takahiro Shindo, Shunsuke Akamatsu, Hiroshi Watanabe
Keywords:
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

[CV-53] Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

Link: https://arxiv.org/abs/2409.18478
Authors: Min Yang, Zichen Zhang, Limin Wang
Keywords: temporal action detection, temporal action segmentation, generic event boundary detection, unified framework
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, there remains a dearth of a unified framework capable of simultaneously addressing multiple tasks, which is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, coined as Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks and achieve advantages compared with single-task training on this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, which yields superior performance to the specific model.

[CV-54] Underwater Image Enhancement with Physical-based Denoising Diffusion Implicit Models

链接: https://arxiv.org/abs/2409.18476
作者: Nguyen Gia Bach,Chanh Minh Tran,Eiji Kamioka,Phan Xuan Tan
关键词-EN: autonomous underwater vehicles, enhancing degraded underwater, resource-constrained AUV, degraded underwater images, sufficient model computational
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Underwater vision is crucial for autonomous underwater vehicles (AUVs), and enhancing degraded underwater images in real-time on a resource-constrained AUV is a key challenge due to factors like light absorption and scattering, as well as the model computational complexity needed to resolve such factors. Traditional image enhancement techniques lack adaptability to varying underwater conditions, while learning-based methods, particularly those using convolutional neural networks (CNNs) and generative adversarial networks (GANs), offer more robust solutions but face limitations such as inadequate enhancement, unstable training, or mode collapse. Denoising diffusion probabilistic models (DDPMs) have emerged as a state-of-the-art approach in image-to-image tasks, but the recent UW-DDPM solution requires intensive computation to achieve the desired underwater image enhancement (UIE). To address these challenges, this paper introduces UW-DiffPhys, a novel physical-based and diffusion-based UIE approach. UW-DiffPhys combines light-computation physical-based UIE network components with a denoising U-Net to replace the computationally intensive distribution transformation U-Net in the existing UW-DDPM framework, reducing complexity while maintaining performance. Additionally, the Denoising Diffusion Implicit Model (DDIM) is employed to accelerate the inference process through non-Markovian sampling. Experimental results demonstrate that UW-DiffPhys achieved a substantial reduction in computational complexity and inference time compared to UW-DDPM, with competitive performance in key metrics such as PSNR, SSIM, and UCIQE, and an improvement in the overall underwater image quality metric UIQM. The implementation code can be found at the following repository: this https URL

[CV-55] Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration NEURIPS2024

链接: https://arxiv.org/abs/2409.18461
作者: Mahdi Morafah,Vyacheslav Kungurtsev,Hojin Chang,Chen Chen,Bill Lin
关键词-EN: user data privacy, preserving user data, collaborative machine learning, Federated Learning, data privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Federated Learning has emerged as a promising paradigm for collaborative machine learning, while preserving user data privacy. Despite its potential, standard FL lacks support for diverse heterogeneous device prototypes, which vary significantly in model and dataset sizes – from small IoT devices to large workstations. This limitation is only partially addressed by existing knowledge distillation techniques, which often fail to transfer knowledge effectively across a broad spectrum of device prototypes with varied capabilities. This failure primarily stems from two issues: the dilution of informative logits from more capable devices by those from less capable ones, and the use of a single integrated logits as the distillation target across all devices, which neglects their individual learning capacities and the unique contributions of each. To address these challenges, we introduce TAKFL, a novel KD-based framework that treats the knowledge transfer from each device prototype’s ensemble as a separate task, independently distilling each to preserve its unique contributions and avoid dilution. TAKFL also incorporates a KD-based self-regularization technique to mitigate the issues related to the noisy and unsupervised ensemble distillation process. To integrate the separately distilled knowledge, we introduce an adaptive task arithmetic knowledge integration process, allowing each student model to customize the knowledge integration for optimal performance. Additionally, we present theoretical results demonstrating the effectiveness of task arithmetic in transferring knowledge across heterogeneous devices with varying capacities. Comprehensive evaluations of our method across both CV and NLP tasks demonstrate that TAKFL achieves SOTA results in a variety of datasets and settings, significantly outperforming existing KD-based methods. Code is released at this https URL
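The "task arithmetic" integration step is, at its core, a weighted sum of task vectors (deltas between separately distilled weights and a common initialization). The shapes and mixing weights below are illustrative assumptions, not TAKFL's actual configuration:

```python
import numpy as np

# Sketch of task-arithmetic knowledge integration: each separately distilled
# prototype contributes a task vector (its delta from the shared init), and a
# student combines them with its own mixing weights lambda_i.

rng = np.random.default_rng(0)
theta_init = rng.normal(size=8)                      # shared initialization
distilled = [theta_init + 0.1 * rng.normal(size=8)   # per-prototype distilled weights
             for _ in range(3)]

def integrate(theta_init, distilled, lambdas):
    """theta = theta_init + sum_i lambda_i * (theta_i - theta_init)."""
    theta = theta_init.copy()
    for theta_i, lam in zip(distilled, lambdas):
        theta += lam * (theta_i - theta_init)
    return theta

student = integrate(theta_init, distilled, lambdas=[0.5, 0.3, 0.2])
```

Keeping each prototype's knowledge as a separate task vector is what lets each student customize the mixture instead of receiving one diluted distillation target.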

[CV-56] FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

链接: https://arxiv.org/abs/2409.18459
作者: Yuki Imajuku,Yoko Yamakata,Kiyoharu Aizawa
关键词-EN: long-standing focus due, long-standing focus, focus due, diversity and complexity, Multimodal Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people’s lives, making it a vital research area for practical applications such as dietary management. Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, not only in their vast knowledge but also in their ability to handle languages naturally. While English is predominantly used, they can also support multiple languages including Japanese. This suggests that MLLMs are expected to significantly improve performance in food image understanding tasks. We fine-tuned open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved an F1 score of 0.531, surpassing GPT-4o’s F1 score of 0.481, indicating a higher level of accuracy. Furthermore, our model exhibited comparable performance to GPT-4o in generating cooking procedure text.

[CV-57] Enhancing Crime Scene Investigations through Virtual Reality and Deep Learning Techniques

链接: https://arxiv.org/abs/2409.18458
作者: Antonino Zappalà(1),Luca Guarnera(1),Vincenzo Rinaldi(2),Salvatore Livatino(3),Sebastiano Battiato(1) ((1) University of Catania, (2) University of Dundee, (3) University of Hertfordshire)
关键词-EN: crime scene, Crime Scene Investigators, crime scene analysis, scene, pivotal activity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The analysis of a crime scene is a pivotal activity in forensic investigations. Crime Scene Investigators and forensic science practitioners rely on best practices, standard operating procedures, and critical thinking, to produce rigorous scientific reports to document the scenes of interest and meet the quality standards expected in the courts. However, crime scene examination is a complex and multifaceted task often performed in environments susceptible to deterioration, contamination, and alteration, despite the use of contact-free and non-destructive methods of analysis. In this context, the documentation of the sites, and the identification and isolation of traces of evidential value remain challenging endeavours. In this paper, we propose a photogrammetric reconstruction of the crime scene for inspection in virtual reality (VR) and focus on fully automatic object recognition with deep learning (DL) algorithms through a client-server architecture. A pre-trained Faster-RCNN model was chosen as the best method that can best categorize relevant objects at the scene, selected by experts in the VR environment. These operations can considerably improve and accelerate crime scene analysis and help the forensic expert in extracting measurements and analysing in detail the objects under analysis. Experimental results on a simulated crime scene have shown that the proposed method can be effective in finding and recognizing objects with potential evidentiary value, enabling timely analyses of crime scenes, particularly those with health and safety risks (e.g. fires, explosions, chemicals, etc.), while minimizing subjective bias and contamination of the scene.

[CV-58] DynaWeightPnP: Toward global real-time 3D-2D solver in PnP without correspondences

链接: https://arxiv.org/abs/2409.18457
作者: Jingwei Song,Maani Ghaffari
关键词-EN: Kernel Hilbert Space, Reproducing Kernel Hilbert, addresses a special, estimating the optimal, paper addresses
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper addresses a special Perspective-n-Point (PnP) problem: estimating the optimal pose to align 3D and 2D shapes in real-time without correspondences, termed as correspondence-free PnP. While several studies have focused on 3D and 2D shape registration, achieving both real-time and accurate performance remains challenging. This study specifically targets the 3D-2D geometric shape registration tasks, applying the recently developed Reproducing Kernel Hilbert Space (RKHS) to address the “big-to-small” issue. An iterative reweighted least squares method is employed to solve the RKHS-based formulation efficiently. Moreover, our work identifies a unique and interesting observability issue in correspondence-free PnP: the numerical ambiguity between rotation and translation. To address this, we proposed DynaWeightPnP, introducing a dynamic weighting sub-problem and an alternative searching algorithm designed to enhance pose estimation and alignment accuracy. Experiments were conducted on a typical case, that is, a 3D-2D vascular centerline registration task within Endovascular Image-Guided Interventions (EIGIs). Results demonstrated that the proposed algorithm achieves registration processing rates of 60 Hz (without post-refinement) and 31 Hz (with post-refinement) on modern single-core CPUs, with competitive accuracy comparable to existing methods. These results underscore the suitability of DynaWeightPnP for future robot navigation tasks like EIGIs.

[CV-59] Gradient-free Decoder Inversion in Latent Diffusion Models NEURIPS2024

链接: https://arxiv.org/abs/2409.18442
作者: Seongmin Hong,Suh Yoon Jeon,Kyeonghyun Lee,Ernest K. Ryu,Se Young Chun
关键词-EN: denoising diffusion process, diffusion process efficiently, denoising diffusion, diffusion process, process efficiently
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, Accepted to NeurIPS 2024

点击查看摘要

Abstract:In latent diffusion models (LDMs), the denoising diffusion process takes place efficiently in a latent space whose dimension is lower than that of pixel space. A decoder is typically used to transform the representation in latent space to that in pixel space. While a decoder is assumed to have an encoder as an accurate inverse, an exact encoder-decoder pair rarely exists in practice even though applications often require precise inversion of the decoder. Prior works for decoder inversion in LDMs employed gradient descent inspired by inversions of generative adversarial networks. However, gradient-based methods require larger GPU memory and longer computation time for larger latent spaces. For example, recent video LDMs can generate more than 16 frames, but GPUs with 24 GB memory can only perform gradient-based decoder inversion for 4 frames. Here, we propose an efficient gradient-free decoder inversion for LDMs, which can be applied to diverse latent models. The theoretical convergence property of our proposed inversion has been investigated not only for the forward step method, but also for the inertial Krasnoselskii-Mann (KM) iterations under a mild cocoercivity assumption that is satisfied by recent LDMs. Our proposed gradient-free method with Adam optimizer and learning rate scheduling significantly reduced computation time and memory usage over prior gradient-based methods and enabled efficient computation in applications such as noise-space watermarking while achieving comparable error levels.
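The Krasnoselskii-Mann (KM) update at the heart of such gradient-free fixed-point schemes is simple to state. Below is a toy version on an affine contraction; the operator, step size, and problem setup are illustrative choices, not the paper's decoder-inversion setup:

```python
import numpy as np

# Toy Krasnoselskii-Mann iteration z_{k+1} = (1 - a) z_k + a T(z_k),
# which converges to a fixed point of a (non)expansive map T without
# any gradients. T here is an affine contraction chosen for illustration.

A = np.array([[0.5, 0.2], [0.1, 0.4]])   # row sums < 1 -> contraction
b = np.array([1.0, -1.0])
T = lambda z: A @ z + b                  # fixed point: z* = (I - A)^{-1} b

z = np.zeros(2)
alpha = 0.7                              # averaging parameter in (0, 1)
for _ in range(200):
    z = (1 - alpha) * z + alpha * T(z)

z_star = np.linalg.solve(np.eye(2) - A, b)
print(np.allclose(z, z_star))            # True
```

In decoder inversion, T would be built from decoder evaluations rather than an explicit matrix, so each step needs only forward passes, which is why memory stays flat.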

[CV-60] Search3D: Hierarchical Open-Vocabulary 3D Segmentation

链接: https://arxiv.org/abs/2409.18431
作者: Ayca Takmaz,Alexandros Delitzas,Robert W. Sumner,Francis Engelmann,Johanna Wald,Federico Tombari
关键词-EN: free-form text descriptions, spaces using free-form, text descriptions, enables the exploration, free-form text
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances in a scene. However, they face challenges when it comes to understanding more fine-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation, enabling the search for entities at varying levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Our method aims to expand the capabilities of open vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting less anchored to explicit object-centric queries, compared to prior work. To ensure a systematic evaluation, we also contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. We verify the effectiveness of Search3D across several tasks, demonstrating that our approach outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials.

[CV-61] Robust Network Learning via Inverse Scale Variational Sparsification

链接: https://arxiv.org/abs/2409.18419
作者: Zhiling Zhou,Zirui Liu,Chengming Xu,Yanwei Fu,Xinwei Sun
关键词-EN: made significant strides, including natural corruptions, noise types, including natural, low-resolution artifacts
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages, 7 figures

点击查看摘要

Abstract:While neural networks have made significant strides in many AI tasks, they remain vulnerable to a range of noise types, including natural corruptions, adversarial noise, and low-resolution artifacts. Many existing approaches focus on enhancing robustness against specific noise types, limiting their adaptability to others. Previous studies have addressed general robustness by adopting a spectral perspective, which tends to blur crucial features like texture and object contours. Our proposed solution, however, introduces an inverse scale variational sparsification framework within a time-continuous inverse scale space formulation. This framework progressively learns finer-scale features by discerning variational differences between pixels, ultimately preserving only large-scale features in the smoothed image. Unlike frequency-based methods, our approach not only removes noise by smoothing small-scale features where corruptions often occur but also retains high-contrast details such as textures and object contours. Moreover, our framework offers simplicity and efficiency in implementation. By integrating this algorithm into neural network training, we guide the model to prioritize learning large-scale features. We show the efficacy of our approach through enhanced robustness against various noise types.

[CV-62] A3: Active Adversarial Alignment for Source-Free Domain Adaptation ICMLA2024

链接: https://arxiv.org/abs/2409.18418
作者: Chrisantus Eze,Christopher Crick
关键词-EN: Unsupervised domain adaptation, aims to transfer, source-free UDA, Unsupervised domain, transfer knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICMLA 2024

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent works have focused on source-free UDA, where only target data is available. This is challenging as models rely on noisy pseudo-labels and struggle with distribution shifts. We propose Active Adversarial Alignment (A3), a novel framework combining self-supervised learning, adversarial training, and active learning for robust source-free UDA. A3 actively samples informative and diverse data using an acquisition function for training. It adapts models via adversarial losses and consistency regularization, aligning distributions without source data access. A3 advances source-free UDA through its synergistic integration of active and adversarial learning for effective domain alignment and noise reduction.

[CV-63] Query matching for spatio-temporal action detection with query-based object detector

链接: https://arxiv.org/abs/2409.18408
作者: Shimon Hori,Kazuki Omi,Toru Tamaki
关键词-EN: spatio-temporal action detection, requires maintaining temporal, maintaining temporal consistency, object detection model, query-based object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a method that extends the query-based object detection model, DETR, to spatio-temporal action detection, which requires maintaining temporal consistency in videos. Our proposed method applies DETR to each frame and uses feature shift to incorporate temporal information. However, DETR’s object queries in each frame may correspond to different objects, making a simple feature shift ineffective. To overcome this issue, we propose query matching across different frames, ensuring that queries for the same object are matched and used for the feature shift. Experimental results show that performance on the JHMDB21 dataset improves significantly when query features are shifted using the proposed query matching.
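The cross-frame query matching step can be sketched as a Hungarian assignment on a feature-similarity cost matrix. The cost choice (negative cosine similarity) and the toy data are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch: match DETR-style object queries between frame t and t+1 so that
# queries for the same object are paired before the temporal feature shift.

def match_queries(feat_t, feat_t1):
    a = feat_t / np.linalg.norm(feat_t, axis=1, keepdims=True)
    b = feat_t1 / np.linalg.norm(feat_t1, axis=1, keepdims=True)
    cost = -(a @ b.T)                     # high similarity -> low cost
    rows, cols = linear_sum_assignment(cost)
    return rows, cols                     # query i in t <-> query cols[i] in t+1

rng = np.random.default_rng(0)
feat_t = rng.normal(size=(4, 16))
perm = np.array([2, 0, 3, 1])
feat_t1 = feat_t[perm] + 0.01 * rng.normal(size=(4, 16))  # shuffled + noise

rows, cols = match_queries(feat_t, feat_t1)
print(cols)   # recovers the inverse permutation
```

Once matched, shifting features along `(i, cols[i])` pairs keeps each object's temporal context attached to the right query.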

[CV-64] GenesisTex2: Stable Consistent and High-Quality Text-to-Texture Generation

链接: https://arxiv.org/abs/2409.18401
作者: Jiawei Lu,Yingpeng Zhang,Zengjun Zhao,He Wang,Kun Zhou,Tianjia Shao
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[CV-65] You Only Speak Once to See ICASSP2025

链接: https://arxiv.org/abs/2409.18372
作者: Wenhao Yang,Jianguo Wei,Wenhuan Lu,Lei Li
关键词-EN: grounding remains underexplored, remains underexplored, Grounding, grounding remains, termed Audio Grounding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 4 figures, submitted to ICASSP 2025

点击查看摘要

Abstract:Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, “You Only Speak Once to See,” to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.

[CV-66] Multi-hypotheses Conditioned Point Cloud Diffusion for 3D Human Reconstruction from Occluded Images NEURIPS2024

链接: https://arxiv.org/abs/2409.18364
作者: Donghwan Kim,Tae-Kyun Kim
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures, accepted NeurIPS 2024

点击查看摘要

[CV-67] SinoSynth: A Physics-based Domain Randomization Approach for Generalizable CBCT Image Enhancement MICCAI2024

链接: https://arxiv.org/abs/2409.18355
作者: Yunkui Pang,Yilin Liu,Xu Chen,Pew-Thian Yap,Jun Lian
关键词-EN: Cone Beam Computed, Beam Computed Tomography, Cone Beam, Computed Tomography, Beam Computed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024

点击查看摘要

Abstract:Cone Beam Computed Tomography (CBCT) finds diverse applications in medicine. Ensuring high image quality in CBCT scans is essential for accurate diagnosis and treatment delivery. Yet, the susceptibility of CBCT images to noise and artifacts undermines both their usefulness and reliability. Existing methods typically address CBCT artifacts through image-to-image translation approaches. These methods, however, are limited by the artifact types present in the training data, which may not cover the complete spectrum of CBCT degradations stemming from variations in imaging protocols. Gathering additional data to encompass all possible scenarios can often pose a challenge. To address this, we present SinoSynth, a physics-based degradation model that simulates various CBCT-specific artifacts to generate a diverse set of synthetic CBCT images from high-quality CT images without requiring pre-aligned data. Through extensive experiments, we demonstrate that several different generative networks trained on our synthesized data achieve remarkable results on heterogeneous multi-institutional datasets, outperforming even the same networks trained on actual data. We further show that our degradation model conveniently provides an avenue to enforce anatomical constraints in conditional generative models, yielding high-quality and structure-preserving synthetic CT images.

[CV-68] MultiClimate: Multimodal Stance Detection on Climate Change Videos

链接: https://arxiv.org/abs/2409.18346
作者: Jiawen Wang,Longfei Zuo,Siyao Peng,Barbara Plank
关键词-EN: attracted increasing attention, Climate change, attention in NLP, NLP in recent, recent years
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art, 0.747 / 0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models. Our code, dataset, as well as supplementary materials, are available at this https URL.
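The modality-fusion baseline described above amounts to concatenating the text and image embeddings and applying a classifier head. The dimensions, weights, and three-way stance layout below are illustrative assumptions, not the paper's fusion architecture:

```python
import numpy as np

# Sketch of late fusion for frame-transcript stance detection: concatenate
# the two modality embeddings and apply a linear softmax classifier.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_and_classify(text_emb, image_emb, W, b):
    fused = np.concatenate([text_emb, image_emb])   # simple late fusion
    return softmax(W @ fused + b)                   # stance probabilities

rng = np.random.default_rng(0)
text_emb = rng.normal(size=768)                     # e.g. a BERT-sized vector
image_emb = rng.normal(size=512)                    # e.g. a CNN/ViT-sized vector
W = 0.01 * rng.normal(size=(3, 768 + 512))          # 3 stance classes
b = np.zeros(3)
probs = fuse_and_classify(text_emb, image_emb, W, b)
print(probs.sum())   # 1.0
```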

[CV-69] Does End-to-End Autonomous Driving Really Need Perception Tasks?

链接: https://arxiv.org/abs/2409.18341
作者: Peidong Li,Dixiao Cui
关键词-EN: extract explicit scene, supervised perception tasks, methods typically rely, explicit scene information, Sparse Scene Representation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report

点击查看摘要

Abstract:End-to-End Autonomous Driving (E2EAD) methods typically rely on supervised perception tasks to extract explicit scene information (e.g., objects, maps). This reliance necessitates expensive annotations and constrains deployment and data scalability in real-time applications. In this paper, we introduce SSR, a novel framework that utilizes only 16 navigation-guided tokens as a Sparse Scene Representation, efficiently extracting crucial scene information for E2EAD. Our method eliminates the need for supervised sub-tasks, allowing computational resources to concentrate on essential elements directly related to navigation intent. We further introduce a temporal enhancement module that employs a Bird’s-Eye View (BEV) world model, aligning predicted future scenes with actual future scenes through self-supervision. SSR achieves state-of-the-art planning performance on the nuScenes dataset, demonstrating a 27.2% relative reduction in L2 error and a 51.6% decrease in collision rate relative to the leading E2EAD method, UniAD. Moreover, SSR offers a 10.9× faster inference speed and 13× faster training time. This framework represents a significant leap in real-time autonomous driving systems and paves the way for future scalable deployment. Code will be released at this https URL.

[CV-70] DeBaRA: Denoising-Based 3D Room Arrangement Generation NEURIPS2024

链接: https://arxiv.org/abs/2409.18336
作者: Léopold Maillard,Nicolas Sereyjol-Garros,Tom Durand,Maks Ovsjanikov
关键词-EN: unlocks multiple interactive, Generating realistic, scenes unlocks multiple, multiple interactive applications, interactive applications impacting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024. Preprint version

点击查看摘要

Abstract:Generating realistic and diverse layouts of furnished indoor 3D scenes unlocks multiple interactive applications impacting a wide range of industries. The inherent complexity of object interactions, the limited amount of available data and the requirement to fulfill spatial constraints all make generative modeling for 3D scene synthesis and arrangement challenging. Current methods address these challenges autoregressively or by using off-the-shelf diffusion objectives by simultaneously predicting all attributes without 3D reasoning considerations. In this paper, we introduce DeBaRA, a score-based model specifically tailored for precise, controllable and flexible arrangement generation in a bounded environment. We argue that the most critical component of a scene synthesis system is to accurately establish the size and position of various objects within a restricted area. Based on this insight, we propose a lightweight conditional score-based model designed with 3D spatial awareness at its core. We demonstrate that by focusing on spatial attributes of objects, a single trained DeBaRA model can be leveraged at test time to perform several downstream applications such as scene synthesis, completion and re-arrangement. Further, we introduce a novel Self Score Evaluation procedure so it can be optimally employed alongside external LLM models. We evaluate our approach through extensive experiments and demonstrate significant improvement upon state-of-the-art approaches in a range of scenarios.

[CV-71] Automated Segmentation and Analysis of Microscopy Images of Laser Powder Bed Fusion Melt Tracks

链接: https://arxiv.org/abs/2409.18326
作者: Aagam Shah,Reimar Weissbach,David A. Griggs,A. John Hart,Elif Ertekin,Sameh Tawfick
关键词-EN: metal additive manufacturing, optimise printing conditions, additive manufacturing, researchers and practitioners, printing conditions
类目: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
*备注: 21 pages, 10 figures

点击查看摘要

Abstract:With the increasing adoption of metal additive manufacturing (AM), researchers and practitioners are turning to data-driven approaches to optimise printing conditions. Cross-sectional images of melt tracks provide valuable information for tuning process parameters, developing parameter scaling data, and identifying defects. Here we present an image segmentation neural network that automatically identifies and measures melt track dimensions from a cross-section image. We use a U-Net architecture to train on a data set of 62 pre-labelled images obtained from different labs, machines, and materials coupled with image augmentation. When neural network hyperparameters such as batch size and learning rate are properly tuned, the learned model shows an accuracy for classification of over 99% and an F1 score over 90%. The neural network exhibits robustness when tested on images captured by various users, printed on different machines, and acquired using different microscopes. A post-processing module extracts the height and width of the melt pool, and the wetting angles. We discuss opportunities to improve model performance and avenues for transfer learning, such as extension to other AM processes such as directed energy deposition.
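The post-processing step that extracts melt-track dimensions from the segmentation output can be sketched directly from the binary mask. Pixel-to-micron scaling and the wetting-angle fit are omitted, and the function below is a hypothetical helper, not the paper's code:

```python
import numpy as np

# Sketch: measure melt-track height and width (in pixels) as the extents
# of the segmented region in a binary mask produced by the U-Net.

def melt_pool_dimensions(mask):
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return 0, 0
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return height, width

mask = np.zeros((10, 12), dtype=bool)
mask[3:7, 2:10] = True              # a 4 x 8 melt-pool region
print(melt_pool_dimensions(mask))   # (4, 8)
```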

[CV-72] Realistic Evaluation of Model Merging for Compositional Generalization

链接: https://arxiv.org/abs/2409.18314
作者: Derek Tam,Yash Kant,Brian Lester,Igor Gilitschenski,Colin Raffel
关键词-EN: cheaply combine individual, combine individual models, attains better performance, cheaply combine, combine individual
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup to test new methods.

[CV-73] Harnessing Wavelet Transformations for Generalizable Deepfake Forgery Detection

链接: https://arxiv.org/abs/2409.18301
作者: Lalith Bharadwaj Baru,Shilhora Akshay Patel,Rohit Boddeda
关键词-EN: digital image manipulation, significantly challenges existing, deep generative models, challenges existing deepfake, significantly challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The evolution of digital image manipulation, particularly with the advancement of deep generative models, significantly challenges existing deepfake detection methods, especially when the origin of the deepfake is obscure. To tackle the increasing complexity of these forgeries, we propose Wavelet-CLIP, a deepfake detection framework that integrates wavelet transforms with features derived from the ViT-L/14 architecture, pre-trained in the CLIP fashion. Wavelet-CLIP utilizes Wavelet Transforms to deeply analyze both spatial and frequency features from images, thus enhancing the model’s capability to detect sophisticated deepfakes. To verify the effectiveness of our approach, we conducted extensive evaluations against existing state-of-the-art methods for cross-dataset generalization and detection of unseen images generated by standard diffusion models. Our method showcases outstanding performance, achieving an average AUC of 0.749 for cross-data generalization and 0.893 for robustness against unseen deepfakes, outperforming all compared methods. The code can be reproduced from the repo: this https URL
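A one-level 2D Haar decomposition, the kind of spatial-frequency split such wavelet pipelines start from, can be written directly in NumPy. This is a sketch; Wavelet-CLIP's actual wavelet family, decomposition depth, and feature pipeline are not specified here:

```python
import numpy as np

# One-level 2D Haar wavelet decomposition: splits an image into a
# low-frequency approximation (LL) and three detail subbands (LH, HL, HH).

def haar2d(img):
    a = (img[0::2] + img[1::2]) / 2      # average adjacent row pairs
    d = (img[0::2] - img[1::2]) / 2      # difference of adjacent row pairs
    ll = (a[:, 0::2] + a[:, 1::2]) / 2   # low-low: coarse structure
    lh = (a[:, 0::2] - a[:, 1::2]) / 2
    hl = (d[:, 0::2] + d[:, 1::2]) / 2
    hh = (d[:, 0::2] - d[:, 1::2]) / 2   # high-high: fine detail
    return ll, lh, hl, hh

img = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar2d(img)
print(ll.shape)   # (2, 2)
```

Generation artifacts often concentrate in the high-frequency subbands, which is the usual motivation for feeding such a split to a detector.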

[CV-74] SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

链接: https://arxiv.org/abs/2409.18300
作者: Ruiqi Xian,Xiyang Wu,Tianrui Guan,Xijun Wang,Boqing Gong,Dinesh Manocha
关键词-EN: Unmanned Aerial Vehicles, aerial footage captured, Aerial Vehicles, Unmanned Aerial, captured by Unmanned
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We introduce SOAR, a novel Self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone outperforms the best UAV action recognition models, recording a 9.7% and 21.4% boost in top-1 accuracy on the NEC-Drone and UAV-Human datasets, while delivering an inference speed of 18.7ms per video, making it 2x to 5x faster. Additionally, SOAR obtains comparable accuracy to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage.
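The object-aware masking idea can be sketched as follows; this is a hypothetical simplification (the function and parameter names such as `keep_frac` are ours, not SOAR's): bias the MAE-style random patch mask so that a fixed share of the visible budget is reserved for patches overlapping object boxes.

```python
import numpy as np

def object_aware_mask(grid_h, grid_w, boxes, mask_ratio=0.75,
                      keep_frac=0.5, rng=None):
    """Sample a patch mask that preferentially keeps object patches visible.

    boxes: list of (r0, c0, r1, c1) patch-grid rectangles covering objects.
    keep_frac: fraction of the *visible* budget reserved for object patches.
    Returns a boolean (grid_h, grid_w) array; True = masked (hidden).
    """
    rng = rng or np.random.default_rng(0)
    on_object = np.zeros((grid_h, grid_w), dtype=bool)
    for r0, c0, r1, c1 in boxes:
        on_object[r0:r1, c0:c1] = True

    n_total = grid_h * grid_w
    n_visible = int(round(n_total * (1 - mask_ratio)))
    obj_idx = np.flatnonzero(on_object.ravel())
    bg_idx = np.flatnonzero(~on_object.ravel())

    # Reserve part of the visible budget for object patches, rest for background.
    n_obj = min(len(obj_idx), int(round(n_visible * keep_frac)))
    visible = list(rng.choice(obj_idx, n_obj, replace=False))
    visible += list(rng.choice(bg_idx, n_visible - n_obj, replace=False))

    mask = np.ones(n_total, dtype=bool)
    mask[np.array(visible)] = False
    return mask.reshape(grid_h, grid_w)

# A 14x14 ViT patch grid with one object box: object patches end up visible
# far more often than under a uniform 75% mask.
mask = object_aware_mask(14, 14, boxes=[(2, 2, 8, 8)])
```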

[CV-75] FlatnFold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation

链接: https://arxiv.org/abs/2409.18297
作者: Lipeng Zhuang,Shiyu Fan,Yingdong Ru,Florent Audonnet,Paul Henderson,Gerardo Aragon-Camarasa
关键词-EN: addresses critical gaps, addresses critical, critical gaps, large-scale dataset, dataset
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present Flat’n’Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. Comprising 1,212 human and 887 robot demonstrations of flattening and folding 44 unique garments across 8 categories, Flat’n’Fold surpasses prior datasets in size, scope, and diversity. Our dataset uniquely captures the entire manipulation process from crumpled to folded states, providing synchronized multi-view RGB-D images, point clouds, and action data, including hand or gripper positions and rotations. We quantify the dataset’s diversity and complexity compared to existing benchmarks and show that our dataset features natural and diverse real-world manipulations, from both human and robot demonstrations, in terms of visual and action information. To showcase Flat’n’Fold’s utility, we establish new benchmarks for grasping point prediction and subtask decomposition. Our evaluation of state-of-the-art models on these tasks reveals significant room for improvement. This underscores Flat’n’Fold’s potential to drive advances in robotic perception and manipulation of deformable objects. Our dataset can be downloaded at this https URL

[CV-76] Efficient Microscopic Image Instance Segmentation for Food Crystal Quality Control

链接: https://arxiv.org/abs/2409.18291
作者: Xiaoyu Ji,Jan P Allebach,Ali Shakouri,Fengqing Zhu
关键词-EN: quality control area, efficiently predicting food, crystal quality control, area for manufacturing, focusing on efficiently
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper is directed towards the food crystal quality control area for manufacturing, focusing on efficiently predicting food crystal counts and size distributions. Previously, manufacturers used the manual counting method on microscopic images of food liquid products, which requires substantial human effort and suffers from inconsistency issues. Food crystal segmentation is a challenging problem due to the diverse shapes of crystals and their surrounding hard mimics. To address this challenge, we propose an efficient instance segmentation method based on object detection. Experimental results show that the predicted crystal counting accuracy of our method is comparable with existing segmentation methods, while being five times faster. Based on our experiments, we also define objective criteria for separating hard mimics and food crystals, which could benefit manual annotation tasks on similar datasets.

[CV-77] Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing

链接: https://arxiv.org/abs/2409.18286
作者: Huthaifa I. Ashqar,Ahmed Jaber,Taqwa I. Alhadidi,Mohammed Elhenawy
关键词-EN: Large Vision Models, large language models, multimodal large language, Vision Models, Large Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This study aims to comprehensively review and empirically evaluate the application of multimodal large language models (MLLMs) and Large Vision Models (VLMs) in object detection for transportation systems. In the first fold, we provide a background about the potential benefits of MLLMs in transportation applications and conduct a comprehensive review of current MLLM technologies in previous studies. We highlight their effectiveness and limitations in object detection within various transportation scenarios. The second fold involves providing an overview of the taxonomy of end-to-end object detection in transportation applications and future directions. Building on this, we proposed empirical analysis for testing MLLMs on three real-world transportation problems that include object detection tasks namely, road safety attributes extraction, safety-critical event detection, and visual reasoning of thermal images. Our findings provide a detailed assessment of MLLM performance, uncovering both strengths and areas for improvement. Finally, we discuss practical limitations and challenges of MLLMs in enhancing object detection in transportation, thereby offering a roadmap for future research and development in this critical area.

[CV-78] Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning NEURIPS2024

链接: https://arxiv.org/abs/2409.18265
作者: Grzegorz Rypeść,Sebastian Cygert,Tomasz Trzciński,Bartłomiej Twardowski
关键词-EN: Class Incremental Learning, Exemplar-Free Class Incremental, Exemplar-Free Class, Incremental Learning, Class Incremental
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for NeurIPS 2024

点击查看摘要

Abstract:Exemplar-Free Class Incremental Learning (EFCIL) tackles the problem of training a model on a sequence of tasks without access to past data. Existing state-of-the-art methods represent classes as Gaussian distributions in the feature extractor’s latent space, enabling Bayes classification or training the classifier by replaying pseudo features. However, we identify two critical issues that compromise their efficacy when the feature extractor is updated on incremental tasks. First, they do not consider that classes’ covariance matrices change and must be adapted after each task. Second, they are susceptible to a task-recency bias caused by dimensionality collapse occurring during training. In this work, we propose AdaGauss – a novel method that adapts covariance matrices from task to task and mitigates the task-recency bias owing to the additional anti-collapse loss function. AdaGauss yields state-of-the-art results on popular EFCIL benchmarks and datasets when training from scratch or starting from a pre-trained backbone. The code is available at: this https URL.
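The Gaussian-per-class representation that AdaGauss (and the EFCIL baselines it improves on) relies on reduces classification to comparing per-class log-densities. A minimal Bayes-rule sketch in NumPy, deliberately leaving out the covariance adaptation and anti-collapse loss that are the paper's actual contribution:

```python
import numpy as np

def gaussian_bayes_predict(x, means, covs):
    """Assign x to the class whose Gaussian gives the highest log-density.

    means: (K, D) per-class means; covs: (K, D, D) per-class covariances,
    as stored in the feature extractor's latent space.
    """
    scores = []
    for mu, cov in zip(means, covs):
        diff = x - mu
        # log N(x; mu, cov) up to a class-independent constant:
        # -0.5 * (log|cov| + Mahalanobis distance).
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        scores.append(-0.5 * (logdet + maha))
    return int(np.argmax(scores))

means = np.array([[0.0, 0.0], [4.0, 4.0]])
covs = np.stack([np.eye(2), np.eye(2)])
assert gaussian_bayes_predict(np.array([0.2, -0.1]), means, covs) == 0
assert gaussian_bayes_predict(np.array([3.8, 4.3]), means, covs) == 1
```

The issue AdaGauss targets is that these stored `means`/`covs` become stale as the feature extractor drifts across tasks, so they must be adapted rather than reused verbatim.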

[CV-79] Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation ECCV2024

链接: https://arxiv.org/abs/2409.18261
作者: Mengchen Zhang,Tong Wu,Tai Wang,Tengfei Wang,Ziwei Liu,Dahua Lin
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV 2024 (poster). Github page: this https URL

点击查看摘要

[CV-80] PCEvE: Part Contribution Evaluation Based Model Explanation for Human Figure Drawing Assessment and Beyond

链接: https://arxiv.org/abs/2409.18260
作者: Jongseo Lee,Geo Ahn,Jinwoo Choi,Seongtae Kim
关键词-EN: human figure drawing, autism spectrum disorder, automatic human figure, diagnosing autism spectrum, figure drawing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:For automatic human figure drawing (HFD) assessment tasks, such as diagnosing autism spectrum disorder (ASD) using HFD images, the clarity and explainability of a model decision are crucial. Existing pixel-level attribution-based explainable AI (XAI) approaches demand considerable effort from users to interpret the semantic information of a region in an image, which can be often time-consuming and impractical. To overcome this challenge, we propose a part contribution evaluation based model explanation (PCEvE) framework. On top of the part detection, we measure the Shapley Value of each individual part to evaluate the contribution to a model decision. Unlike existing attribution-based XAI approaches, the PCEvE provides a straightforward explanation of a model decision, i.e., a part contribution histogram. Furthermore, the PCEvE expands the scope of explanations beyond the conventional sample-level to include class-level and task-level insights, offering a richer, more comprehensive understanding of model behavior. We rigorously validate the PCEvE via extensive experiments on multiple HFD assessment datasets. Also, we sanity-check the proposed method with a set of controlled experiments. Additionally, we demonstrate the versatility and applicability of our method to other domains by applying it to a photo-realistic dataset, the Stanford Cars.
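PCEvE's part contributions are Shapley values, which can be computed exactly by enumerating coalitions when the number of parts is small (as with body parts in HFD images). A generic sketch, with a toy additive scorer standing in for the actual model:

```python
from itertools import combinations
from math import factorial

def shapley_values(parts, value):
    """Exact Shapley value of each part under coalition value function `value`.

    value(frozenset_of_parts) -> model score with only those parts present.
    Exponential in len(parts), so only feasible for a handful of parts.
    """
    n = len(parts)
    phi = {p: 0.0 for p in parts}
    for p in parts:
        others = [q for q in parts if q != p]
        for k in range(n):
            for coal in combinations(others, k):
                s = frozenset(coal)
                # Standard Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi

# Toy additive scorer: each part contributes a fixed amount to the score.
contrib = {"head": 0.5, "arms": 0.3, "legs": 0.2}
score = lambda coalition: sum(contrib[p] for p in coalition)
phi = shapley_values(list(contrib), score)
# For an additive game the Shapley value recovers each part's contribution.
assert abs(phi["head"] - 0.5) < 1e-9
```

The part-contribution histogram the paper describes is then just a bar chart of `phi` over the detected parts.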

[CV-81] Amodal Instance Segmentation with Diffusion Shape Prior Estimation ACCV2024

链接: https://arxiv.org/abs/2409.18256
作者: Minh Tran,Khoa Vo,Tri Nguyen,Ngan Le
关键词-EN: Amodal Instance Segmentation, Amodal Instance, shape prior, Shape Prior Estimation, Shape Prior Amodal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ACCV2024

点击查看摘要

Abstract:Amodal Instance Segmentation (AIS) presents an intriguing challenge, including the segmentation prediction of both visible and occluded parts of objects within images. Previous methods have often relied on shape prior information gleaned from training data to enhance amodal segmentation. However, these approaches are susceptible to overfitting and disregard object category details. Recent advancements highlight the potential of conditioned diffusion models, pretrained on extensive datasets, to generate images from latent space. Drawing inspiration from this, we propose AISDiff with a Diffusion Shape Prior Estimation (DiffSP) module. AISDiff begins with the prediction of the visible segmentation mask and object category, alongside occlusion-aware processing through the prediction of occluding masks. Subsequently, these elements are inputted into our DiffSP module to infer the shape prior of the object. DiffSP utilizes conditioned diffusion models pretrained on extensive datasets to extract rich visual features for shape prior estimation. Additionally, we introduce the Shape Prior Amodal Predictor, which utilizes attention-based feature maps from the shape prior to refine amodal segmentation. Experiments across various AIS benchmarks demonstrate the effectiveness of our AISDiff.

[CV-82] Spatial Visibility and Temporal Dynamics: Revolutionizing Field of View Prediction in Adaptive Point Cloud Video Streaming

链接: https://arxiv.org/abs/2409.18236
作者: Chen Li,Tongyu Zong,Yueyu Hu,Yao Wang,Yong Liu
关键词-EN: reduces bandwidth requirement, transmitting visible points, adaptive streaming significantly, streaming significantly reduces, significantly reduces bandwidth
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Field-of-View (FoV) adaptive streaming significantly reduces bandwidth requirement of immersive point cloud video (PCV) by only transmitting visible points in a viewer’s FoV. The traditional approaches often focus on trajectory-based 6 degree-of-freedom (6DoF) FoV predictions. The predicted FoV is then used to calculate point visibility. Such approaches do not explicitly consider video content’s impact on viewer attention, and the conversion from FoV to point visibility is often error-prone and time-consuming. We reformulate the PCV FoV prediction problem from the cell visibility perspective, allowing for precise decision-making regarding the transmission of 3D data at the cell level based on the predicted visibility distribution. We develop a novel spatial visibility and object-aware graph model that leverages the historical 3D visibility data and incorporates spatial perception, neighboring cell correlation, and occlusion information to predict the cell visibility in the future. Our model significantly improves the long-term cell visibility prediction, reducing the prediction MSE loss by up to 50% compared to the state-of-the-art models while maintaining real-time performance (more than 30fps) for point cloud videos with over 1 million points.

[CV-83] Visual Concept Networks: A Graph-Based Approach to Detecting Anomalous Data in Deep Neural Networks

链接: https://arxiv.org/abs/2409.18235
作者: Debargha Ganguly,Debayan Gupta,Vipin Chaudhary
关键词-EN: Deep neural networks, Deep neural, struggle with robustness, increasingly deployed, robustness against anomalous
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs), while increasingly deployed in many applications, struggle with robustness against anomalous and out-of-distribution (OOD) data. Current OOD benchmarks often oversimplify, focusing on single-object tasks and not fully representing complex real-world anomalies. This paper introduces a new, straightforward method employing graph structures and topological features to effectively detect both far-OOD and near-OOD data. We convert images into networks of interconnected human understandable features or visual concepts. Through extensive testing on two novel tasks, including ablation studies with large vocabularies and diverse tasks, we demonstrate the method’s effectiveness. This approach enhances DNN resilience to OOD data and promises improved performance in various applications.

[CV-84] Analysis of Spatial augmentation in Self-supervised models in the purview of training and test distributions ECCV2024

链接: https://arxiv.org/abs/2409.18228
作者: Abhishek Jha,Tinne Tuytelaars
关键词-EN: typical spatial augmentation, spatial augmentation techniques, representation learning methods, self-supervised representation learning, contrastive and non-contrastive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ECCV 2024 Workshop on Out-of-distribution generalization in computer vision (OOD-CV)

点击查看摘要

Abstract:In this paper, we present an empirical study of typical spatial augmentation techniques used in self-supervised representation learning methods (both contrastive and non-contrastive), namely random crop and cutout. Our contributions are: (a) we dissociate random cropping into two separate augmentations, overlap and patch, and provide a detailed analysis on the effect of area of overlap and patch size on the accuracy on downstream tasks. (b) We offer an insight into why cutout augmentation does not learn good representations, as reported in earlier literature. Finally, based on these analyses, (c) we propose a distance-based margin to the invariance loss for learning scene-centric representations for the downstream task on object-centric distribution, showing that a margin as simple as one proportional to the pixel distance between the two spatial views in the scene-centric images can improve the learned representation. Our study furthers the understanding of spatial augmentations, and the effect of the domain-gap between the training augmentations and the test distribution.
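The distance-based margin can be sketched as a hinge on the embedding distance, with slack proportional to the pixel distance between the two crops. This is our own hedged reading of the idea (the exact loss in the paper may differ, and the scale `alpha` is our notation):

```python
import numpy as np

def margin_invariance_loss(z1, z2, pixel_dist, alpha=0.01):
    """Hinge-style invariance loss with a distance-proportional margin.

    z1, z2: embeddings of two spatial views of the same image;
    pixel_dist: distance (in pixels) between the two crop centers;
    alpha: margin granted per pixel of separation. Views that are far
    apart in the image are allowed to differ more in feature space.
    """
    d = np.linalg.norm(z1 - z2)
    return max(0.0, d - alpha * pixel_dist)

z1, z2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # distance sqrt(2)
# Close crops: little slack, the invariance penalty stays active.
assert margin_invariance_loss(z1, z2, pixel_dist=10) > 1.0
# Distant crops: the margin absorbs the feature distance entirely.
assert margin_invariance_loss(z1, z2, pixel_dist=200) == 0.0
```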

[CV-85] Learning to Drive via Asymmetric Self-Play ECCV2024

链接: https://arxiv.org/abs/2409.18218
作者: Chris Zhang,Sourav Biswas,Kelvin Wong,Kion Fallah,Lunjun Zhang,Dian Chen,Sergio Casas,Raquel Urtasun
关键词-EN: Large-scale data, crucial for learning, Large-scale, data, real data
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Large-scale data is crucial for learning realistic and capable driving policies. However, it can be impractical to rely on scaling datasets with real data alone. The majority of driving data is uninteresting, and deliberately collecting new long-tail scenarios is expensive and unsafe. We propose asymmetric self-play to scale beyond real data with additional challenging, solvable, and realistic synthetic scenarios. Our approach pairs a teacher that learns to generate scenarios it can solve but the student cannot, with a student that learns to solve them. When applied to traffic simulation, we learn realistic policies with significantly fewer collisions in both nominal and long-tail scenarios. Our policies further zero-shot transfer to generate training data for end-to-end autonomy, significantly outperforming state-of-the-art adversarial approaches, or using real data alone. For more information, visit this https URL .

[CV-86] Evaluation of Security of ML-based Watermarking: Copy and Removal Attacks

链接: https://arxiv.org/abs/2409.18211
作者: Vitaliy Kinakh,Brian Pulfer,Yury Belousov,Pierre Fernandez,Teddy Furon,Slava Voloshynovskiy
关键词-EN: data provenance verification, AI-generated media necessitate, digital content captured, media necessitate methods, copyright protection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The vast amounts of digital content captured from the real world or AI-generated media necessitate methods for copyright protection, traceability, or data provenance verification. Digital watermarking serves as a crucial approach to address these challenges. Its evolution spans three generations: handcrafted, autoencoder-based, and foundation model based methods. While the robustness of these systems is well-documented, the security against adversarial attacks remains underexplored. This paper evaluates the security of foundation models’ latent space digital watermarking systems that utilize adversarial embedding techniques. A series of experiments investigate the security dimensions under copy and removal attacks, providing empirical insights into these systems’ vulnerabilities. All experimental codes and results are available at this https URL repository

[CV-87] SSP-RACL: Classification of Noisy Fundus Images with Self-Supervised Pretraining and Robust Adaptive Credal Loss

链接: https://arxiv.org/abs/2409.18147
作者: Mengwen Ye,Yingzi Huangfu,You Li,Zekuan Yu
关键词-EN: aided diagnosis tasks, deep neural networks, computer aided diagnosis, fundus image datasets, Fundus image classification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE BioCAS 2024

点击查看摘要

Abstract:Fundus image classification is crucial in the computer aided diagnosis tasks, but label noise significantly impairs the performance of deep neural networks. To address this challenge, we propose a robust framework, Self-Supervised Pre-training with Robust Adaptive Credal Loss (SSP-RACL), for handling label noise in fundus image datasets. First, we use Masked Autoencoders (MAE) for pre-training to extract features, unaffected by label noise. Subsequently, RACL employs a superset learning framework, setting confidence thresholds and an adaptive label relaxation parameter to construct possibility distributions and provide more reliable ground-truth estimates, thus effectively suppressing the memorization effect. Additionally, we introduce clinical knowledge-based asymmetric noise generation to simulate real-world noisy fundus image datasets. Experimental results demonstrate that our proposed method outperforms existing approaches in handling label noise, showing superior performance.
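The superset-learning step can be illustrated with a simplified threshold-based label relaxation. The thresholding scheme and names below are our guess at the general idea, not RACL's exact rule: when the model's confidence in the annotated label falls below a threshold, the hard label is relaxed into a candidate set of plausible labels.

```python
import numpy as np

def relax_label(probs, given_label, tau=0.6):
    """Relax a (possibly noisy) label into a candidate label set.

    If the model's confidence in the given label is at least tau, keep the
    hard label. Otherwise, every label whose probability exceeds
    (1 - tau) * max prob joins the candidate set, yielding a superset
    ("credal") target instead of a single hard label.
    """
    probs = np.asarray(probs, dtype=float)
    if probs[given_label] >= tau:
        return {given_label}
    cutoff = (1 - tau) * probs.max()
    candidates = set(np.flatnonzero(probs >= cutoff).tolist())
    candidates.add(given_label)  # never discard the annotated label
    return candidates

# Confident prediction agreeing with the label: keep the hard label.
assert relax_label([0.7, 0.2, 0.1], given_label=0) == {0}
# Ambiguous prediction disagreeing with the label: train on a superset.
assert relax_label([0.45, 0.45, 0.10], given_label=2) == {0, 1, 2}
```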

[CV-88] Simulating Dynamic Tumor Contrast Enhancement in Breast MRI using Conditional Generative Adversarial Networks

链接: https://arxiv.org/abs/2409.18872
作者: Richard Osuala,Smriti Joshi,Apostolia Tsirikoglou,Lidia Garrucho,Walter H.L. Pinaya,Daniel M. Lang,Julia A. Schnabel,Oliver Diaz,Karim Lekadir
关键词-EN: agent-based DCE-MRI acquisition, traditional contrast agent-based, promising non-invasive alternative, Scaled Aggregate Measure, paper presents
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a method for virtual contrast enhancement in breast MRI, offering a promising non-invasive alternative to traditional contrast agent-based DCE-MRI acquisition. Using a conditional generative adversarial network, we predict DCE-MRI images, including jointly-generated sequences of multiple corresponding DCE-MRI timepoints, from non-contrast-enhanced MRIs, enabling tumor localization and characterization without the associated health risks. Furthermore, we qualitatively and quantitatively evaluate the synthetic DCE-MRI images, proposing a multi-metric Scaled Aggregate Measure (SAMe), assessing their utility in a tumor segmentation downstream task, and conclude with an analysis of the temporal patterns in multi-sequence DCE-MRI generation. Our approach demonstrates promising results in generating realistic and useful DCE-MRI sequences, highlighting the potential of virtual contrast enhancement for improving breast cancer diagnosis and treatment, particularly for patients where contrast agent administration is contraindicated.

[CV-89] Positional Encoder Graph Quantile Neural Networks for Geographic Data

链接: https://arxiv.org/abs/2409.18865
作者: William E. R. de Amorim,Scott A. Sisson,T. Rodrigues,David J. Nott,Guilherme S. Rodrigues
关键词-EN: Positional Encoder Graph, Encoder Graph Neural, Graph Neural Networks, Graph Quantile Neural, Encoder Graph Quantile
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 17 main text pages, 4 figures

点击查看摘要

Abstract:Positional Encoder Graph Neural Networks (PE-GNNs) are a leading approach for modeling continuous spatial data. However, they often fail to produce calibrated predictive distributions, limiting their effectiveness for uncertainty quantification. We introduce the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel method that integrates PE-GNNs, Quantile Neural Networks, and recalibration techniques in a fully nonparametric framework, requiring minimal assumptions about the predictive distributions. We propose a new network architecture that, when combined with a quantile-based loss function, yields accurate and reliable probabilistic models without increasing computational complexity. Our approach provides a flexible, robust framework for conditional density estimation, applicable beyond spatial data contexts. We further introduce a structured method for incorporating a KNN predictor into the model while avoiding data leakage through the GNN layer operation. Experiments on benchmark datasets demonstrate that PE-GQNN significantly outperforms existing state-of-the-art methods in both predictive accuracy and uncertainty quantification.
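The quantile-based loss that PE-GQNN builds on is the standard pinball loss, whose minimizer is the q-th conditional quantile, so a network trained on it at several values of q yields a nonparametric predictive distribution:

```python
import numpy as np

def pinball_loss(y, y_hat, q):
    """Pinball (quantile) loss: asymmetric penalty whose minimizer is the
    q-th conditional quantile of y given the inputs."""
    err = y - y_hat
    return np.mean(np.maximum(q * err, (q - 1) * err))

y = np.array([1.0, 2.0, 3.0])
# For q = 0.9, under-prediction costs 9x more than over-prediction,
# so estimates are pushed toward the upper tail of the distribution.
assert pinball_loss(y, np.full(3, 2.0), q=0.9) < \
       pinball_loss(y, np.full(3, 0.0), q=0.9)
```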

[CV-90] Early diagnosis of Alzheimers disease from MRI images with deep learning model

链接: https://arxiv.org/abs/2409.18814
作者: Sajjad Aghasi Javid,Mahmood Mohassel Feghhi
关键词-EN: worldwide is Alzheimer, Alzheimer disease, Minority Oversampling Technique, Alzheimer, Synthetic Minority Oversampling
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, Presented at the 20-th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP) 21-22 February, 2024, Mazandaran University of Science and Technology, Babol, Iran

点击查看摘要

Abstract:It is acknowledged that the most common cause of dementia worldwide is Alzheimer’s disease (AD). This condition progresses in severity from mild to severe and interferes with people’s everyday routines. Early diagnosis plays a critical role in patient care and clinical trials. Convolutional neural networks (CNN) are used to create a framework for identifying specific disease features from MRI scans. Classification of dementia involves approaches such as medical history review, neuropsychological tests, and magnetic resonance imaging (MRI). However, the image dataset obtained from Kaggle faces a significant issue of class imbalance, which requires equal distribution of samples from each class to address. In this article, to address this imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is utilized. Furthermore, a pre-trained convolutional neural network has been applied to the DEMNET dementia network to extract key features from AD images. The proposed model achieved an impressive accuracy of 98.67%.
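SMOTE's core step, interpolating between a minority sample and one of its nearest minority-class neighbors, fits in a few lines of NumPy. This is a sketch only; practical use would go through a library implementation such as imbalanced-learn's `SMOTE`:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (SMOTE).

    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority-class neighbors.
    """
    rng = rng or np.random.default_rng(0)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances within the minority class (self excluded).
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per sample

    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]
        lam = rng.random()  # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Oversample a 4-point minority class up by 10 synthetic samples.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
S = smote(X, n_new=10, k=2)
```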

[CV-91] DualDn: Dual-domain Denoising via Differentiable ISP ECCV2024

链接: https://arxiv.org/abs/2409.18783
作者: Ruikang Li,Yujin Wang,Shiqi Chen,Fan Zhang,Jinwei Gu,Tianfan Xue
关键词-EN: Image Signal Processing, camera Image Signal, Image Signal, ISP, Signal Processing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024, Project page: this https URL

点击查看摘要

Abstract:Image denoising is a critical component in a camera’s Image Signal Processing (ISP) pipeline. There are two typical ways to inject a denoiser into the ISP pipeline: applying a denoiser directly to captured raw frames (raw domain) or to the ISP’s output sRGB images (sRGB domain). However, both approaches have their limitations. Residual noise from raw-domain denoising can be amplified by the subsequent ISP processing, and the sRGB domain struggles to handle spatially varying noise since it only sees noise distorted by the ISP. Consequently, most raw or sRGB domain denoising works only for specific noise distributions and ISP configurations. To address these challenges, we propose DualDn, a novel learning-based dual-domain denoising. Unlike previous single-domain denoising, DualDn consists of two denoising networks: one in the raw domain and one in the sRGB domain. The raw domain denoising adapts to sensor-specific noise as well as spatially varying noise levels, while the sRGB domain denoising adapts to ISP variations and removes residual noise amplified by the ISP. Both denoising networks are connected with a differentiable ISP, which is trained end-to-end and discarded during the inference stage. With this design, DualDn achieves greater generalizability compared to most learning-based denoising methods, as it can adapt to different unseen noises, ISP parameters, and even novel ISP pipelines. Experiments show that DualDn achieves state-of-the-art performance and can adapt to different denoising architectures. Moreover, DualDn can be used as a plug-and-play denoising module with real cameras without retraining, and still demonstrate better performance than commercial on-camera denoising. The project website is available at: this https URL

[CV-92] A Generalized Tensor Formulation for Hyperspectral Image Super-Resolution Under General Spatial Blurring

链接: https://arxiv.org/abs/2409.18731
作者: Yinjian Wang,Wei Li,Yuanyuan Gui,Qian Du,James E. Fowler
关键词-EN: low spatial resolution, high spatial resolution, spatial resolution, desired super-resolved image, observed hyperspectral image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hyperspectral super-resolution is commonly accomplished by the fusing of a hyperspectral image of low spatial resolution with a multispectral image of high spatial resolution, and many tensor-based approaches to this task have been recently proposed. Yet, it is assumed in such tensor-based methods that the spatial-blurring operation that creates the observed hyperspectral image from the desired super-resolved image is separable into independent horizontal and vertical blurring. Recent work has argued that such separable spatial degradation is ill-equipped to model the operation of real sensors which may exhibit, for example, anisotropic blurring. To accommodate this fact, a generalized tensor formulation based on a Kronecker decomposition is proposed to handle any general spatial-degradation matrix, including those that are not separable as previously assumed. Analysis of the generalized formulation reveals conditions under which exact recovery of the desired super-resolved image is guaranteed, and a practical algorithm for such recovery, driven by a blockwise-group-sparsity regularization, is proposed. Extensive experimental results demonstrate that the proposed generalized tensor approach outperforms not only traditional matrix-based techniques but also state-of-the-art tensor-based methods; the gains with respect to the latter are especially significant in cases of anisotropic spatial blurring.
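The separability assumption being relaxed has a compact linear-algebra form: blurring rows and columns independently equals a single Kronecker-product operator acting on the vectorized image, whereas a general blur is a full matrix, which the generalized formulation models via a Kronecker decomposition (sketched here as a sum of Kronecker terms):

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 6
X = rng.standard_normal((m, n))    # image (one spectral band)
Bv = rng.standard_normal((m, m))   # vertical (column) blur
Bh = rng.standard_normal((n, n))   # horizontal (row) blur

# Separable blurring: blur columns with Bv and rows with Bh.
Y = Bv @ X @ Bh.T

# The same operation as a single matrix on the vectorized image:
# with row-major vec, vec(Bv X Bh^T) = (Bv kron Bh) vec(X).
assert np.allclose(Y.ravel(), np.kron(Bv, Bh) @ X.ravel())

# A general (non-separable) blur is an arbitrary mn x mn matrix; a
# Kronecker decomposition represents it as a sum of a few Kronecker
# terms, recovering the separable case when one term suffices.
G = sum(np.kron(rng.standard_normal((m, m)),
                rng.standard_normal((n, n))) for _ in range(3))
assert G.shape == (m * n, m * n)
```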

[CV-93] Effectiveness of learning-based image codecs on fingerprint storage

链接: https://arxiv.org/abs/2409.18730
作者: Daniele Mari,Saverio Cavasin,Simone Milani,Mauro Conti
关键词-EN: learning-based coding techniques, success of learning-based, development of learning-based, learning-based image coding, learning-based image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WIFS 2024

点击查看摘要

Abstract:The success of learning-based coding techniques and the development of learning-based image coding standards, such as JPEG-AI, point towards the adoption of such solutions in different fields, including the storage of biometric data, like fingerprints. However, the peculiar nature of learning-based compression artifacts poses several issues concerning their impact and effectiveness on extracting biometric features and landmarks, e.g., minutiae. This problem is utterly stressed by the fact that most models are trained on natural color images, whose characteristics are very different from usual biometric images, e.g., fingerprint or iris pictures. These issues therefore need to be carefully questioned and investigated, such analysis being still largely unexplored. This study represents the first investigation into the adaptability of learning-based image codecs for the storage of fingerprint images, measuring their impact on the extraction and characterization of minutiae. Experimental results show that at a fixed rate point, learned solutions considerably outperform previous fingerprint coding standards, like JPEG2000, both in terms of distortion and minutiae preservation. Indeed, experimental results prove that the peculiarities of learned compression artifacts do not prevent automatic fingerprint identification (since minutiae types and locations are not significantly altered), nor do they compromise image quality for human visual inspection (with gains of 47.8% in BD-rate and +3.97 dB in PSNR, respectively).

[CV-94] Multi-modal Medical Image Fusion For Non-Small Cell Lung Cancer Classification

链接: https://arxiv.org/abs/2409.18715
作者: Salma Hassan,Hamad Al Hammadi,Ibrahim Mohammed,Muhammad Haris Khan
关键词-EN: cancer mortality worldwide, nuanced subtype classification, non-small cell lung, mortality worldwide, complex issue
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The early detection and nuanced subtype classification of non-small cell lung cancer (NSCLC), a predominant cause of cancer mortality worldwide, is a critical and complex issue. In this paper, we introduce an innovative integration of multi-modal data, synthesizing fused medical imaging (CT and PET scans) with clinical health records and genomic data. This unique fusion methodology leverages advanced machine learning models, notably MedClip and BEiT, for sophisticated image feature extraction, setting a new standard in computational oncology. Our research surpasses existing approaches, as evidenced by a substantial enhancement in NSCLC detection and classification precision. The results showcase notable improvements across key performance metrics, including accuracy, precision, recall, and F1-score. Specifically, our leading multi-modal classifier model records an impressive accuracy of 94.04%. We believe that our approach has the potential to transform NSCLC diagnostics, facilitating earlier detection and more effective treatment planning and, ultimately, leading to superior patient outcomes in lung cancer care.

[CV-95] 3DPX: Single Panoramic X-ray Analysis Guided by 3D Oral Structure Reconstruction

链接: https://arxiv.org/abs/2409.18701
作者: Xiaoshuang Li,Zimo Huang,Mingyuan Meng,Eduardo Delamare,Dagan Feng,Lei Bi,Bin Sheng,Lingyong Jiang,Bo Li,Jinman Kim
关键词-EN: Panoramic X-ray, dentistry practice owing, low cost, prevalent modality, modality in dentistry
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Panoramic X-ray (PX) is a prevalent modality in dentistry practice owing to its wide availability and low cost. However, as a 2D projection of a 3D structure, PX suffers from anatomical information loss and PX diagnosis is limited compared to that with 3D imaging modalities. 2D-to-3D reconstruction methods have been explored for the ability to synthesize the absent 3D anatomical information from 2D PX for use in PX image analysis. However, there are challenges in leveraging such 3D synthesized reconstructions. First, inferring 3D depth from 2D images remains a challenging task with limited accuracy. The second challenge is the joint analysis of 2D PX with its 3D synthesized counterpart, with the aim to maximize the 2D-3D synergy while minimizing the errors arising from the synthesized image. In this study, we propose a new method termed 3DPX - PX image analysis guided by 2D-to-3D reconstruction, to overcome these challenges. 3DPX consists of (i) a novel progressive reconstruction network to improve 2D-to-3D reconstruction and, (ii) a contrastive-guided bidirectional multimodality alignment module for 3D-guided 2D PX classification and segmentation tasks. The reconstruction network progressively reconstructs 3D images with knowledge imposed on the intermediate reconstructions at multiple pyramid levels and incorporates Multilayer Perceptrons to improve semantic understanding. The downstream networks leverage the reconstructed images as 3D anatomical guidance to the PX analysis through feature alignment, which increases the 2D-3D synergy with bidirectional feature projection and decreases the impact of potential errors with contrastive guidance. Extensive experiments on two oral datasets involving 464 studies demonstrate that 3DPX outperforms the state-of-the-art methods in various tasks including 2D-to-3D reconstruction, PX classification and lesion segmentation.

[CV-96] Towards Integrating Epistemic Uncertainty Estimation into the Radiotherapy Workflow

链接: https://arxiv.org/abs/2409.18628
作者: Marvin Tom Teichmann,Manasi Datar,Lisa Kratzke,Fernando Vega,Florin C. Ghesu
关键词-EN: contouring target structures, ensuring treatment efficacy, epistemic uncertainty estimation, uncertainty estimation, OOD detection
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Keywords: Epistemic Uncertainty - Out-of-Distribution Detection - CT Segmentation - OAR contouring - Radiotherapy

点击查看摘要

Abstract:The precision of contouring target structures and organs-at-risk (OAR) in radiotherapy planning is crucial for ensuring treatment efficacy and patient safety. Recent advancements in deep learning (DL) have significantly improved OAR contouring performance, yet the reliability of these models, especially in the presence of out-of-distribution (OOD) scenarios, remains a concern in clinical settings. This application study explores the integration of epistemic uncertainty estimation within the OAR contouring workflow to enable OOD detection in clinically relevant scenarios, using specifically compiled data. Furthermore, we introduce an advanced statistical method for OOD detection to enhance the methodological framework of uncertainty estimation. Our empirical evaluation demonstrates that epistemic uncertainty estimation is effective in identifying instances where model predictions are unreliable and may require an expert review. Notably, our approach achieves an AUC-ROC of 0.95 for OOD detection, with a specificity of 0.95 and a sensitivity of 0.92 for implant cases, underscoring its efficacy. This study addresses significant gaps in the current research landscape, such as the lack of ground truth for uncertainty estimation and limited empirical evaluations. Additionally, it provides a clinically relevant application of epistemic uncertainty estimation in an FDA-approved and widely used clinical solution for OAR segmentation from Varian, a Siemens Healthineers company, highlighting its practical benefits.

[CV-97] Metasurface-generated large and arbitrary analog convolution kernels for accelerated machine vision

链接: https://arxiv.org/abs/2409.18614
作者: Ruiqi Liang,Shuai Wang,Yiying Dong,Liu Li,Ying Kuang,Bohan Zhang,Yuanmu Yang
关键词-EN: tackling complex challenges, convolutional neural networks, machine vision tasks, rapidly evolving field, machine vision
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the rapidly evolving field of artificial intelligence, convolutional neural networks are essential for tackling complex challenges such as machine vision and medical diagnosis. Recently, to address the challenges in processing speed and power consumption of conventional digital convolution operations, many optical components have been suggested to replace the digital convolution layer in the neural network, accelerating various machine vision tasks. Nonetheless, the analog nature of the optical convolution kernel has not been fully explored. Here, we develop a spatial frequency domain training method to create arbitrarily shaped analog convolution kernels using an optical metasurface as the convolution layer, with its receptive field largely surpassing digital convolution kernels. By employing spatial multiplexing, the multiple parallel convolution kernels with both positive and negative weights are generated under the incoherent illumination condition. We experimentally demonstrate a 98.59% classification accuracy on the MNIST dataset, with simulations showing 92.63% and 68.67% accuracy on the Fashion-MNIST and CIFAR-10 datasets with additional digital layers. This work underscores the unique advantage of analog optical convolution, offering a promising avenue to accelerate machine vision tasks, especially in edge devices.

[CV-98] Med-IC: Fusing a Single Layer Involution with Convolutions for Enhanced Medical Image Classification and Segmentation

链接: https://arxiv.org/abs/2409.18506
作者: Md. Farhadul Islam,Sarah Zabeen,Meem Arafat Manab,Mohammad Rakibul Hasan Mahin,Joyanta Jyoti Mondal,Md. Tanzim Reza,Md Zahidul Hasan,Munima Haque,Farig Sadeque,Jannatun Noor
关键词-EN: similar characteristics, resemble cells, majority of medical, medical images, cell region
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, 4 tables, preprint submitted to an Elsevier journal

点击查看摘要

Abstract:The majority of medical images, especially those that resemble cells, have similar characteristics. These images, which occur in a variety of shapes, often show abnormalities in the organ or cell region. The convolution operation possesses a restricted capability to extract visual patterns across several spatial regions of an image. The involution process, which is the inverse operation of convolution, complements this inherent lack of spatial information extraction present in convolutions. In this study, we investigate how applying a single layer of involution prior to a convolutional neural network (CNN) architecture can significantly improve classification and segmentation performance, with a comparatively negligible amount of weight parameters. The study additionally shows how excessive use of involution layers might result in inaccurate predictions in a particular type of medical image. According to our findings from experiments, the strategy of adding only a single involution layer before a CNN-based model outperforms most of the previous works.
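作为参考,涉及到的 involution 运算(逐像素生成卷积核、跨通道共享)可以用几行 NumPy 勾勒出来。下面是一个单组(single-group)的极简示意版本,仅用于说明原理:核生成权重 `w_gen`、数据布局等均为本文假设,并非论文的实际实现:

```python
import numpy as np

def involution2d(x, w_gen, K=3):
    """Minimal single-group involution sketch.

    A K x K kernel is generated at EACH spatial position from that
    position's own feature vector (via w_gen), then shared across all
    channels -- the inverse of convolution's spatial sharing.
    x: (H, W, C) feature map; w_gen: (C, K*K) kernel-generation weights.
    """
    H, W, C = x.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    kernels = x @ w_gen                      # (H, W, K*K): one kernel per pixel
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + K, j:j + K, :]          # (K, K, C) neighborhood
            k = kernels[i, j].reshape(K, K, 1)       # shared over channels
            out[i, j] = (patch * k).sum(axis=(0, 1))
    return out
```

若 `w_gen` 使每个位置的核仅中心为 1、其余为 0,该运算退化为恒等映射,可据此做正确性检查。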

[CV-99] DRL-STNet: Unsupervised Domain Adaptation for Cross-modality Medical Image Segmentation via Disentangled Representation Learning MICCAI2024

链接: https://arxiv.org/abs/2409.18340
作者: Hui Lin,Florian Schiffers,Santiago López-Tapia,Neda Tavakoli,Daniel Kim,Aggelos K. Katsaggelos
关键词-EN: Unsupervised domain adaptation, cross-modality data scenarios, Unsupervised domain, medical image segmentation, cross-modality medical image
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024 Challenge, FLARE Challenge, Unsupervised domain adaptation, Organ segmentation, Feature disentanglement, Self-training

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) is essential for medical image segmentation, especially in cross-modality data scenarios. UDA aims to transfer knowledge from a labeled source domain to an unlabeled target domain, thereby reducing the dependency on extensive manual annotations. This paper presents DRL-STNet, a novel framework for cross-modality medical image segmentation that leverages generative adversarial networks (GANs), disentangled representation learning (DRL), and self-training (ST). Our method leverages DRL within a GAN to translate images from the source to the target modality. Then, the segmentation model is initially trained with these translated images and corresponding source labels and then fine-tuned iteratively using a combination of synthetic and real images with pseudo-labels and real labels. The proposed framework exhibits superior performance in abdominal organ segmentation on the FLARE challenge dataset, surpassing state-of-the-art methods by 11.4% in the Dice similarity coefficient and by 13.1% in the Normalized Surface Dice metric, achieving scores of 74.21% and 80.69%, respectively. The average running time is 41 seconds, and the area under the GPU memory-time curve is 11,292 MB. These results indicate the potential of DRL-STNet for enhancing cross-modality medical image segmentation tasks.

[CV-100] Photon Inhibition for Energy-Efficient Single-Photon Imaging ECCV2024

链接: https://arxiv.org/abs/2409.18337
作者: Lucas J. Koerner,Shantanu Gupta,Atul Ingle,Mohit Gupta
关键词-EN: challenging imaging applications, Single-photon, photon, single-photon avalanche diode, CMOS camera
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
*备注: Accepted for ECCV 2024. Supplementary material and code available at this https URL

点击查看摘要

Abstract:Single-photon cameras (SPCs) are emerging as sensors of choice for various challenging imaging applications. One class of SPCs based on the single-photon avalanche diode (SPAD) detects individual photons using an avalanche process; the raw photon data can then be processed to extract scene information under extremely low light, high dynamic range, and rapid motion. Yet, single-photon sensitivity in SPADs comes at a cost – each photon detection consumes more energy than that of a CMOS camera. This avalanche power significantly limits sensor resolution and could restrict widespread adoption of SPAD-based SPCs. We propose a computational-imaging approach called photon inhibition to address this challenge. Photon inhibition strategically allocates detections in space and time based on downstream inference task goals and resource constraints. We develop lightweight, on-sensor computational inhibition policies that use past photon data to disable SPAD pixels in real-time, to select the most informative future photons. As case studies, we design policies tailored for image reconstruction and edge detection, and demonstrate, both via simulations and real SPC captured data, considerable reduction in photon detections (over 90% of photons) while maintaining task performance metrics. Our work raises the question of “which photons should be detected?”, and paves the way for future energy-efficient single-photon imaging.
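为便于理解"基于历史光子数据关闭像素"这一思路,下面给出一个刻意简化的逐像素抑制策略示意(阈值规则与按帧分窗均为本文假设,论文中的实际策略由下游任务目标驱动,远比此复杂):

```python
import numpy as np

def inhibition_policy(history, budget=2, window=8):
    """Toy per-pixel photon inhibition sketch.

    Disable a SPAD pixel for the next frame once it detected more than
    `budget` photons over the last `window` frames, on the intuition that
    heavily-firing pixels yield diminishing information per joule.
    history: (T, H, W) array of binary per-frame detection maps.
    Returns an (H, W) boolean mask; True = pixel stays enabled.
    """
    recent = history[-window:].sum(axis=0)   # (H, W) recent photon counts
    return recent <= budget
```

实际系统中,阈值应依任务(如重建 vs. 边缘检测)与能耗预算自适应调整。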

[CV-101] Synthesizing beta-amyloid PET images from T1-weighted Structural MRI: A Preliminary Study

链接: https://arxiv.org/abs/2409.18282
作者: Qing Lyu,Jin Young Kim,Jeongchul Kim,Christopher T Whitlow
关键词-EN: Beta-amyloid positron emission, positron emission tomography, Alzheimer disease, Beta-amyloid positron, tool in Alzheimer
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Beta-amyloid positron emission tomography (Aβ-PET) imaging has become a critical tool in Alzheimer’s disease (AD) research and diagnosis, providing insights into the pathological accumulation of amyloid plaques, one of the hallmarks of AD. However, the high cost, limited availability, and exposure to radioactivity restrict the widespread use of Aβ-PET imaging, leading to a scarcity of comprehensive datasets. Previous studies have suggested that structural magnetic resonance imaging (MRI), which is more readily available, may serve as a viable alternative for synthesizing Aβ-PET images. In this study, we propose an approach to utilize 3D diffusion models to synthesize Aβ-PET images from T1-weighted MRI scans, aiming to overcome the limitations associated with direct PET imaging. Our method generates high-quality Aβ-PET images for cognitively normal cases, although it is less effective for mild cognitive impairment (MCI) patients due to the variability in Aβ deposition patterns among subjects. Our preliminary results suggest that incorporating additional data, such as a larger sample of MCI cases and multi-modality information including clinical and demographic details, cognitive and functional assessments, and longitudinal data, may be necessary to improve Aβ-PET image synthesis for MCI patients.

[CV-102] Developing a Dual-Stage Vision Transformer Model for Lung Disease Classification

链接: https://arxiv.org/abs/2409.18257
作者: Anirudh Mazumder,Jianguo Liu
关键词-EN: United States, million people, Lung diseases, prevalent problem, Artificial Intelligence
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 3 pages, 3 figures, Applied to the IEEE MetroCon 2024 Conference

点击查看摘要

Abstract:Lung diseases have become a prevalent problem throughout the United States, affecting over 34 million people. Accurate and timely diagnosis of the different types of lung diseases is critical, and Artificial Intelligence (AI) methods could speed up these processes. A dual-stage vision transformer is built throughout this research by integrating a Vision Transformer (ViT) and a Swin Transformer to classify 14 different lung diseases from X-ray scans of patients with these diseases. The proposed model achieved an accuracy of 92.06% when making predictions on an unseen testing subset of the dataset after data preprocessing and training the neural network. The model showed promise for accurately classifying lung diseases and diagnosing patients who suffer from these harmful diseases.

[CV-103] PNR: Physics-informed Neural Representation for high-resolution LFM reconstruction

链接: https://arxiv.org/abs/2409.18223
作者: Jiayin Zhao,Zhifeng Zhao,Jiamin Wu,Tao Yu,Hui Qiao
关键词-EN: Light field microscopy, Light field, efficiently capture high-resolution, field microscopy, widely utilized
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Light field microscopy (LFM) has been widely utilized in various fields for its capability to efficiently capture high-resolution 3D scenes. Despite the rapid advancements in neural representations, there are few methods specifically tailored for microscopic scenes. Existing approaches often do not adequately address issues such as the loss of high-frequency information due to defocus and sample aberration, resulting in suboptimal performance. In addition, existing methods, including RLD, INR, and supervised U-Net, face challenges such as sensitivity to initial estimates, reliance on extensive labeled data, and low computational efficiency, all of which significantly diminish the practicality in complex biological scenarios. This paper introduces PNR (Physics-informed Neural Representation), a method for high-resolution LFM reconstruction that significantly enhances performance. Our method incorporates an unsupervised and explicit feature representation approach, resulting in a 6.1 dB improvement in PSNR than RLD. Additionally, our method employs a frequency-based training loss, enabling better recovery of high-frequency details, which leads to a reduction in LPIPS by at least half compared to SOTA methods (1.762 V.S. 3.646 of DINER). Moreover, PNR integrates a physics-informed aberration correction strategy that optimizes Zernike polynomial parameters during optimization, thereby reducing the information loss caused by aberrations and improving spatial resolution. These advancements make PNR a promising solution for long-term high-resolution biological imaging applications. Our code and dataset will be made publicly available.

[CV-104] Toward Efficient Deep Blind RAW Image Restoration ICIP

链接: https://arxiv.org/abs/2409.18204
作者: Marcos V. Conde,Florin Vasluianu,Radu Timofte
关键词-EN: depart from RGB, Image Signal Processor, Multiple low-vision tasks, RGB images, improving the quality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE International Conference on Image Processing (ICIP) 2024. arXiv admin note: text overlap with arXiv:2312.15487

点击查看摘要

Abstract:Multiple low-vision tasks such as denoising, deblurring and super-resolution depart from RGB images and further reduce the degradations, improving the quality. However, modeling the degradations in the sRGB domain is complicated because of the Image Signal Processor (ISP) transformations. Despite this known issue, very few methods in the literature work directly with sensor RAW images. In this work we tackle image restoration directly in the RAW domain. We design a new realistic degradation pipeline for training deep blind RAW restoration models. Our pipeline considers realistic sensor noise, motion blur, camera shake, and other common degradations. The models trained with our pipeline and data from multiple sensors, can successfully reduce noise and blur, and recover details in RAW images captured from different cameras. To the best of our knowledge, this is the most exhaustive analysis on RAW image restoration. Code available at this https URL

机器学习

[LG-0] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation ECCV2024

链接: https://arxiv.org/abs/2409.18964
作者: Shaowei Liu,Zhongzheng Ren,Saurabh Gupta,Shenlong Wang
关键词-EN: temporally consistent video, input condition, force and torque, method that converts, converts a single
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to ECCV 2024. Project page: this https URL

点击查看摘要

Abstract:We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., force and torque applied to an object in the image) to produce a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are even precisely controllable, showcasing superior results over existing data-driven image-to-video generation works through quantitative comparison and comprehensive user study. PhysGen’s resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Project page: this https URL

[LG-1] Exploring Token Pruning in Vision State Space Models NEURIPS’24

链接: https://arxiv.org/abs/2409.18962
作者: Zheng Zhan,Zhenglun Kong,Yifan Gong,Yushu Wu,Zichong Meng,Hangyu Zheng,Xuan Shen,Stratis Ioannidis,Wei Niu,Pu Zhao,Yanzhi Wang
关键词-EN: State Space Models, powerful vision foundation, keeping linear computational, linear computational complexity, computational complexity compared
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NeurIPS’24

点击查看摘要

Abstract:State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observations that the final prediction in vision transformers (ViTs) is only based on a subset of most informative tokens, we take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning. However, direct applications of existing token pruning techniques designed for ViTs fail to deliver good performance, even with extensive fine-tuning. To address this issue, we revisit the unique computational characteristics of SSMs and discover that naive application disrupts the sequential token positions. This insight motivates us to design a novel and general token pruning method specifically for SSM-based vision models. We first introduce a pruning-aware hidden state alignment method to stabilize the neighborhood of remaining tokens for performance enhancement. Besides, based on our detailed analysis, we propose a token importance evaluation method adapted for SSM models, to guide the token pruning. With efficient implementation and practical acceleration methods, our method brings actual speedup. Extensive experiments demonstrate that our approach can achieve significant computation reduction with minimal impact on performance across different tasks. Notably, we achieve 81.7% accuracy on ImageNet with a 41.6% reduction in the FLOPs for pruned PlainMamba-L3. Furthermore, our work provides deeper insights into understanding the behavior of SSM-based vision models for future research.
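摘要强调的关键点是:对 SSM 视觉模型做 token 剪枝时,不能破坏剩余 token 的先后顺序。下面用 NumPy 勾勒"按重要性选 top-k、再按原序重排"的保序剪枝骨架(重要性分数如何计算是论文的核心贡献,此处仅作为输入假设给出):

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.6):
    """Order-preserving importance-based token pruning sketch.

    tokens: (N, d) token sequence; scores: (N,) importance per token.
    Keeps the top-k tokens by score but re-sorts the surviving indices so
    the pruned sequence retains the original (sequential) token order --
    the property the paper identifies as critical for SSMs.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    top = np.argsort(scores)[-k:]   # indices of the k most important tokens
    kept = np.sort(top)             # restore original sequence order
    return tokens[kept], kept
```

相比之下,面向 ViT 的剪枝方法通常不关心保序,这正是其直接迁移到 SSM 上失效的直观原因之一。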

[LG-2] O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions

链接: https://arxiv.org/abs/2409.18959
作者: Gen Li,Yuling Yan
关键词-EN: Score-based diffusion models, Score-based diffusion, achieved remarkable success, diffusion models, generative tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Score-based diffusion models, which generate new data by learning to reverse a diffusion process that perturbs data from the target distribution into noise, have achieved remarkable success across various generative tasks. Despite their superior empirical performance, existing theoretical guarantees are often constrained by stringent assumptions or suboptimal convergence rates. In this paper, we establish a fast convergence theory for a popular SDE-based sampler under minimal assumptions. Our analysis shows that, provided \ell_2-accurate estimates of the score functions, the total variation distance between the target and generated distributions is upper bounded by O(d/T) (ignoring logarithmic factors), where d is the data dimensionality and T is the number of steps. This result holds for any target distribution with finite first-order moment. To our knowledge, this improves upon existing convergence theory for both the SDE-based sampler and another ODE-based sampler, while imposing minimal assumptions on the target data distribution and score estimates. This is achieved through a novel set of analytical tools that provides a fine-grained characterization of how the error propagates at each step of the reverse process.
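按摘要的记号,其主结论可以示意性地写成如下形式(常数与对数因子的具体幂次在此省略,精确陈述以论文为准):

```latex
% Schematic restatement of the headline bound: with \ell_2-accurate
% score estimates, for any target with finite first-order moment,
\mathsf{TV}\bigl(q_{\mathrm{data}},\; p_{\mathrm{sampler}}\bigr)
  \;\le\; C \,\frac{d}{T}\,\cdot \mathrm{polylog}(T),
\qquad d = \text{data dimension},\quad T = \text{number of reverse steps}.
```

即总变差距离随步数 T 以近似 1/T 的速率收敛,且对维度 d 仅线性依赖。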

[LG-3] LML: Language Model Learning a Dataset for Data-Augmented Prediction

链接: https://arxiv.org/abs/2409.18957
作者: Praneeth Vadlapati
关键词-EN: Large Language Models, Language Model Learning, Large Language, Machine Learning Model, Explainable Machine Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: First version

点击查看摘要

Abstract:This paper introduces a new approach to using Large Language Models (LLMs) for classification tasks, which are typically handled using Machine Learning (ML) models. Unlike ML models that rely heavily on data cleaning and feature engineering, this method streamlines the process using LLMs. This paper proposes a new concept called “Language Model Learning (LML)” powered by a new method called “Data-Augmented Prediction (DAP)”. The classification is performed by LLMs using a method similar to humans manually exploring and understanding the data and deciding classifications using data as a reference. Training data is summarized and evaluated to determine the features that lead to the classification of each label the most. In the process of DAP, the system uses the data summary to automatically create a query, which is used to retrieve relevant rows from the dataset. A classification is generated by the LLM using data summary and relevant rows, ensuring satisfactory accuracy even with complex data. Usage of data summary and similar data in DAP ensures context-aware decision-making. The proposed method uses the words “Act as an Explainable Machine Learning Model” in the prompt to enhance the interpretability of the predictions by allowing users to review the logic behind each prediction. In some test cases, the system scored an accuracy above 90%, proving the effectiveness of the system and its potential to outperform conventional ML models in various scenarios. The code is available at this https URL
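摘要描述的 DAP 流程(数据摘要 → 自动构造查询 → 检索相关行 → 连同摘要送入 LLM 分类)可以用一个骨架来示意。注意:以下的摘要方式(按标签求特征均值)、检索方式(欧氏距离近邻)与多数投票的占位分类器均为本文为使示例可运行而做的假设,`call_llm` 只是真实 LLM 调用的占位符,并非论文的实际实现:

```python
def summarize(rows):
    """Per-label mean of each numeric feature, as a crude 'data summary'."""
    by_label = {}
    for feats, label in rows:
        by_label.setdefault(label, []).append(feats)
    return {label: [sum(col) / len(col) for col in zip(*fl)]
            for label, fl in by_label.items()}

def retrieve(rows, query, k=3):
    """Rows most similar to the query (squared Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(rows, key=lambda r: dist(r[0], query))[:k]

def call_llm(prompt):
    """Placeholder for a real LLM API call (not exercised in this sketch)."""
    ...

def classify(rows, query):
    summary = summarize(rows)
    neighbors = retrieve(rows, query)
    prompt = (
        "Act as an Explainable Machine Learning Model.\n"
        f"Data summary (per-label feature means): {summary}\n"
        f"Relevant rows: {neighbors}\n"
        f"Classify this input and explain your reasoning: {query}"
    )
    # In place of call_llm(prompt), vote with the retrieved labels so the
    # sketch runs end-to-end deterministically.
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count), prompt
```

提示词首句 "Act as an Explainable Machine Learning Model" 直接来自论文,用于让 LLM 在给出分类的同时解释依据。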

[LG-4] RepairBench: Leaderboard of Frontier Models for Program Repair

链接: https://arxiv.org/abs/2409.18952
作者: André Silva,Martin Monperrus
关键词-EN: repair buggy software, program repair, buggy software, software by producing, AI-driven program repair
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:AI-driven program repair uses AI models to repair buggy software by producing patches. Rapid advancements in AI surely impact state-of-the-art performance of program repair. Yet, grasping this progress requires frequent and standardized evaluations. We propose RepairBench, a novel leaderboard for AI-driven program repair. The key characteristics of RepairBench are: 1) it is execution-based: all patches are compiled and executed against a test suite, 2) it assesses frontier models in a frequent and standardized way. RepairBench leverages two high-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier models against real-world program repair tasks. We publicly release the evaluation framework of RepairBench. We will update the leaderboard as new frontier models are released.

[LG-5] Spectral Wavelet Dropout: Regularization in the Wavelet Domain ICMLA

链接: https://arxiv.org/abs/2409.18951
作者: Rinor Cakaj,Jens Mehnert,Bin Yang
关键词-EN: convolutional neural networks, techniques help prevent, ability of convolutional, convolutional neural, Spectral Wavelet Dropout
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by The International Conference on Machine Learning and Applications (ICMLA) 2024

点击查看摘要

Abstract:Regularization techniques help prevent overfitting and therefore improve the ability of convolutional neural networks (CNNs) to generalize. One reason for overfitting is the complex co-adaptations among different parts of the network, which make the CNN dependent on their joint response rather than encouraging each part to learn a useful feature representation independently. Frequency domain manipulation is a powerful strategy for modifying data that has temporal and spatial coherence by utilizing frequency decomposition. This work introduces Spectral Wavelet Dropout (SWD), a novel regularization method that includes two variants: 1D-SWD and 2D-SWD. These variants improve CNN generalization by randomly dropping detailed frequency bands in the discrete wavelet decomposition of feature maps. Our approach distinguishes itself from the pre-existing Spectral “Fourier” Dropout (2D-SFD), which eliminates coefficients in the Fourier domain. Notably, SWD requires only a single hyperparameter, unlike the two required by SFD. We also extend the literature by implementing a one-dimensional version of Spectral “Fourier” Dropout (1D-SFD), setting the stage for a comprehensive comparison. Our evaluation shows that both 1D and 2D SWD variants have competitive performance on CIFAR-10/100 benchmarks relative to both 1D-SFD and 2D-SFD. Specifically, 1D-SWD has a significantly lower computational complexity compared to 1D/2D-SFD. In the Pascal VOC Object Detection benchmark, SWD variants surpass 1D-SFD and 2D-SFD in performance and demonstrate lower computational complexity during training.
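"在特征图的离散小波分解中随机丢弃细节频带"这一核心操作,可以用单层 2D Haar 小波变换最简地演示。以下是本文的示意实现(单层 Haar、对三个细节带独立丢弃等设定均为便于演示的假设,论文中的 1D/2D-SWD 细节以原文为准):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an even-sized (H, W) array."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)  # LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Exact inverse of haar_dwt2."""
    H, W = LL.shape
    x = np.empty((2 * H, 2 * W))
    x[0::2, 0::2] = (LL + LH + HL + HH) / 2
    x[0::2, 1::2] = (LL - LH + HL - HH) / 2
    x[1::2, 0::2] = (LL + LH - HL - HH) / 2
    x[1::2, 1::2] = (LL - LH - HL + HH) / 2
    return x

def spectral_wavelet_dropout(x, p, rng):
    """Zero each detail band (LH/HL/HH) independently with probability p,
    keep the approximation band LL, then reconstruct."""
    LL, LH, HL, HH = haar_dwt2(x)
    bands = [LL] + [band * (rng.random() >= p) for band in (LH, HL, HH)]
    return haar_idwt2(*bands)
```

当 p=0 时变换–逆变换精确复原输入,这也是检查 DWT 实现正确性的简便方法;p 即摘要所说的 SWD 仅需的单一超参数。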

[LG-6] A-FedPD: Aligning Dual-Drift is All Federated Primal-Dual Learning Needs

链接: https://arxiv.org/abs/2409.18915
作者: Yan Sun,Li Shen,Dacheng Tao
关键词-EN: juggling data privacy, federated primal dual, popular paradigm, paradigm for juggling, juggling data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As a popular paradigm for juggling data privacy and collaborative training, federated learning (FL) is flourishing to distributively process the large scale of heterogeneous datasets on edged clients. Due to bandwidth limitations and security considerations, it ingeniously splits the original problem into multiple subproblems to be solved in parallel, which empowers primal dual solutions to great application values in FL. In this paper, we review the recent development of classical federated primal dual methods and point out a serious common defect of such methods in non-convex scenarios, which we say is a “dual drift” caused by dual hysteresis of those longstanding inactive clients under partial participation training. To further address this problem, we propose a novel Aligned Federated Primal Dual (A-FedPD) method, which constructs virtual dual updates to align global consensus and local dual variables for those protracted unparticipated local clients. Meanwhile, we provide a comprehensive analysis of the optimization and generalization efficiency for the A-FedPD method on smooth non-convex objectives, which confirms its high efficiency and practicality. Extensive experiments are conducted on several classical FL setups to validate the effectiveness of our proposed method.

[LG-7] Best Arm Identification with Minimal Regret

Link: https://arxiv.org/abs/2409.18909
Authors: Junwen Yang, Vincent Y. F. Tan, Tianyuan Jin
Keywords: necessitate responsible experimentation, Motivated by real-world, responsible experimentation, real-world applications, applications that necessitate
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*Comments: Preprint

Abstract:Motivated by real-world applications that necessitate responsible experimentation, we introduce the problem of best arm identification (BAI) with minimal regret. This innovative variant of the multi-armed bandit problem elegantly amalgamates two of its most ubiquitous objectives: regret minimization and BAI. More precisely, the agent’s goal is to identify the best arm with a prescribed confidence level \delta , while minimizing the cumulative regret up to the stopping time. Focusing on single-parameter exponential families of distributions, we leverage information-theoretic techniques to establish an instance-dependent lower bound on the expected cumulative regret. Moreover, we present an intriguing impossibility result that underscores the tension between cumulative regret and sample complexity in fixed-confidence BAI. Complementarily, we design and analyze the Double KL-UCB algorithm, which achieves asymptotic optimality as the confidence level tends to zero. Notably, this algorithm employs two distinct confidence bounds to guide arm selection in a randomized manner. Our findings elucidate a fresh perspective on the inherent connections between regret minimization and BAI.
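For readers unfamiliar with KL-based confidence bounds, the sketch below computes a standard Bernoulli KL-UCB index by bisection. This is generic background on the index family the paper builds on, not the paper's Double KL-UCB algorithm, and the names are my own.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mean, pulls, t, iters=50):
    # Largest q >= mean such that pulls * KL(mean, q) <= log(t),
    # found by bisection: the classic KL-UCB index for a Bernoulli arm
    target = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

index = kl_ucb(mean=0.5, pulls=10, t=100)
```

The index shrinks toward the empirical mean as an arm accumulates pulls, which is the mechanism that lets a single statistic serve both regret minimization and identification.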

[LG-8] In-depth Analysis of Privacy Threats in Federated Learning for Medical Data

Link: https://arxiv.org/abs/2409.18907
Authors: Badhan Chandra Das, M. Hadi Amini, Yanzhao Wu
Keywords: Federated learning, safeguard sensitive patient, promising machine learning, machine learning technique, medical
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Federated learning is emerging as a promising machine learning technique in the medical field for analyzing medical images, as it is considered an effective method to safeguard sensitive patient data and comply with privacy regulations. However, recent studies have revealed that the default settings of federated learning may inadvertently expose private training data to privacy attacks. Thus, the intensity of such privacy risks and potential mitigation strategies in the medical domain remain unclear. In this paper, we make three original contributions to privacy risk analysis and mitigation in federated learning for medical data. First, we propose a holistic framework, MedPFL, for analyzing privacy risks in processing medical data in the federated learning environment and developing effective mitigation strategies for protecting privacy. Second, through our empirical analysis, we demonstrate the severe privacy risks in federated learning to process medical images, where adversaries can accurately reconstruct private medical images by performing privacy attacks. Third, we illustrate that the prevalent defense mechanism of adding random noises may not always be effective in protecting medical images against privacy attacks in federated learning, which poses unique and pressing challenges related to protecting the privacy of medical data. Furthermore, the paper discusses several unique research questions related to the privacy protection of medical data in the federated learning environment. We conduct extensive experiments on several benchmark medical image datasets to analyze and mitigate the privacy risks associated with federated learning for medical data.

[LG-9] Probabilistic Analysis of Least Squares Orthogonal Projection and QR Factorization Algorithms Subject to Gaussian Noise

Link: https://arxiv.org/abs/2409.18905
Authors: Ali Lotfi, Julien Langou, Mohammad Meysami
Keywords: Gaussian noise, Liesen, condition number, Abstract, presented in Theorem
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*Comments:

Abstract:In this paper, we extend the work of Liesen et al. (2002), which analyzes how the condition number of an orthonormal matrix Q changes when a column is added ([Q, c]), particularly focusing on the perpendicularity of c to the span of Q. Their result, presented in Theorem 2.3 of Liesen et al. (2002), assumes exact arithmetic and orthonormality of Q, which is a strong assumption when applying these results to numerical methods such as QR factorization algorithms. In our work, we address this gap by deriving bounds on the condition number increase for a matrix B without assuming perfect orthonormality, even when a column is not perfectly orthogonal to the span of B. This framework allows us to analyze QR factorization methods where orthogonalization is imperfect and subject to Gaussian noise. We also provide results on the performance of orthogonal projection and least squares under Gaussian noise, further supporting the development of this theory.
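A quick numerical illustration of the phenomenon the paper studies (my own toy experiment, not from the paper): appending a unit column exactly orthogonal to span(Q) keeps the condition number at 1, while a nearly parallel column, mimicking imperfect orthogonalization under noise, inflates it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal basis Q via QR of a random 50 x 5 matrix
Q, _ = np.linalg.qr(rng.standard_normal((50, 5)))

# A unit column exactly orthogonal to span(Q): condition number stays 1
c = rng.standard_normal(50)
c_perp = c - Q @ (Q.T @ c)            # project out span(Q)
c_perp /= np.linalg.norm(c_perp)
cond_orth = np.linalg.cond(np.column_stack([Q, c_perp]))

# A nearly parallel column (imperfect orthogonalization): conditioning degrades
c_bad = Q[:, 0] + 1e-3 * c_perp
c_bad /= np.linalg.norm(c_bad)
cond_bad = np.linalg.cond(np.column_stack([Q, c_bad]))
```

Here `cond_orth` is 1 up to rounding, whereas `cond_bad` is several orders of magnitude larger, which is exactly the sensitivity the paper's bounds quantify without assuming perfect orthonormality.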

[LG-10] Multi-Source Hard and Soft Information Fusion Approach for Accurate Cryptocurrency Price Movement Prediction

Link: https://arxiv.org/abs/2409.18895
Authors: Saeed Mohammadi Dashtaki, Mehdi Hosseini Chagahi, Behzad Moshiri, Md. Jalil Piran
Keywords: cryptocurrency price trends, field is accurately, cryptocurrency price, information, price
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:One of the most important challenges in the financial and cryptocurrency field is accurately predicting cryptocurrency price trends. Leveraging artificial intelligence (AI) is beneficial in addressing this challenge. Cryptocurrency markets, marked by substantial growth and volatility, attract investors and scholars keen on deciphering and forecasting cryptocurrency price movements. The vast and diverse array of data available for such predictions increases the complexity of the task. In our study, we introduce a novel approach termed hard and soft information fusion (HSIF) to enhance the accuracy of cryptocurrency price movement forecasts. The hard information component of our approach encompasses historical price records alongside technical indicators. Complementing this, the soft data component extracts from X (formerly Twitter), encompassing news headlines and tweets about the cryptocurrency. To use this data, we use the Bidirectional Encoder Representations from Transformers (BERT)-based sentiment analysis method, financial BERT (FinBERT), which performs best. Finally, our model feeds on the information set including processed hard and soft data. We employ the bidirectional long short-term memory (BiLSTM) model because processing information in both forward and backward directions can capture long-term dependencies in sequential information. Our empirical findings emphasize the superiority of the HSIF approach over models dependent on single-source data by testing on Bitcoin-related data. By fusing hard and soft information on Bitcoin dataset, our model has about 96.8% accuracy in predicting price movement. Incorporating information enables our model to grasp the influence of social sentiment on price fluctuations, thereby supplementing the technical analysis-based predictions derived from hard information.

[LG-11] HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models

Link: https://arxiv.org/abs/2409.18893
Authors: Yu Zhou, Xingyu Wu, Jibin Wu, Liang Feng, Kay Chen Tan
Keywords: broader task adaptability, large pretrained models, large pretrained, combines multiple large, Model merging
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Model merging is a technique that combines multiple large pretrained models into a single model with enhanced performance and broader task adaptability. It has gained popularity in large pretrained model development due to its ability to bypass the need for original training data and further training processes. However, most existing model merging approaches focus solely on exploring the parameter space, merging models with identical architectures. Merging within the architecture space, despite its potential, remains in its early stages due to the vast search space and the challenges of layer compatibility. This paper marks a significant advance toward more flexible and comprehensive model merging techniques by modeling the architecture-space merging process as a reinforcement learning task. We train policy and value networks using offline sampling of weight vectors, which are then employed for the online optimization of merging strategies. Moreover, a multi-objective optimization paradigm is introduced to accommodate users’ diverse task preferences, learning the Pareto front of optimal models to offer customized merging suggestions. Experimental results across multiple tasks, including text translation, mathematical reasoning, and code generation, validate the effectiveness and superiority of the proposed framework in model merging. The code will be made publicly available after the review process.

[LG-12] HR-Extreme: A High-Resolution Dataset for Extreme Weather Forecasting

Link: https://arxiv.org/abs/2409.18885
Authors: Nian Ran, Peng Xiao, Yue Wang, Wesley Shi, Jianxin Lin, Qi Meng, Richard Allmendinger
Keywords: Pangu and Fuxi, including higher-resolution forecasting, extended prediction periods, prediction periods exemplified, extreme weather
Subjects: Machine Learning (cs.LG)
*Comments: 10 pages, under review

Abstract:The application of large deep learning models in weather forecasting has led to significant advancements in the field, including higher-resolution forecasting and extended prediction periods exemplified by models such as Pangu and Fuxi. Despite these successes, previous research has largely neglected extreme weather events, and the availability of datasets specifically curated for such events remains limited. Given the critical importance of accurately forecasting extreme weather, this study introduces a comprehensive dataset that incorporates high-resolution extreme weather cases derived from the High-Resolution Rapid Refresh (HRRR) data, a 3-km real-time dataset provided by NOAA. We also evaluate the current state-of-the-art deep learning models and Numerical Weather Prediction (NWP) systems on HR-Extreme, and provide an improved baseline deep learning model called HR-Heim which has superior performance on both general loss and HR-Extreme compared to others. Our results reveal that the errors of extreme weather cases are significantly larger than the overall forecast error, highlighting them as a crucial source of loss in weather prediction. These findings underscore the necessity for future research to focus on improving the accuracy of extreme weather forecasts to enhance their practical utility.

[LG-13] CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting

Link: https://arxiv.org/abs/2409.18874
Authors: Josef Koumar, Karel Hynek, Tomáš Čejka, Pavel Šiška
Keywords: identifying malicious activities, Anomaly detection, malicious activities, Anomaly, crucial for maintaining
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Anomaly detection in network traffic is crucial for maintaining the security of computer networks and identifying malicious activities. One of the primary approaches to anomaly detection is forecasting-based methods. Nevertheless, extensive real-world network datasets for forecasting and anomaly detection techniques are missing, potentially causing performance overestimation of anomaly detection algorithms. This manuscript addresses this gap by introducing a dataset comprising time series data of network entities’ behavior, collected from the CESNET3 network. The dataset was created from 40 weeks of network traffic of 275 thousand active IP addresses. The ISP origin of the presented data ensures a high level of variability among network entities, which forms a unique and authentic challenge for forecasting and anomaly detection models. It provides valuable insights into the practical deployment of forecast-based anomaly detection approaches.

[LG-14] Individuation in Neural Models with and without Visual Grounding

Link: https://arxiv.org/abs/2409.18868
Authors: Alexey Tikhonov, Lisa Bylinina, Ivan P. Yamshchikov
Keywords: FastText and SBERT, CLIP, SBERT, CLIP embeddings, individuation information
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:We show differences between a language-and-vision model CLIP and two text-only models - FastText and SBERT - when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.

[LG-15] Challenges of Generating Structurally Diverse Graphs

Link: https://arxiv.org/abs/2409.18859
Authors: Fedor Velikonivtsev, Mikhail Mironov, Liudmila Prokhorenkova
Keywords: structurally diverse graphs, structurally diverse, diversity measure, diversity, graphs
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:For many graph-related problems, it can be essential to have a set of structurally diverse graphs. For instance, such graphs can be used for testing graph algorithms or their neural approximations. However, to the best of our knowledge, the problem of generating structurally diverse graphs has not been explored in the literature. In this paper, we fill this gap. First, we discuss how to define diversity for a set of graphs, why this task is non-trivial, and how one can choose a proper diversity measure. Then, for a given diversity measure, we propose and compare several algorithms optimizing it: we consider approaches based on standard random graph models, local graph optimization, genetic algorithms, and neural generative models. We show that it is possible to significantly improve diversity over basic random graph generators. Additionally, our analysis of generated graphs allows us to better understand the properties of graph distances: depending on which diversity measure is used for optimization, the obtained graphs may possess very different structural properties which gives insights about the sensitivity of the graph distance underlying the diversity measure.
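As a toy illustration of why the choice of diversity measure matters, the sketch below scores sets of Erdős–Rényi graphs by the average pairwise distance between their sorted degree sequences. Both the distance and the measure are simple stand-ins I chose for illustration, not the ones compared in the paper.

```python
import numpy as np

def random_graph(n, p, rng):
    # Erdos-Renyi G(n, p) adjacency matrix
    upper = rng.random((n, n)) < p
    a = np.triu(upper, 1)
    return (a + a.T).astype(int)

def graph_distance(a1, a2):
    # Toy structural distance: L1 difference of sorted degree sequences
    return float(np.abs(np.sort(a1.sum(axis=0)) - np.sort(a2.sum(axis=0))).sum())

def diversity(graphs):
    # Diversity of a set as the average pairwise distance
    n = len(graphs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(graph_distance(graphs[i], graphs[j]) for i, j in pairs) / len(pairs)

rng = np.random.default_rng(0)
same_density = [random_graph(20, 0.1, rng) for _ in range(5)]
mixed_density = [random_graph(20, p, rng) for p in (0.05, 0.25, 0.5, 0.75, 0.95)]
```

Simply varying the edge probability across the set already yields a much higher score than drawing all graphs from one model, matching the paper's point that basic random generators leave substantial room for diversity optimization.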

[LG-16] Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization

Link: https://arxiv.org/abs/2409.18850
Authors: Vladimír Boža, Vladimír Macko
Keywords: Double Sparse Factorization, Neural networks, Optimal Brain Compression, large size, cs.LG
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Neural networks are often challenging to work with due to their large size and complexity. To address this, various methods aim to reduce model size by sparsifying or decomposing weight matrices, such as magnitude pruning and low-rank or block-diagonal factorization. In this work, we present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices. Although solving this problem exactly is computationally infeasible, we propose an efficient heuristic based on alternating minimization via ADMM that achieves state-of-the-art results, enabling unprecedented sparsification of neural networks. For instance, in a one-shot pruning setting, our method can reduce the size of the LLaMA2-13B model by 50% while maintaining better performance than the dense LLaMA2-7B model. We also compare favorably with Optimal Brain Compression, the state-of-the-art layer-wise pruning approach for convolutional neural networks. Furthermore, accuracy improvements of our method persist even after further model fine-tuning. Code available at: this https URL.
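The following toy sketch conveys the flavor of factorizing a weight matrix into two sparse factors. It uses plain alternating least squares with hard magnitude thresholding rather than the paper's ADMM-based heuristic, and all names and hyperparameters are illustrative.

```python
import numpy as np

def sparsify(m, keep):
    # Keep the `keep` largest-magnitude entries of m, zero the rest
    flat = np.abs(m).ravel()
    if keep >= flat.size:
        return m.copy()
    thresh = np.partition(flat, -keep)[-keep]
    return np.where(np.abs(m) >= thresh, m, 0.0)

def double_sparse_factorize(w, rank, keep_a, keep_b, iters=50):
    # Alternate least-squares fits for each factor, thresholding after
    # every update so both factors stay sparse
    rng = np.random.default_rng(0)
    a = rng.standard_normal((w.shape[0], rank)) * 0.1
    b = rng.standard_normal((rank, w.shape[1])) * 0.1
    for _ in range(iters):
        a = sparsify(w @ np.linalg.pinv(b), keep_a)
        b = sparsify(np.linalg.pinv(a) @ w, keep_b)
    return a, b

w = np.random.default_rng(1).standard_normal((8, 8))
a, b = double_sparse_factorize(w, rank=8, keep_a=48, keep_b=48)
rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
```

Each factor here keeps 48 of 64 entries, and the relative reconstruction error stays well below that of the trivial all-zero factorization; the paper's ADMM formulation makes this alternating scheme far more effective at scale.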

[LG-17] Classification and regression of trajectories rendered as images via 2D Convolutional Neural Networks

Link: https://arxiv.org/abs/2409.18832
Authors: Mariaclaudia Nicolai, Raffaella Fiamma Cabini, Diego Ulisse Pizzagalli
Keywords: typically arising, motile objects, regarded as time-series, arising from motile, Trajectories
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 13 pages, 5 figures

Abstract:Trajectories can be regarded as time-series of coordinates, typically arising from motile objects. Methods for trajectory classification are particularly important to detect different movement patterns, while methods for regression to compute motility metrics and forecasting. Recent advances in computer vision have facilitated the processing of trajectories rendered as images via artificial neural networks with 2d convolutional layers (CNNs). This approach leverages the capability of CNNs to learn spatial hierarchies of features from images, necessary to recognize complex shapes. Moreover, it overcomes the limitation of other machine learning methods that require input trajectories with a fixed number of points. However, rendering trajectories as images can introduce poorly investigated artifacts such as information loss due to the plotting of coordinates on a discrete grid, and spectral changes due to line thickness and aliasing. In this study, we investigate the effectiveness of CNNs for solving classification and regression problems from synthetic trajectories that have been rendered as images using different modalities. The parameters considered in this study include line thickness, image resolution, usage of motion history (color-coding of the temporal component) and anti-aliasing. Results highlight the importance of choosing an appropriate image resolution according to model depth and motion history in applications where movement direction is critical.
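A minimal sketch of the rendering step described above: rasterize a normalized trajectory onto a discrete grid, optionally encoding time as pixel intensity (a grayscale stand-in for the color-coded motion history). Resolution, normalization, and names are my illustrative choices, not the paper's exact pipeline.

```python
import numpy as np

def render_trajectory(points, resolution=32, motion_history=False):
    # Rasterize (x, y) points onto a grid after min-max normalization;
    # with motion_history, later points get higher intensity
    img = np.zeros((resolution, resolution))
    pts = np.asarray(points, dtype=float)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    norm = (pts - mins) / span            # scale each axis to [0, 1]
    for t, (x, y) in enumerate(norm):
        col = min(int(x * (resolution - 1)), resolution - 1)
        row = min(int((1.0 - y) * (resolution - 1)), resolution - 1)
        value = (t + 1) / len(norm) if motion_history else 1.0
        img[row, col] = max(img[row, col], value)
    return img

traj = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]   # three points along a diagonal
img = render_trajectory(traj, resolution=4, motion_history=True)
```

At a 4x4 resolution the three points land on the anti-diagonal, with later points brighter; changing the resolution (or adding line thickness and anti-aliasing) introduces exactly the rendering artifacts the paper investigates.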

[LG-18] ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning

Link: https://arxiv.org/abs/2409.18827
Authors: Jannis Becktepe, Julian Dierkes, Carolin Benjamins, Aditya Mohan, David Salinas, Raghu Rajan, Frank Hutter, Holger Hoos, Marius Lindauer, Theresa Eimer
Keywords: well-performing reinforcement learning, reliably training well-performing, training well-performing reinforcement, reinforcement learning, critical factor
Subjects: Machine Learning (cs.LG)
*Comments: Accepted at the 17th European Workshop on Reinforcement Learning

Abstract:Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at this https URL.

[LG-19] Esports Debut as a Medal Event at 2023 Asian Games: Exploring Public Perceptions with BERTopic and GPT-4 Topic Fine-Tuning

Link: https://arxiv.org/abs/2409.18798
Authors: Tyreal Yizhou Qian, Bo Yu, Weizhe Li, Chenglong Xu
Keywords: Asian Games, BERTopic modeling analysis, LLM-enhanced BERTopic modeling, modeling analysis, LLM-enhanced BERTopic
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:This study examined the public opinions of esports at the 2023 Asian Games and value co-creation during the event using an LLM-enhanced BERTopic modeling analysis. We identified five major themes representing public perceptions, as well as how major stakeholders co-created value within and beyond the esports ecosystem. Key findings highlighted the strategic use of social media marketing to influence public opinion and promote esports events and brands, emphasizing the importance of event logistics and infrastructure. Additionally, the study revealed the co-creation value contributed by stakeholders outside the traditional esports ecosystem, particularly in promoting national representation and performance. Our findings supported the ongoing efforts to legitimize esports as a sport, noting that mainstream recognition remains a challenge. The inclusion of esports as a medal event showcased broader acceptance and helped mitigate negative public perceptions. Moreover, contributions from non-traditional stakeholders underscored the value of cross-subcultural collaborations in esports.

[LG-20] Supervised Learning Model for Key Frame Identification from Cow Teat Videos

Link: https://arxiv.org/abs/2409.18797
Authors: Minghao Wang, Pinxue Lin
Keywords: cow teat, mastitis risk assessment, proposes a method, method for improving, cow
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments:

Abstract:This paper proposes a method for improving the accuracy of mastitis risk assessment in cows using neural networks and video analysis. Mastitis, an infection of the udder tissue, is a critical health problem for cows and can be detected by examining the cow’s teat. Traditionally, veterinarians assess the health of a cow’s teat during the milking process, but this process is limited in time and can weaken the accuracy of the assessment. In commercial farms, cows are recorded by cameras when they are milked in the milking parlor. This paper uses a neural network to identify key frames in the recorded video where the cow’s udder appears intact. These key frames allow veterinarians to have more flexible time to perform health assessments on the teat, increasing their efficiency and accuracy. However, there are challenges in using cow teat video for mastitis risk assessment, such as complex environments, changing cow positions and postures, and difficulty in identifying the udder from the video. To address these challenges, a fusion distance and an ensemble model are proposed to improve the performance (F-score) of identifying key frames from cow teat videos. The results show that these two approaches improve performance compared to using a single distance measure or model.

[LG-21] Hierarchical Federated ADMM

Link: https://arxiv.org/abs/2409.18796
Authors: Seyed Mohammad Azimi-Abarghouyi, Nicola Bastianello, Karl H. Johansson, Viktoria Fodor
Keywords: alternating direction method, descent-based hierarchical federated, hierarchical federated learning, gradient descent-based hierarchical, widely-used gradient descent-based
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Systems and Control (eess.SY)
*Comments:

Abstract:In this paper, we depart from the widely-used gradient descent-based hierarchical federated learning (FL) algorithms to develop a novel hierarchical FL framework based on the alternating direction method of multipliers (ADMM). Within this framework, we propose two novel FL algorithms, which both use ADMM in the top layer: one that employs ADMM in the lower layer and another that uses the conventional gradient descent-based approach. The proposed framework enhances privacy, and experiments demonstrate the superiority of the proposed algorithms compared to the conventional algorithms in terms of learning convergence and accuracy. Additionally, gradient descent on the lower layer performs well even if the number of local steps is very limited, while ADMM on both layers leads to better performance otherwise.
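For readers unfamiliar with ADMM in the federated/consensus setting, here is a minimal scalar example: each "client" holds a quadratic objective, local primal and dual variables are updated, and a server step aggregates them. This is textbook consensus ADMM under my own naming, not the paper's hierarchical algorithm.

```python
import numpy as np

def consensus_admm(local_targets, rho=1.0, iters=100):
    # Each client i minimizes (x_i - t_i)^2 subject to x_i = z (consensus);
    # ADMM alternates local primal updates, a global z-update, and dual ascent
    t = np.asarray(local_targets, dtype=float)
    x = np.zeros_like(t)      # local primal variables
    lam = np.zeros_like(t)    # local dual variables
    z = 0.0                   # global consensus variable
    for _ in range(iters):
        # Local step: argmin_x (x - t_i)^2 + lam_i (x - z) + (rho/2)(x - z)^2
        x = (2.0 * t - lam + rho * z) / (2.0 + rho)
        z = float(np.mean(x + lam / rho))   # server aggregation step
        lam = lam + rho * (x - z)           # dual update
    return z

z_star = consensus_admm([1.0, 2.0, 6.0])   # minimizer of the sum is the mean, 3.0
```

The consensus variable converges to the minimizer of the summed objectives (here the mean of the targets); the paper's hierarchical variants stack such aggregation steps across layers, with ADMM always at the top.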

[LG-22] HardCore Generation: Generating Hard UNSAT Problems for Data Augmentation

Link: https://arxiv.org/abs/2409.18778
Authors: Joseph Cotnareanu, Zhanguang Zhang, Hui-Ling Zhen, Yingxue Zhang, Mark Coates
Keywords: boolean equation, determining the satisfiability, SAT problems, SAT, deep learning methods
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Efficiently determining the satisfiability of a boolean equation – known as the SAT problem for brevity – is crucial in various industrial problems. Recently, the advent of deep learning methods has introduced significant potential for enhancing SAT solving. However, a major barrier to the advancement of this field has been the scarcity of large, realistic datasets. The majority of current public datasets are either randomly generated or extremely limited, containing only a few examples from unrelated problem families. These datasets are inadequate for meaningful training of deep learning methods. In light of this, researchers have started exploring generative techniques to create data that more accurately reflect SAT problems encountered in practical situations. These methods have so far suffered from either the inability to produce challenging SAT problems or time-scalability obstacles. In this paper we address both by identifying and manipulating the key contributors to a problem’s “hardness”, known as cores. Although some previous work has addressed cores, the time costs are unacceptably high due to the expense of traditional heuristic core detection techniques. We introduce a fast core detection procedure that uses a graph neural network. Our empirical results demonstrate that we can efficiently generate problems that remain hard to solve and retain key attributes of the original example problems. We show via experiment that the generated synthetic SAT problems can be used in a data augmentation setting to provide improved prediction of solver runtimes.

[LG-23] A method of using RSVD in residual calculation of LowBit GEMM

Link: https://arxiv.org/abs/2409.18772
Authors: Hongyaoxing Gu
Keywords: low-precision applications, advancements of hardware, hardware technology, technology in recent, recent years
Subjects: Mathematical Software (cs.MS); Machine Learning (cs.LG)
*Comments:

Abstract:The advancements of hardware technology in recent years have brought many possibilities for low-precision applications. However, the use of low precision can introduce significant computational errors, posing a considerable challenge to maintaining computational accuracy. We propose the low-rank residuals quantized matrix multiplication (LRQMM) method, which introduces low-rank approximation into residual compensation for dense low-precision quantized matrix multiplication. It can bring several times accuracy improvement with only BLAS-2 level extra time overhead. Moreover, LRQMM is a completely data-free quantization method that does not require additional data for pre-training, and it relies only on a low-precision GEMM operator, making it easy to couple with other methods. Through experimentation, LRQMM can reduce the error of direct quantized matrix multiplication by 1~2 orders of magnitude; when dealing with larger matrix sizes, the computational speed is only reduced by approximately 20%. In deep learning networks, LRQMM-4bit achieves 61.8% ImageNet Top-1 accuracy with ResNet-50, while the Direct Quant accuracy is only 8.3%.
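The sketch below illustrates the residual-compensation idea on a small example: do the product in low-bit integers, then correct it with low-rank approximations of the quantization residuals. This is my reading of the abstract, with a full truncated SVD standing in for RSVD; names and defaults are illustrative.

```python
import numpy as np

def quantize(m, bits=4):
    # Symmetric uniform quantization: the max magnitude maps to the
    # largest representable signed integer
    scale = np.abs(m).max() / (2 ** (bits - 1) - 1)
    return np.round(m / scale), scale

def rank_r(m, r):
    # Best rank-r approximation via truncated SVD (the paper uses RSVD;
    # full SVD is used here for clarity)
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

def lrqmm(a, b, bits=4, rank=2):
    qa, sa = quantize(a, bits)
    qb, sb = quantize(b, bits)
    low_bit = (qa @ qb) * (sa * sb)        # the cheap low-bit GEMM
    ra, rb = a - qa * sa, b - qb * sb      # quantization residuals
    ra_r, rb_r = rank_r(ra, rank), rank_r(rb, rank)
    # Exact identity: a@b = low_bit + ra@b + a@rb - ra@rb; truncating the
    # residuals to low rank keeps the correction cheap
    return low_bit + ra_r @ b + a @ rb_r - ra_r @ rb_r

rng = np.random.default_rng(0)
a, b = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
```

With the residuals kept at full rank the identity is exact, and with `rank=0` the function degenerates to the plain quantized product, so the rank knob trades accuracy against extra cost.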

[LG-24] Learning from Demonstration with Implicit Nonlinear Dynamics Models

Link: https://arxiv.org/abs/2409.18768
Authors: Peter David Fagan, Subramanian Ramamoorthy
Keywords: involving complex motions, solve tasks involving, tasks involving complex, Learning from Demonstration, paradigm for training
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*Comments: 21 pages, 9 figures

Abstract:Learning from Demonstration (LfD) is a useful paradigm for training policies that solve tasks involving complex motions. In practice, the successful application of LfD requires overcoming error accumulation during policy execution, i.e. the problem of drift due to errors compounding over time and the consequent out-of-distribution behaviours. Existing works seek to address this problem through scaling data collection, correcting policy errors with a human-in-the-loop, temporally ensembling policy predictions or through learning the parameters of a dynamical system model. In this work, we propose and validate an alternative approach to overcoming this issue. Inspired by reservoir computing, we develop a novel neural network layer that includes a fixed nonlinear dynamical system with tunable dynamical properties. We validate the efficacy of our neural network layer on the task of reproducing human handwriting motions using the LASA Human Handwriting Dataset. Through empirical experiments we demonstrate that incorporating our layer into existing neural network architectures addresses the issue of compounding errors in LfD. Furthermore, we perform a comparative evaluation against existing approaches including a temporal ensemble of policy predictions and an Echo State Networks (ESNs) implementation. We find that our approach yields greater policy precision and robustness on the handwriting task while also generalising to multiple dynamics regimes and maintaining competitive latency scores.

[LG-25] TensorSocket: Shared Data Loading for Deep Learning Training

Link: https://arxiv.org/abs/2409.18749
Authors: Ties Robroek (IT University of Copenhagen), Neil Kim Nielsen (IT University of Copenhagen), Pınar Tözün (IT University of Copenhagen)
Keywords: Training, Tensorsocket, deep learning models, Data, resource-intensive process
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Abstract:Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on set of parameters (e.g., hyper-parameter tuning), model architecture (e.g., neural architecture search), among other things that yields the highest accuracy. The computational efficiency of these training tasks depends highly on how well we can supply the training process with training data. The repetitive nature of these tasks results in the same data processing pipelines running over and over exacerbating the need for and costs of computational resources. In this paper, we present Tensorsocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. Tensorsocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU, but are held back by lower data-loading throughput on CPU. Tensorsocket achieves this by reducing redundant computations across collocated training processes and leveraging modern GPU-GPU interconnects. We demonstrate the hardware- and pipeline-agnostic nature of Tensorsocket and evaluate it using a variety of training scenarios. Our evaluation shows that Tensorsocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100% , and when utilizing cloud instances, Tensorsocket achieves cost savings of 50% by reducing the hardware resource needs on the CPU side. Furthermore, Tensorsocket outperforms the state-of-the-art solutions for shared data loading such as CoorDL and Joader. It is easier to use, maintain, and deploy, and either achieves higher or matches the throughput of other solutions while requiring less CPU resources. 
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2409.18749 [cs.LG] (or arXiv:2409.18749v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.18749
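The core sharing idea is simple enough to sketch: produce each batch once and fan the same object out to every collocated trainer. The single-process toy below (with hypothetical function names) only illustrates the concept; Tensorsocket itself shares across processes and leverages GPU-GPU interconnects.

```python
import queue

def shared_loader(produce_batch, consumer_queues, num_batches):
    """Produce each batch once and fan it out to all collocated trainers."""
    for i in range(num_batches):
        batch = produce_batch(i)      # expensive CPU-side work happens once
        for q in consumer_queues:
            q.put(batch)              # every consumer receives the same object
    for q in consumer_queues:
        q.put(None)                   # sentinel: end of data

def consume(q):
    """Drain one trainer's queue until the end-of-data sentinel."""
    batches = []
    while (batch := q.get()) is not None:
        batches.append(batch)
    return batches
```

With two consumers, each sees the full stream, but the batches are shared references rather than copies, which is where the redundant-computation savings come from.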

[LG-26] Cottention: Linear Transformers With Cosine Attention

链接: https://arxiv.org/abs/2409.18747
作者: Gabriel Mongaras,Trevor Dohm,Eric C. Larson
关键词-EN: softmax attention, Attention, Cottention, success of transformer-based, transformer-based models
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.
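The linear-memory claim follows from dropping the softmax: without it, the attention product can be re-associated so the (n × n) score matrix is never materialized. A NumPy sketch of the bidirectional case (an illustration of the rearrangement, not the authors' CUDA kernel; the causal/RNN form needs cumulative state instead):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-norm rows so that dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_attention_linear(q, k, v):
    """Bidirectional cosine-similarity attention in linear memory.

    Without softmax, (q_hat @ k_hat.T) @ v can be re-associated as
    q_hat @ (k_hat.T @ v): the intermediate is only (d, d_v),
    independent of the sequence length n.
    """
    q_hat, k_hat = l2_normalize(q), l2_normalize(k)
    state = k_hat.T @ v          # (d, d_v) constant-size summary
    return q_hat @ state         # (n, d_v)

def cosine_attention_quadratic(q, k, v):
    # Reference that materializes the full (n, n) score matrix.
    q_hat, k_hat = l2_normalize(q), l2_normalize(k)
    return (q_hat @ k_hat.T) @ v
```

Both orderings are algebraically identical; only the memory footprint differs.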

[LG-27] MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

链接: https://arxiv.org/abs/2409.18737
作者: Jingyu Song,Xudong Chen,Liupei Lu,Jie Li,Katherine A. Skinner
关键词-EN: autonomous driving systems, maps provide environmental, provide environmental information, safe planning, provide environmental
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:High-definition (HD) maps provide environmental information for autonomous driving systems and are essential for safe planning. While existing methods with single-frame input achieve impressive performance for online vectorized HD map construction, they still struggle with complex scenarios and occlusions. We propose MemFusionMap, a novel temporal fusion model with enhanced temporal reasoning capabilities for online HD map construction. Specifically, we contribute a working memory fusion module that improves the model’s memory capacity to reason across history frames. We also design a novel temporal overlap heatmap to explicitly inform the model about the temporal overlap information and vehicle trajectory in the Bird’s Eye View space. By integrating these two designs, MemFusionMap significantly outperforms existing methods while also maintaining a versatile design for scalability. We conduct extensive evaluation on open-source benchmarks and demonstrate a maximum improvement of 5.4% in mAP over state-of-the-art methods. The code for MemFusionMap will be made open-source upon publication of this paper.

[LG-28] Autoregressive Policy Optimization for Constrained Allocation Tasks NEURIPS2024

链接: https://arxiv.org/abs/2409.18735
作者: David Winkel,Niklas Strauß,Maximilian Bernhard,Zongyue Li,Thomas Seidl,Matthias Schubert
关键词-EN: Allocation tasks represent, Allocation tasks, represent a class, class of problems, limited amount
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Allocation tasks represent a class of problems where a limited amount of resources must be allocated to a set of entities at each time step. Prominent examples of this task include portfolio optimization or distributing computational workloads across servers. Allocation tasks are typically bound by linear constraints describing practical requirements that have to be strictly fulfilled at all times. In portfolio optimization, for example, investors may be obligated to allocate less than 30% of the funds into a certain industrial sector in any investment period. Such constraints restrict the action space of allowed allocations in intricate ways, which makes learning a policy that avoids constraint violations difficult. In this paper, we propose a new method for constrained allocation tasks based on an autoregressive process to sequentially sample allocations for each entity. In addition, we introduce a novel de-biasing mechanism to counter the initial bias caused by sequential sampling. We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark. Our code is available at: this https URL
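The sequential-sampling idea can be sketched directly: sample each entity's share inside the interval that keeps both its own cap and the feasibility of the remainder. A uniform draw stands in for the learned autoregressive policy, and the 30% caps mirror the abstract's portfolio example; both are illustrative.

```python
import numpy as np

def sample_allocation(caps, rng):
    """Sequentially sample w with sum(w) == 1 and w[i] <= caps[i].

    At step i the feasible interval is clipped so the remaining entities'
    caps can still absorb whatever is left over.
    """
    caps = np.asarray(caps, dtype=float)
    n = len(caps)
    w = np.zeros(n)
    remaining = 1.0
    for i in range(n):
        tail_cap = caps[i + 1:].sum()          # most the rest can take
        lo = max(0.0, remaining - tail_cap)    # must leave a feasible remainder
        hi = min(caps[i], remaining)
        w[i] = rng.uniform(lo, hi) if i < n - 1 else remaining
        remaining -= w[i]
    return w
```

By construction every sample satisfies the constraints exactly, which is the point of autoregressive sampling over projecting violated allocations back onto the feasible set.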

[LG-29] Scalable Cross-Entropy Loss for Sequential Recommendations with Large Item Catalogs RECSYS’24

链接: https://arxiv.org/abs/2409.18721
作者: Gleb Mezentsev,Danil Gusak,Ivan Oseledets,Evgeny Frolov
关键词-EN: Scalability issue plays, modern recommender systems, productionizing modern recommender, Scalability issue, recommender systems
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 11 pages, accepted for RecSys’24

点击查看摘要

Abstract:The scalability issue plays a crucial role in productionizing modern recommender systems. Even lightweight architectures may suffer from high computational overload due to intermediate calculations, limiting their practicality in real-world applications. Specifically, applying full Cross-Entropy (CE) loss often yields state-of-the-art performance in terms of recommendation quality. Still, it suffers from excessive GPU memory utilization when dealing with large item catalogs. This paper introduces a novel Scalable Cross-Entropy (SCE) loss function in the sequential learning setup. It approximates the CE loss for datasets with large-size catalogs, enhancing both time efficiency and memory usage without compromising recommendation quality. Unlike traditional negative sampling methods, our approach utilizes a selective GPU-efficient computation strategy, focusing on the most informative elements of the catalog, particularly those most likely to be false positives. This is achieved by approximating the softmax distribution over a subset of the model outputs through the maximum inner product search. Experimental results on multiple datasets demonstrate the effectiveness of SCE in reducing peak memory usage by a factor of up to 100 compared to the alternatives, retaining or even exceeding their metric values. The proposed approach also opens new perspectives for large-scale developments in different domains, such as large language models.
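A minimal sketch of the approximation: keep only the target plus the k highest-scoring items (the likely false positives) in the softmax denominator. Brute-force argsort stands in for the paper's maximum inner product search, and all names are illustrative.

```python
import numpy as np

def scalable_ce_loss(hidden, item_emb, target, k=32):
    """Approximate softmax cross-entropy over a large item catalog.

    The denominator keeps only the target plus the k items with the
    largest inner products (the hardest negatives), instead of the
    full catalog.
    """
    scores = item_emb @ hidden                      # (num_items,)
    top = np.argsort(scores)[-k:]                   # hardest negatives
    keep = np.unique(np.append(top, target))        # always include the target
    kept = scores[keep]
    log_z = np.log(np.exp(kept - kept.max()).sum()) + kept.max()
    return log_z - scores[target]                   # -log p(target | subset)
```

Because the subset denominator never exceeds the full one, the approximation lower-bounds the exact CE loss and matches it when k covers the catalog.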

[LG-30] Enhancing Spectrum Efficiency in 6G Satellite Networks: A GAIL-Powered Policy Learning via Asynchronous Federated Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2409.18718
作者: Sheikh Salman Hassan,Yu Min Park,Yan Kyaw Tun,Walid Saad,Zhu Han,Choong Seon Hong
关键词-EN: remote user equipment, generative adversarial imitation, adversarial imitation learning, optimizing beamforming, user equipment
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Mobile Computing (16 pages, 10 figures)

点击查看摘要

Abstract:In this paper, a novel generative adversarial imitation learning (GAIL)-powered policy learning approach is proposed for optimizing beamforming, spectrum allocation, and remote user equipment (RUE) association in non-terrestrial networks (NTNs). Traditional reinforcement learning (RL) methods for wireless network optimization often rely on manually designed reward functions, which can require extensive parameter tuning. To overcome these limitations, we employ inverse RL (IRL), specifically leveraging the GAIL framework, to automatically learn reward functions without manual design. We augment this framework with an asynchronous federated learning approach, enabling decentralized multi-satellite systems to collaboratively derive optimal policies. The proposed method aims to maximize spectrum efficiency (SE) while meeting minimum information rate requirements for RUEs. To address the non-convex, NP-hard nature of this problem, we combine the many-to-one matching theory with a multi-agent asynchronous federated IRL (MA-AFIRL) framework. This allows agents to learn through asynchronous environmental interactions, improving training efficiency and scalability. The expert policy is generated using the Whale optimization algorithm (WOA), providing data to train the automatic reward function within GAIL. Simulation results show that the proposed MA-AFIRL method outperforms traditional RL approaches, achieving a 14.6% improvement in convergence and reward value. The GAIL-driven policy learning establishes a new benchmark for 6G NTN optimization.

[LG-31] Rethinking the Power of Timestamps for Robust Time Series Forecasting: A Global-Local Fusion Perspective NEURIPS2024

链接: https://arxiv.org/abs/2409.18696
作者: Chengsen Wang,Qi Qi,Jingyu Wang,Haifeng Sun,Zirui Zhuang,Jinming Wu,Jianxin Liao
关键词-EN: including finance, played a pivotal, pivotal role, Time series forecasting, Time series
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Time series forecasting has played a pivotal role across various industries, including finance, transportation, energy, healthcare, and climate. Due to the abundant seasonal information they contain, timestamps possess the potential to offer robust global guidance for forecasting techniques. However, existing works primarily focus on local observations, with timestamps being treated merely as an optional supplement that remains underutilized. When data gathered from the real world is polluted, the absence of global information will damage the robust prediction capability of these algorithms. To address these problems, we propose a novel framework named GLAFF. Within this framework, the timestamps are modeled individually to capture the global dependencies. Working as a plugin, GLAFF adaptively adjusts the combined weights for global and local information, enabling seamless collaboration with any time series forecasting backbone. Extensive experiments conducted on nine real-world datasets demonstrate that GLAFF significantly enhances the average performance of widely used mainstream forecasting models by 12.5%, surpassing the previous state-of-the-art method by 5.5%.
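The plugin's adaptive weighting can be caricatured in a few lines: blend the timestamp-driven global forecast with the backbone's local forecast, weighting each branch by its recent reliability. The softmax-over-errors rule below is a hand-rolled stand-in for GLAFF's learned weights, and all names are illustrative.

```python
import numpy as np

def fuse_global_local(global_pred, local_pred, global_err, local_err, tau=1.0):
    """Blend a timestamp-based global forecast with a backbone's local one.

    Weights are a softmax over the negative recent errors of the two
    branches, so the more reliable branch dominates when the other is
    degraded (e.g., by polluted local observations).
    """
    logits = -np.array([global_err, local_err]) / tau
    w = np.exp(logits - logits.max())
    w /= w.sum()
    fused = w[0] * np.asarray(global_pred) + w[1] * np.asarray(local_pred)
    return fused, w
```

When the local branch's recent error blows up, almost all of the weight shifts to the timestamp-driven global branch, which is the robustness behaviour the abstract describes.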

[LG-32] Understanding the Benefits of SimCLR Pre-Training in Two-Layer Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.18685
作者: Han Zhang,Yuan Cao
关键词-EN: popular contrastive learning, vision tasks, popular contrastive, SimCLR, neural network
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 65 pages, 4 figures

点击查看摘要

Abstract:SimCLR is one of the most popular contrastive learning methods for vision tasks. It pre-trains deep neural networks based on a large amount of unlabeled data by teaching the model to distinguish between positive and negative pairs of augmented images. It is believed that SimCLR can pre-train a deep neural network to learn efficient representations that can lead to a better performance of future supervised fine-tuning. Despite its effectiveness, our theoretical understanding of the underlying mechanisms of SimCLR is still limited. In this paper, we theoretically introduce a case study of the SimCLR method. Specifically, we consider training a two-layer convolutional neural network (CNN) to learn a toy image data model. We show that, under certain conditions on the number of labeled data, SimCLR pre-training combined with supervised fine-tuning achieves almost optimal test loss. Notably, the label complexity for SimCLR pre-training is far less demanding compared to direct training on supervised data. Our analysis sheds light on the benefits of SimCLR in learning with fewer labels.
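For reference, the contrastive objective the analysis studies is SimCLR's NT-Xent loss, which a few lines of NumPy reproduce (batch construction and augmentations omitted):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR's NT-Xent contrastive loss in plain NumPy.

    z1[i] and z2[i] embed two augmented views of image i; each positive
    pair is contrasted against the other 2N - 2 views in the batch.
    """
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # a view never pairs with itself
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    row_max = sim.max(axis=1, keepdims=True)           # stable log-sum-exp
    log_denom = np.log(np.exp(sim - row_max).sum(axis=1)) + row_max[:, 0]
    return float(-(sim[np.arange(2 * n), pos] - log_denom).mean())
```

The loss approaches zero exactly when positive pairs coincide and are far from every negative, which is the representation-learning behaviour the paper analyzes for a two-layer CNN.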

[LG-33] How green is continual learning really? Analyzing the energy consumption in continual training of vision foundation models ECCV2024

链接: https://arxiv.org/abs/2409.18664
作者: Tomaso Trinci,Simone Magistri,Roberto Verdecchia,Andrew D. Bagdanov
关键词-EN: continual learning algorithms, continual learning, longer negligible, ever-growing adoption, learning
类目: Machine Learning (cs.LG)
*备注: This manuscript has been accepted at the Green FOundation MOdels (GreenFOMO) ECCV 2024 Workshop

点击查看摘要

Abstract:With the ever-growing adoption of AI, its impact on the environment is no longer negligible. Despite the potential that continual learning could have towards Green AI, its environmental sustainability remains relatively uncharted. In this work we aim to gain a systematic understanding of the energy efficiency of continual learning algorithms. To that end, we conducted an extensive set of empirical experiments comparing the energy consumption of recent representation-, prompt-, and exemplar-based continual learning algorithms and two standard baselines (fine-tuning and joint training) when used to continually adapt a pre-trained ViT-B/16 foundation model. We performed our experiments on three standard datasets: CIFAR-100, ImageNet-R, and DomainNet. Additionally, we propose a novel metric, the Energy NetScore, which we use to measure the algorithm efficiency in terms of the energy-accuracy trade-off. Through numerous evaluations varying the number and size of the incremental learning steps, our experiments demonstrate that different types of continual learning algorithms have very different impacts on energy consumption during both training and inference. Although often overlooked in the continual learning literature, we found that the energy consumed during the inference phase is crucial for evaluating the environmental sustainability of continual learning models.

[LG-34] Entropy concentration and learning: a statistical mechanics primer

链接: https://arxiv.org/abs/2409.18630
作者: Akshay Balsubramani
关键词-EN:
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-35] Unsupervised Cognition

链接: https://arxiv.org/abs/2409.18624
作者: Alfredo Ibias,Hector Antona,Guillem Ramirez-Miranda,Enric Guinovart,Eduard Alarcon
关键词-EN: Unsupervised learning methods, Unsupervised learning, soft inspiration, learning methods, Unsupervised
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised learning methods have a soft inspiration in cognition models. To this day, the most successful unsupervised learning methods revolve around clustering samples in a mathematical space. In this paper we propose a state-of-the-art primitive-based unsupervised learning approach for decision-making inspired by novel cognition models. This representation-centric approach models the input space constructively as a distributed hierarchical structure in an input-agnostic way. We compared our approach with current state-of-the-art in unsupervised learning classification, and with current state-of-the-art in cancer type classification. We show how our proposal outperforms previous state-of-the-art. We also evaluate some cognition-like properties of our proposal where it not only outperforms the compared algorithms (even supervised learning ones), but it also shows a different, more cognition-like, behaviour.

[LG-36] Differentially Private Non Parametric Copulas: Generating synthetic data with non parametric copulas under privacy guarantees

链接: https://arxiv.org/abs/2409.18611
作者: Pablo A. Osorio-Marulanda,John Esteban Castro Ramirez,Mikel Hernández Jiménez,Nicolas Moreno Reyes,Gorka Epelde Unanue
关键词-EN: diverse scientific fields, Enhanced Fourier Perturbation, brings important privacy, important privacy considerations, synthetic data
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 12 pages, 5 figures, deciding 2025 conference to which to submit

点击查看摘要

Abstract:Creation of synthetic data models has represented a significant advancement across diverse scientific fields, but this technology also brings important privacy considerations for users. This work focuses on enhancing a non-parametric copula-based synthetic data generation model, DPNPC, by incorporating Differential Privacy through an Enhanced Fourier Perturbation method. The model generates synthetic data for mixed tabular databases while preserving privacy. We compare DPNPC with three other models (PrivBayes, DP-Copula, and DP-Histogram) across three public datasets, evaluating privacy, utility, and execution time. DPNPC outperforms others in modeling multivariate dependencies, maintaining privacy for small \epsilon values, and reducing training times. However, limitations include the need to assess the model’s performance with different encoding methods and consider additional privacy attacks. Future research should address these areas to enhance privacy-preserving synthetic data generation.

[LG-37] TemporalPaD: a reinforcement-learning framework for temporal feature representation and dimension reduction

链接: https://arxiv.org/abs/2409.18597
作者: Xuechen Mu,Zhenyu Huang,Kewei Li,Haotian Zhang,Xiuli Wang,Yusi Fan,Kai Zhang,Fengfeng Zhou
关键词-EN: Recent advancements, Representation Module, predictive modeling, Policy Module, Module
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Recent advancements in feature representation and dimension reduction have highlighted their crucial role in enhancing the efficacy of predictive modeling. This work introduces TemporalPaD, a novel end-to-end deep learning framework designed for temporal pattern datasets. TemporalPaD integrates reinforcement learning (RL) with neural networks to achieve concurrent feature representation and feature reduction. The framework consists of three cooperative modules: a Policy Module, a Representation Module, and a Classification Module, structured based on the Actor-Critic (AC) framework. The Policy Module, responsible for dimensionality reduction through RL, functions as the actor, while the Representation Module for feature extraction and the Classification Module collectively serve as the critic. We comprehensively evaluate TemporalPaD using 29 UCI datasets, a well-known benchmark for validating feature reduction algorithms, through 10 independent tests and 10-fold cross-validation. Additionally, given that TemporalPaD is specifically designed for time series data, we apply it to a real-world DNA classification problem involving enhancer category and enhancer strength. The results demonstrate that TemporalPaD is an efficient and effective framework for achieving feature reduction, applicable to both structured data and sequence datasets. The source code of the proposed TemporalPaD is freely available as supplementary material to this article and at this http URL.

[LG-38] ASAG2024: A Combined Benchmark for Short Answer Grading

链接: https://arxiv.org/abs/2409.18596
作者: Gérôme Meyer,Philip Breuer,Jonathan Fürst
关键词-EN: Open-ended questions test, Open-ended questions, preferred assessment method, understanding than closed-ended, preferred assessment
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at SIGCSE-Virtual 2024

点击查看摘要

Abstract:Open-ended questions test a more thorough understanding than closed-ended questions and are often a preferred assessment method. However, open-ended questions are tedious to grade and subject to personal bias. Therefore, there have been efforts to speed up the grading process through automation. Short Answer Grading (SAG) systems aim to automatically score students’ answers. Despite growth in SAG methods and capabilities, there exists no comprehensive short-answer grading benchmark across different subjects, grading scales, and distributions. Thus, it is hard to assess the capabilities of current automated grading methods in terms of their generalizability. In this preliminary work, we introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems, combining seven commonly used short-answer grading datasets in a common structure and grading scale. For our benchmark, we evaluate a set of recent SAG methods, revealing that while LLM-based approaches reach new high scores, they still are far from reaching human performance. This opens up avenues for future research on human-machine SAG systems.

[LG-39] “Oh LLM, I’m Asking Thee, Please Give Me a Decision Tree”: Zero-Shot Decision Tree Induction and Embedding with Large Language Models

链接: https://arxiv.org/abs/2409.18594
作者: Ricardo Knauer,Mario Koddenbrock,Raphael Wallsberger,Nicholas M. Brisson,Georg N. Duda,Deborah Falla,David W. Evans,Erik Rodner
关键词-EN: Large language models, Large language, leverage prior knowledge, provide powerful, leverage prior
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) provide powerful means to leverage prior knowledge for predictive modeling when data is limited. In this work, we demonstrate how LLMs can use their compressed world knowledge to generate intrinsically interpretable machine learning models, i.e., decision trees, without any training data. We find that these zero-shot decision trees can surpass data-driven trees on some small-sized tabular datasets and that embeddings derived from these trees perform on par with data-driven tree-based embeddings on average. Our knowledge-driven decision tree induction and embedding approaches therefore serve as strong new baselines for data-driven machine learning methods in the low-data regime.
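A tree returned by such a prompt is just nested rules, so applying it needs no ML library at all. A sketch with a hypothetical LLM-emitted tree (the features and thresholds below are invented for illustration, not taken from the paper):

```python
def predict(tree, x):
    """Walk a decision tree emitted as nested dicts (as an LLM might return).

    Internal nodes hold {"feature", "threshold", "left", "right"};
    leaves hold {"label"}.
    """
    while "label" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["label"]

# Hypothetical zero-shot tree for a toy risk-screening task.
example_tree = {
    "feature": "age", "threshold": 40,
    "left": {"label": "low_risk"},
    "right": {
        "feature": "bmi", "threshold": 30,
        "left": {"label": "medium_risk"},
        "right": {"label": "high_risk"},
    },
}
```

Embeddings in the paper's sense can then be derived from which leaf (or path) each sample lands in.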

[LG-40] Optimistic Games for Combinatorial Bayesian Optimization with Application to Protein Design

链接: https://arxiv.org/abs/2409.18582
作者: Melis Ilayda Bal,Pier Giuseppe Sessa,Mojmir Mutny,Andreas Krause
关键词-EN: Bayesian optimization, optimize black-box, sequential interactions, powerful framework, framework to optimize
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) is a powerful framework to optimize black-box expensive-to-evaluate functions via sequential interactions. In several important problems (e.g. drug discovery, circuit design, neural architecture search, etc.), though, such functions are defined over large combinatorial and unstructured spaces. This makes existing BO algorithms not feasible due to the intractable maximization of the acquisition function over these domains. To address this issue, we propose GameOpt, a novel game-theoretical approach to combinatorial BO. GameOpt establishes a cooperative game between the different optimization variables, and selects points that are game equilibria of an upper confidence bound acquisition function. These are stable configurations from which no variable has an incentive to deviate - analog to local optima in continuous domains. Crucially, this allows us to efficiently break down the complexity of the combinatorial domain into individual decision sets, making GameOpt scalable to large combinatorial spaces. We demonstrate the application of GameOpt to the challenging protein design problem and validate its performance on four real-world protein datasets. Each protein can take up to 20^X possible configurations, where X is the length of a protein, making standard BO methods infeasible. Instead, our approach iteratively selects informative protein configurations and very quickly discovers highly active protein variants compared to other baselines.

[LG-41] Using Deep Autoregressive Models as Causal Inference Engines

链接: https://arxiv.org/abs/2409.18581
作者: Daniel Jiwoong Im,Kevin Zhang,Nakul Verma,Kyunghyun Cho
关键词-EN: primarily handling low-dimensional, handling low-dimensional confounders, limited to primarily, low-dimensional confounders, confounders and singleton
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Existing causal inference (CI) models are limited to primarily handling low-dimensional confounders and singleton actions. We propose an autoregressive (AR) CI framework capable of handling complex confounders and sequential actions common in modern applications. We accomplish this by sequencification, transforming data from an underlying causal diagram into a sequence of tokens. This approach not only enables training with data generated from any DAG but also extends existing CI capabilities to accommodate estimating several statistical quantities using a single model. We can directly predict interventional probabilities, simplifying inference and enhancing outcome prediction accuracy. We demonstrate that an AR model adapted for CI is efficient and effective in various complex applications such as navigating mazes, playing chess endgames, and evaluating the impact of certain keywords on paper acceptance rates.
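Sequencification itself is mechanical: flatten each (confounders, action, outcome) record into name-tagged tokens in causal order so an AR model can learn the conditionals left to right. The field names below are illustrative, not the paper's schema:

```python
def sequencify(record, order=("confounders", "action", "outcome")):
    """Flatten one causal-diagram sample into a token sequence.

    Tokens are tagged with their variable name so an autoregressive model
    can learn p(outcome | confounders, action) left to right.
    """
    tokens = []
    for field in order:
        value = record[field]
        values = value if isinstance(value, (list, tuple)) else [value]
        tokens += [f"{field}={v}" for v in values]
    return tokens
```

Training an AR model on such sequences lets a single model answer several queries, since any prefix defines a conditional distribution over the remaining tokens.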

[LG-42] An Enhanced Federated Prototype Learning Method under Domain Shift

链接: https://arxiv.org/abs/2409.18578
作者: Liang Kuang,Kuangpu Guo,Jian Liang,Jianguo Zhang
关键词-EN: machine learning training, collaborative machine learning, sharing private data, Federated Learning, Federated Prototype Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Federated Learning (FL) allows collaborative machine learning training without sharing private data. Numerous studies have shown that one significant factor affecting the performance of federated learning models is the heterogeneity of data across different clients, especially when the data is sampled from various domains. A recent paper introduces variance-aware dual-level prototype clustering and uses a novel α-sparsity prototype loss, which increases intra-class similarity and reduces inter-class similarity. To ensure that the features converge within specific clusters, we introduce an improved algorithm, Federated Prototype Learning with Convergent Clusters, abbreviated as FedPLCC. To increase inter-class distances, we weight each prototype with the size of the cluster it represents. To reduce intra-class distances, considering that prototypes with larger distances might come from different domains, we select only a certain proportion of prototypes for the loss function calculation. Evaluations on the Digit-5, Office-10, and DomainNet datasets show that our method performs better than existing approaches.
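Both tricks reduce to a few lines: weight each prototype's distance term by its cluster size, and keep only the closest fraction of prototypes in the intra-class pull. The exact loss form below is a sketch, not the paper's:

```python
import numpy as np

def weighted_prototype_pull(feature, prototypes, cluster_sizes, keep_frac=0.5):
    """Intra-class pull term sketched after FedPLCC's two ideas.

    Only the nearest `keep_frac` of class prototypes (those most likely to
    share the feature's domain) enter the loss, and each surviving
    prototype is weighted by the size of the cluster it represents.
    """
    d = np.linalg.norm(prototypes - feature, axis=1)   # distance to each prototype
    k = max(1, int(len(prototypes) * keep_frac))
    keep = np.argsort(d)[:k]                           # nearest prototypes only
    w = np.asarray(cluster_sizes, float)[keep]
    return float((w * d[keep] ** 2).sum() / w.sum())   # size-weighted mean sq. distance
```

Dropping far-away prototypes keeps cross-domain prototypes of the same class from dragging features toward the wrong cluster.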

[LG-43] Climate Adaptation with Reinforcement Learning: Experiments with Flooding and Transportation in Copenhagen

链接: https://arxiv.org/abs/2409.18574
作者: Miguel Costa,Morten W. Petersen,Arthur Vandervoort,Martin Drews,Karyn Morrissey,Francisco C. Pereira
关键词-EN: extreme rainfall events, frequency and intensity, intensity of extreme, expected to increase, Due to climate
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to climate change the frequency and intensity of extreme rainfall events, which contribute to urban flooding, are expected to increase in many places. These floods can damage transport infrastructure and disrupt mobility, highlighting the need for cities to adapt to escalating risks. Reinforcement learning (RL) serves as a powerful tool for uncovering optimal adaptation strategies, determining how and where to deploy adaptation measures effectively, even under significant uncertainty. In this study, we leverage RL to identify the most effective timing and locations for implementing measures, aiming to reduce both direct and indirect impacts of flooding. Our framework integrates climate change projections of future rainfall events and floods, models city-wide motorized trips, and quantifies direct and indirect impacts on infrastructure and mobility. Preliminary results suggest that our RL-based approach can significantly enhance decision-making by prioritizing interventions in specific urban areas and identifying the optimal periods for their implementation.

[LG-44] Towards an active-learning approach to resource allocation for population-based damage prognosis

链接: https://arxiv.org/abs/2409.18572
作者: George Tsialiamanis,Keith Worden,Nikolaos Dervilis,Aidan J Hughes
关键词-EN: structural health monitoring, structural health, current work, Damage prognosis, population-based SHM
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Damage prognosis is, arguably, one of the most difficult tasks of structural health monitoring (SHM). To address common problems of damage prognosis, a population-based SHM (PBSHM) approach is adopted in the current work. In this approach the prognosis problem is considered as an information-sharing problem where data from past structures are exploited to make more accurate inferences regarding currently-degrading structures. For a given population, there may exist restrictions on the resources available to conduct monitoring; thus, the current work studies the problem of allocating such resources within a population of degrading structures with a view to maximising the damage-prognosis accuracy. The challenges of the current framework are mainly associated with the inference of outliers on the level of damage evolution, given partial data from the damage-evolution phenomenon. The current approach considers an initial population of structures for which damage evolution is extensively observed. Subsequently, a second population of structures with evolving damage is considered for which two monitoring systems are available, a low-availability and high-fidelity (low-uncertainty) one, and a widely-available and low-fidelity (high-uncertainty) one. The task of the current work is to follow an active-learning approach to identify the structures to which the high-fidelity system should be assigned in order to enhance the predictive capabilities of the machine-learning model throughout the population.
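The resource-allocation step reduces to a selection rule: give the scarce high-fidelity system to the structures whose damage-evolution predictions are currently most uncertain. Summarizing each structure's prediction by a standard deviation is an illustrative simplification of the paper's model:

```python
import numpy as np

def assign_high_fidelity(pred_stds, budget):
    """Pick which structures get the scarce high-fidelity monitoring system.

    Following the active-learning idea, the budget goes to the structures
    whose damage-evolution predictions have the largest predictive
    standard deviation.
    """
    order = np.argsort(np.asarray(pred_stds))[::-1]   # most uncertain first
    return sorted(order[:budget].tolist())
```

In a sequential setting the assignment would be recomputed after each monitoring round, as the high-fidelity data shrinks the uncertainty of the instrumented structures.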

[LG-45] Experimental Evaluation of Machine Learning Models for Goal-oriented Customer Service Chatbot with Pipeline Architecture

链接: https://arxiv.org/abs/2409.18568
作者: Nurul Ain Nabilah Mohd Isa,Siti Nuraishah Agos Jawaddi,Azlan Ismail
关键词-EN: Integrating machine learning, Integrating machine, ultimately improving service, customer service chatbots, service chatbots enhances
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Integrating machine learning (ML) into customer service chatbots enhances their ability to understand and respond to user queries, ultimately improving service performance. However, they may appear artificial to some users and affecting customer experience. Hence, meticulous evaluation of ML models for each pipeline component is crucial for optimizing performance, though differences in functionalities can lead to unfair comparisons. In this paper, we present a tailored experimental evaluation approach for goal-oriented customer service chatbots with pipeline architecture, focusing on three key components: Natural Language Understanding (NLU), dialogue management (DM), and Natural Language Generation (NLG). Our methodology emphasizes individual assessment to determine optimal ML models. Specifically, we focus on optimizing hyperparameters and evaluating candidate models for NLU (utilizing BERT and LSTM), DM (employing DQN and DDQN), and NLG (leveraging GPT-2 and DialoGPT). The results show that for the NLU component, BERT excelled in intent detection whereas LSTM was superior for slot filling. For the DM component, the DDQN model outperformed DQN by achieving fewer turns, higher rewards, as well as greater success rates. For NLG, the large language model GPT-2 surpassed DialoGPT in BLEU, METEOR, and ROUGE metrics. These findings aim to provide a benchmark for future research in developing and optimizing customer service chatbots, offering valuable insights into model performance and optimal hyperparameters.

[LG-46] Optimizing DNN Inference on Multi-Accelerator SoCs at Training-time

链接: https://arxiv.org/abs/2409.18566
作者: Matteo Risso,Alessio Burrello,Daniele Jahier Pagliari
关键词-EN: executing Deep Neural, Deep Neural Networks, specialized computing units, executing Deep, incorporate multiple specialized
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:The demand for executing Deep Neural Networks (DNNs) with low latency and minimal power consumption at the edge has led to the development of advanced heterogeneous Systems-on-Chips (SoCs) that incorporate multiple specialized computing units (CUs), such as accelerators. Offloading DNN computations to a specific CU from the available set often exposes accuracy vs efficiency trade-offs, due to differences in their supported operations (e.g., standard vs. depthwise convolution) or data representations (e.g., more/less aggressively quantized). A challenging yet unresolved issue is how to map a DNN onto these multi-CU systems to maximally exploit the parallelization possibilities while taking accuracy into account. To address this problem, we present ODiMO, a hardware-aware tool that efficiently explores fine-grain mapping of DNNs among various on-chip CUs, during the training phase. ODiMO strategically splits individual layers of the neural network and executes them in parallel on the multiple available CUs, aiming to balance the total inference energy consumption or latency with the resulting accuracy, impacted by the unique features of the different hardware units. We test our approach on CIFAR-10, CIFAR-100, and ImageNet, targeting two open-source heterogeneous SoCs, i.e., DIANA and Darkside. We obtain a rich collection of Pareto-optimal networks in the accuracy vs. energy or latency space. We show that ODiMO reduces the latency of a DNN executed on the Darkside SoC by up to 8x at iso-accuracy, compared to manual heuristic mappings. When targeting energy, on the same SoC, ODiMO produced up to 50.8x more efficient mappings, with minimal accuracy drop (< 0.3%).

[LG-47] CodeSCAN: ScreenCast ANalysis for Video Programming Tutorials

链接: https://arxiv.org/abs/2409.18556
作者: Alexander Naumann,Felix Hertlein,Jacqueline Höllig,Lucas Cazzonelli,Steffen Thoma
关键词-EN: coding screencasts play, serving both novices, experienced developers, play a crucial, crucial role
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving both novices and experienced developers. However, the video format of these tutorials presents a challenge due to the difficulty of searching for and within videos. Addressing the absence of large-scale and diverse datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual Studio Code environment during development, featuring 24 programming languages, 25 fonts, and over 90 distinct themes, in addition to diverse layout changes and realistic user interactions. Moreover, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of Integrated Development Environment (IDE) element detection, color-to-black-and-white conversion, and Optical Character Recognition (OCR). We hope that our contributions facilitate more research in coding screencast analysis, and we make the source code for creating the dataset and the benchmark publicly available on this website.

[LG-48] Efficient Noise Mitigation for Enhancing Inference Accuracy in DNNs on Mixed-Signal Accelerators

链接: https://arxiv.org/abs/2409.18553
作者: Seyedarmin Azizi,Mohammad Erfan Sadeghi,Mehdi Kamal,Massoud Pedram
关键词-EN: analog computing components, analog neural networks, analog computing, framework to enhance, mitigating the effects
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a framework to enhance the robustness of the neural models by mitigating the effects of process-induced and aging-related variations of analog computing components on the accuracy of the analog neural networks. We model these variations as the noise affecting the precision of the activations and introduce a denoising block inserted between selected layers of a pre-trained model. We demonstrate that training the denoising block significantly increases the model’s robustness against various noise levels. To minimize the overhead associated with adding these blocks, we present an exploration algorithm to identify optimal insertion points for the denoising blocks. Additionally, we propose a specialized architecture to efficiently execute the denoising blocks, which can be integrated into mixed-signal accelerators. We evaluate the effectiveness of our approach using Deep Neural Network (DNN) models trained on the ImageNet and CIFAR-10 datasets. The results show that on average, by accepting 2.03% parameter count overhead, the accuracy drop due to the variations reduces from 31.7% to 1.15%.
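
The idea of inserting a small denoising block between layers can be illustrated with a toy numerical sketch. Everything below (the Gaussian noise model, the shrinkage-style block, and its parameters) is a hypothetical simplification for illustration, not the paper's actual trained architecture:

```python
import random

def noisy_forward(activations, sigma):
    """Simulate analog-induced noise on a layer's activations."""
    return [a + random.gauss(0.0, sigma) for a in activations]

class DenoisingBlock:
    """Toy denoising block: shrinks each noisy activation toward a
    per-channel mean. The 'means' and 'alpha' here are hypothetical
    stand-ins for what the paper's trainable block would learn."""
    def __init__(self, means, alpha=0.8):
        self.means = means
        self.alpha = alpha  # trust placed in the noisy observation

    def __call__(self, noisy):
        return [self.alpha * x + (1 - self.alpha) * m
                for x, m in zip(noisy, self.means)]

random.seed(0)
clean = [1.0, -2.0, 0.5]
noisy = noisy_forward(clean, sigma=0.3)
block = DenoisingBlock(means=clean)  # idealized: means match clean statistics
denoised = block(noisy)

err_noisy = sum((n - c) ** 2 for n, c in zip(noisy, clean))
err_denoised = sum((d - c) ** 2 for d, c in zip(denoised, clean))
print(err_denoised < err_noisy)  # shrinkage reduces the squared error
```

In the paper the block is trained end-to-end and placed at algorithmically chosen insertion points; here the fixed shrinkage factor merely stands in for what training would learn.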

[LG-49] Wasserstein Distance-Weighted Adversarial Network for Cross-Domain Credit Risk Assessment

链接: https://arxiv.org/abs/2409.18544
作者: Mohan Jiang,Jiating Lin,Hongju Ouyang,Jingming Pan,Siyuan Han,Bingyao Liu
关键词-EN: adversarial domain adaptation, Domain Adaptation Network, Weighted Adversarial Domain, Distance Weighted Adversarial, Wasserstein Distance Weighted
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper delves into the application of adversarial domain adaptation (ADA) for enhancing credit risk assessment in financial institutions. It addresses two critical challenges: the cold start problem, where historical lending data is scarce, and the data imbalance issue, where high-risk transactions are underrepresented. The paper introduces an improved ADA framework, the Wasserstein Distance Weighted Adversarial Domain Adaptation Network (WD-WADA), which leverages the Wasserstein distance to align source and target domains effectively. The proposed method includes an innovative weighted strategy to tackle data imbalance, adjusting for both the class distribution and the difficulty level of predictions. The paper demonstrates that WD-WADA not only mitigates the cold start problem but also provides a more accurate measure of domain differences, leading to improved cross-domain credit risk assessment. Extensive experiments on real-world credit datasets validate the model’s effectiveness, showcasing superior performance in cross-domain learning, classification accuracy, and model stability compared to traditional methods.
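
To see why the Wasserstein distance is a natural alignment signal between source and target domains, here is a minimal sketch of the empirical 1-D Wasserstein-1 distance, computed on invented score samples. WD-WADA uses the distance inside an adversarial network with a learned critic, which this sketch does not reproduce:

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-D Wasserstein-1 distance between two equally sized
    samples: the mean absolute difference of the sorted values."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Hypothetical credit-risk scores from a data-rich source market and a
# cold-start target market (illustrative numbers only).
source = [0.2, 0.4, 0.5, 0.7, 0.9]
target = [0.1, 0.3, 0.4, 0.6, 0.8]
print(round(wasserstein_1d(source, target), 3))  # small distance: domains are close
```

A domain-adaptation objective would then push a feature extractor to make this distance small between the two domains' representations.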

[LG-50] Token Caching for Diffusion Transformer Acceleration

链接: https://arxiv.org/abs/2409.18523
作者: Jinming Lou,Wenyang Luo,Yufan Liu,Bing Li,Xinmiao Ding,Weiming Hu,Jiajiong Cao,Yuming Li,Chenguang Ma
关键词-EN: gained substantial interest, generative modeling due, diffusion generative modeling, gained substantial, substantial interest
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their high computational cost, arising from the quadratic computational complexity of attention mechanisms and multi-step inference, presents a significant bottleneck. To address this challenge, we propose TokenCache, a novel post-training acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations among tokens across inference steps. TokenCache specifically addresses three critical questions in the context of diffusion transformers: (1) which tokens should be pruned to eliminate redundancy, (2) which blocks should be targeted for efficient pruning, and (3) at which time steps caching should be applied to balance speed and quality. In response to these challenges, TokenCache introduces a Cache Predictor that assigns importance scores to tokens, enabling selective pruning without compromising model performance. Furthermore, we propose an adaptive block selection strategy to focus on blocks with minimal impact on the network’s output, along with a Two-Phase Round-Robin (TPRR) scheduling policy to optimize caching intervals throughout the denoising process. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers. Our code will be publicly available.
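
The caching mechanism can be sketched independently of the diffusion details: keep per-token outputs from the previous inference step and recompute only the tokens a predictor deems important. The hand-supplied importance scores and the `step_fn` below are placeholders for the paper's Cache Predictor and transformer block:

```python
def cached_step(tokens, importance, cache, k, step_fn):
    """Recompute only the k most 'important' tokens; reuse cached
    outputs for the rest. 'importance' plays the role of the paper's
    Cache Predictor scores (supplied by hand in this sketch)."""
    order = sorted(range(len(tokens)), key=lambda i: -importance[i])
    recompute = set(order[:k])
    out = []
    for i, tok in enumerate(tokens):
        if i in recompute or i not in cache:
            cache[i] = step_fn(tok)  # cache miss or marked important
        out.append(cache[i])
    return out

step_fn = lambda t: t * 2  # stand-in for one transformer block
cache = {}
out1 = cached_step([1, 2, 3, 4], [0.9, 0.1, 0.8, 0.2], cache, k=2, step_fn=step_fn)
# all four tokens computed on the first step (cold cache)
out2 = cached_step([5, 6, 7, 8], [0.9, 0.1, 0.8, 0.2], cache, k=2, step_fn=step_fn)
print(out2)  # tokens 0 and 2 refreshed; tokens 1 and 3 reused from step one
```

Deciding *which* blocks and *which* time steps to cache (the paper's adaptive block selection and TPRR scheduling) is the hard part this toy omits.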

[LG-51] Fairness-aware Multiobjective Evolutionary Learning

链接: https://arxiv.org/abs/2409.18499
作者: Qingquan Zhang,Jialin Liu,Xin Yao
关键词-EN: Multiobjective evolutionary learning, fairer machine learning, Multiobjective evolutionary, machine learning models, training fairer machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:Multiobjective evolutionary learning (MOEL) has demonstrated its advantages of training fairer machine learning models considering a predefined set of conflicting objectives, including accuracy and different fairness measures. Recent works propose to construct a representative subset of fairness measures as optimisation objectives of MOEL throughout model training. However, the determination of a representative measure set relies on dataset, prior knowledge and requires substantial computational costs. What’s more, those representative measures may differ across different model training processes. Instead of using a static predefined set determined before model training, this paper proposes to dynamically and adaptively determine a representative measure set online during model training. The dynamically determined representative set is then used as optimising objectives of the MOEL framework and can vary with time. Extensive experimental results on 12 well-known benchmark datasets demonstrate that our proposed framework achieves outstanding performance compared to state-of-the-art approaches for mitigating unfairness in terms of accuracy as well as 25 fairness measures, although only a few of them were dynamically selected and used as optimisation objectives. The results indicate the importance of setting optimisation objectives dynamically during training.

[LG-52] Treating Brain-inspired Memories as Priors for Diffusion Model to Forecast Multivariate Time Series

链接: https://arxiv.org/abs/2409.18491
作者: Muyao Wang,Wenchao Chen,Zhibin Duan,Bo Chen
关键词-EN: Forecasting Multivariate Time, Multivariate Time Series, Forecasting Multivariate, Time Series, Multivariate Time
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Forecasting Multivariate Time Series (MTS) involves significant challenges in various application domains. One immediate challenge is modeling temporal patterns with the finite length of the input. These temporal patterns usually involve periodic and sudden events that recur across different channels. To better capture temporal patterns, we get inspiration from humans’ memory mechanisms and propose a channel-shared, brain-inspired memory module for MTS. Specifically, brain-inspired memory comprises semantic and episodic memory, where the former is used to capture general patterns, such as periodic events, and the latter is employed to capture special patterns, such as sudden events. Meanwhile, we design corresponding recall and update mechanisms to better utilize these patterns. Furthermore, acknowledging the capacity of diffusion models to leverage memory as a prior, we present a brain-inspired memory-augmented diffusion model. This innovative model retrieves relevant memories for different channels, utilizing them as distinct priors for MTS predictions. This incorporation significantly enhances the accuracy and robustness of predictions. Experimental results on eight datasets consistently validate the superiority of our approach in capturing and leveraging diverse recurrent temporal patterns across different channels.

[LG-53] HSTFL: A Heterogeneous Federated Learning Framework for Misaligned Spatiotemporal Forecasting

链接: https://arxiv.org/abs/2409.18482
作者: Shuowei Cai,Hao Liu
关键词-EN: smart city applications, smart energy management, diverse smart city, indispensable building block, city applications
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Spatiotemporal forecasting has emerged as an indispensable building block of diverse smart city applications, such as intelligent transportation and smart energy management. Recent advancements have uncovered that the performance of spatiotemporal forecasting can be significantly improved by integrating knowledge in geo-distributed time series data from different domains, e.g., enhancing real-estate appraisal with human mobility data; joint taxi and bike demand predictions. While effective, existing approaches assume a centralized data collection and exploitation environment, overlooking the privacy and commercial interest concerns associated with data owned by different parties. In this paper, we investigate multi-party collaborative spatiotemporal forecasting without direct access to multi-source private data. However, this task is challenging due to 1) cross-domain feature heterogeneity and 2) cross-client geographical heterogeneity, where standard horizontal or vertical federated learning is inapplicable. To this end, we propose a Heterogeneous SpatioTemporal Federated Learning (HSTFL) framework to enable multiple clients to collaboratively harness geo-distributed time series data from different domains while preserving privacy. Specifically, we first devise vertical federated spatiotemporal representation learning to locally preserve spatiotemporal dependencies among individual participants and generate effective representations for heterogeneous data. Then we propose a cross-client virtual node alignment block to incorporate cross-client spatiotemporal dependencies via a multi-level knowledge fusion scheme. Extensive privacy analysis and experimental evaluations demonstrate that HSTFL not only effectively resists inference attacks but also provides a significant improvement against various baselines.

[LG-54] Deep Heterogeneous Contrastive Hyper-Graph Learning for In-the-Wild Context-Aware Human Activity Recognition

链接: https://arxiv.org/abs/2409.18481
作者: Wen Ge,Guanyi Mou,Emmanuel O. Agu,Kyumin Lee
关键词-EN: Human Activity Recognition, multi-label classification problem, Human Activity, Activity Recognition, multi-label classification
类目: Machine Learning (cs.LG)
*备注: IMWUT 2023

点击查看摘要

Abstract:Human Activity Recognition (HAR) is a challenging, multi-label classification problem as activities may co-occur and sensor signals corresponding to the same activity may vary in different contexts (e.g., different device placements). This paper proposes a Deep Heterogeneous Contrastive Hyper-Graph Learning (DHC-HGL) framework that captures heterogenous Context-Aware HAR (CA-HAR) hypergraph properties in a message-passing and neighborhood-aggregation fashion. Prior work only explored homogeneous or shallow-node-heterogeneous graphs. DHC-HGL handles heterogeneous CA-HAR data by innovatively 1) Constructing three different types of sub-hypergraphs that are each passed through different custom HyperGraph Convolution (HGC) layers designed to handle edge-heterogeneity and 2) Adopting a contrastive loss function to ensure node-heterogeneity. In rigorous evaluation on two CA-HAR datasets, DHC-HGL significantly outperformed state-of-the-art baselines by 5.8% to 16.7% on Matthews Correlation Coefficient (MCC) and 3.0% to 8.4% on Macro F1 scores. UMAP visualizations of learned CA-HAR node embeddings are also presented to enhance model explainability.

[LG-55] CycleNet: Enhancing Time Series Forecasting through Modeling Periodic Patterns

链接: https://arxiv.org/abs/2409.18479
作者: Shengsheng Lin,Weiwei Lin,Xinyi Hu,Wentai Wu,Ruichao Mo,Haocheng Zhong
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-56] URIEL: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base

链接: https://arxiv.org/abs/2409.18472
作者: Aditya Khan,Mason Shipton,David Anugraha,Kaiyao Duan,Phuong H. Hoang,Eric Khiu,A. Seza Doğruöz,En-Shiun Annie Lee
关键词-EN: base offering geographical, knowledge base offering, offering geographical, languages, knowledge base
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
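
A core usability issue for typological vectors is missing feature values. Below is a minimal sketch of a missing-value-aware distance in the spirit of URIEL+'s customizable distance calculations; the actual lang2vec API and imputation strategies differ, and the feature vectors are invented:

```python
def masked_cosine_distance(u, v):
    """Cosine distance computed only over features present (not None)
    in both language vectors. A toy stand-in for a customizable,
    missing-value-aware typological distance."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return 1.0  # no shared evidence: maximally distant by convention
    dot = sum(a * b for a, b in pairs)
    nu = sum(a * a for a, _ in pairs) ** 0.5
    nv = sum(b * b for _, b in pairs) ** 0.5
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

lang_a = [1, 0, None, 1]   # hypothetical binary typological features
lang_b = [1, 0, 1, 1]
lang_c = [0, 1, 0, None]
print(round(masked_cosine_distance(lang_a, lang_b), 6))  # ~0: identical on shared features
print(masked_cosine_distance(lang_a, lang_c))            # 1.0: disagree on every shared feature
```

Masking, rather than silently treating missing features as zeros, is one way to keep distances comparable across languages with very different feature coverage.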

[LG-57] Fairness without Sensitive Attributes via Knowledge Sharing

链接: https://arxiv.org/abs/2409.18470
作者: Hongliang Ni,Lei Han,Tong Chen,Shazia Sadiq,Gianluca Demartini
关键词-EN: existing methods invariably, methods invariably rely, adjusting explicit sensitive, explicit sensitive attribute, explored previously
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While model fairness improvement has been explored previously, existing methods invariably rely on adjusting explicit sensitive attribute values in order to improve model fairness in downstream tasks. However, we observe a trend in which sensitive demographic information becomes inaccessible as public concerns around data privacy grow. In this paper, we propose a confidence-based hierarchical classifier structure called “Reckoner” for reliable fair model learning under the assumption of missing sensitive attributes. We first present results showing that if the dataset contains biased labels or other hidden biases, classifiers significantly increase the bias gap across different demographic groups in the subset with higher prediction confidence. Inspired by these findings, we devised a dual-model system in which a version of the model initialised with a high-confidence data subset learns from a version of the model initialised with a low-confidence data subset, enabling it to avoid biased predictions. Our experimental results show that Reckoner consistently outperforms state-of-the-art baselines on the COMPAS and New Adult datasets, considering both accuracy and fairness metrics.

[LG-58] A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning

链接: https://arxiv.org/abs/2409.18467
作者: Swadhin Das,Raksha Sharma
关键词-EN: address complex real-world, complex real-world issues, Remote sensing images, Remote sensing, risk management
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Remote sensing images are highly valued for their ability to address complex real-world issues such as risk management, security, and meteorology. However, manually captioning these images is challenging and requires specialized knowledge across various domains. This letter presents an approach for automatically describing (captioning) remote sensing images. We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs. The embeddings generated by TextGCN enhance the decoder’s understanding by capturing the semantic relationships among words at both the sentence and corpus levels. Furthermore, we advance our approach with a comparison-based beam search method to ensure fairness in the search strategy for generating the final caption. We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks. We evaluated our method across three datasets using seven metrics: BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr. The results demonstrate that our approach significantly outperforms other state-of-the-art encoder-decoder methods.
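
For context, plain beam search over a toy next-word scorer is sketched below; the paper's comparison-based variant changes how candidate beams are compared, which this sketch does not attempt to reproduce. The bigram table is invented:

```python
import math

def beam_search(start, next_scores, width, steps):
    """Standard beam search: keep the 'width' highest log-probability
    partial sequences at each step."""
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, lp in beams:
            for tok, p in next_scores(seq).items():
                candidates.append((seq + [tok], lp + math.log(p)))
        beams = sorted(candidates, key=lambda c: -c[1])[:width]
    return beams[0][0]

# Hypothetical bigram model for a two-word caption.
table = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"river": 0.3, "runway": 0.7},
    "the": {"river": 0.9, "runway": 0.1},
}
best = beam_search("<s>", lambda seq: table[seq[-1]], width=2, steps=2)
print(best)  # the globally best path, not the greedy one
```

Note that greedy decoding would commit to "a" and then "runway" on probability alone, while beam search keeps "the" alive long enough to compare "the river" (0.36) against "a runway" (0.42) on full-sequence scores.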

[LG-59] Latent Representation Learning for Multimodal Brain Activity Translation

链接: https://arxiv.org/abs/2409.18462
作者: Arman Afrasiyabi,Dhananjay Bhaskar,Erica L. Busch,Laurent Caplette,Rahul Singh,Guillaume Lajoie,Nicholas B. Turk-Browne,Smita Krishnaswamy
关键词-EN: diverse neuroimaging techniques, employs diverse neuroimaging, offering distinct insights, increased spatial precision, Neuroscience employs diverse
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Neuroscience employs diverse neuroimaging techniques, each offering distinct insights into brain activity, from electrophysiological recordings such as EEG, which have high temporal resolution, to hemodynamic modalities such as fMRI, which have increased spatial precision. However, integrating these heterogeneous data sources remains a challenge, which limits a comprehensive understanding of brain function. We present the Spatiotemporal Alignment of Multimodal Brain Activity (SAMBA) framework, which bridges the spatial and temporal resolution gaps across modalities by learning a unified latent space free of modality-specific biases. SAMBA introduces a novel attention-based wavelet decomposition for spectral filtering of electrophysiological recordings, graph attention networks to model functional connectivity between functional brain units, and recurrent layers to capture temporal autocorrelations in brain signals. We show that the training of SAMBA, aside from achieving translation, also learns a rich representation of brain information processing. We showcase this by classifying external stimuli driving brain activity from the representations learned in the hidden layers of SAMBA, paving the way for broad downstream applications in neuroscience research and clinical contexts.

[LG-60] Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration NEURIPS2024

链接: https://arxiv.org/abs/2409.18461
作者: Mahdi Morafah,Vyacheslav Kungurtsev,Hojin Chang,Chen Chen,Bill Lin
关键词-EN: user data privacy, preserving user data, collaborative machine learning, Federated Learning, data privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Federated Learning has emerged as a promising paradigm for collaborative machine learning, while preserving user data privacy. Despite its potential, standard FL lacks support for diverse heterogeneous device prototypes, which vary significantly in model and dataset sizes – from small IoT devices to large workstations. This limitation is only partially addressed by existing knowledge distillation techniques, which often fail to transfer knowledge effectively across a broad spectrum of device prototypes with varied capabilities. This failure primarily stems from two issues: the dilution of informative logits from more capable devices by those from less capable ones, and the use of a single integrated logits as the distillation target across all devices, which neglects their individual learning capacities and the unique contributions of each. To address these challenges, we introduce TAKFL, a novel KD-based framework that treats the knowledge transfer from each device prototype’s ensemble as a separate task, independently distilling each to preserve its unique contributions and avoid dilution. TAKFL also incorporates a KD-based self-regularization technique to mitigate the issues related to the noisy and unsupervised ensemble distillation process. To integrate the separately distilled knowledge, we introduce an adaptive task arithmetic knowledge integration process, allowing each student model to customize the knowledge integration for optimal performance. Additionally, we present theoretical results demonstrating the effectiveness of task arithmetic in transferring knowledge across heterogeneous devices with varying capacities. Comprehensive evaluations of our method across both CV and NLP tasks demonstrate that TAKFL achieves SOTA results in a variety of datasets and settings, significantly outperforming existing KD-based methods. Code is released at this https URL
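
Task arithmetic itself is easy to illustrate: a task vector is the parameter delta produced by finetuning, and integration adds a weighted combination of such deltas to a base model. The two-parameter "models" and fixed weights below are hypothetical; TAKFL learns the integration weights adaptively per student:

```python
def task_vector(finetuned, base):
    """Task vector: the parameter delta introduced by finetuning."""
    return [f - b for f, b in zip(finetuned, base)]

def integrate(base, vectors, weights):
    """Task-arithmetic integration: add a weighted sum of distilled
    task vectors to the student's base weights. Hand-picked weights
    stand in for TAKFL's adaptive integration."""
    merged = list(base)
    for vec, w in zip(vectors, weights):
        merged = [m + w * d for m, d in zip(merged, vec)]
    return merged

base = [0.0, 1.0]
small_dev = [0.2, 1.1]   # hypothetical weights after small-device distillation
large_dev = [1.0, 0.0]   # hypothetical weights after large-device distillation
vecs = [task_vector(small_dev, base), task_vector(large_dev, base)]
student = integrate(base, vecs, weights=[0.5, 0.5])
print([round(x, 2) for x in student])  # → [0.6, 0.55]
```

Keeping each prototype's knowledge as a separate vector, instead of distilling into one averaged target, is what lets the weights dial each contribution up or down independently.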

[LG-61] Review of Digital Asset Development with Graph Neural Network Unlearning

链接: https://arxiv.org/abs/2409.18455
作者: Zara Lisbon
关键词-EN: rapidly evolving landscape, Graph Neural Networks, robust data privacy, rapidly evolving, evolving landscape
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of digital assets, the imperative for robust data privacy and compliance with regulatory frameworks has intensified. This paper investigates the critical role of Graph Neural Networks (GNNs) in the management of digital assets and introduces innovative unlearning techniques specifically tailored to GNN architectures. We categorize unlearning strategies into two primary classes: data-driven approximation, which manipulates the graph structure to isolate and remove the influence of specific nodes, and model-driven approximation, which modifies the internal parameters and architecture of the GNN itself. By examining recent advancements in these unlearning methodologies, we highlight their applicability in various use cases, including fraud detection, risk assessment, token relationship prediction, and decentralized governance. We discuss the challenges inherent in balancing model performance with the requirements for data unlearning, particularly in the context of real-time financial applications. Furthermore, we propose a hybrid approach that combines the strengths of both unlearning strategies to enhance the efficiency and effectiveness of GNNs in digital asset ecosystems. Ultimately, this paper aims to provide a comprehensive framework for understanding and implementing GNN unlearning techniques, paving the way for secure and compliant deployment of machine learning in the digital asset domain.

[LG-62] Hierarchical Federated Learning with Multi-Timescale Gradient Correction NEURIPS2024

链接: https://arxiv.org/abs/2409.18448
作者: Wenzhi Fang,Dong-Jun Han,Evan Chen,Shiqiang Wang,Christopher G. Brinton
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024

点击查看摘要

[LG-63] Gradient-free Decoder Inversion in Latent Diffusion Models NEURIPS2024

链接: https://arxiv.org/abs/2409.18442
作者: Seongmin Hong,Suh Yoon Jeon,Kyeonghyun Lee,Ernest K. Ryu,Se Young Chun
关键词-EN: denoising diffusion process, diffusion process efficiently, denoising diffusion, diffusion process, process efficiently
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, Accepted to NeurIPS 2024

点击查看摘要

Abstract:In latent diffusion models (LDMs), the denoising diffusion process takes place efficiently in a latent space whose dimension is lower than that of pixel space. A decoder is typically used to transform the representation in latent space to that in pixel space. While a decoder is assumed to have an encoder as an accurate inverse, an exact encoder-decoder pair rarely exists in practice even though applications often require precise inversion of the decoder. Prior works for decoder inversion in LDMs employed gradient descent inspired by inversions of generative adversarial networks. However, gradient-based methods require larger GPU memory and longer computation time for larger latent space. For example, recent video LDMs can generate more than 16 frames, but GPUs with 24 GB memory can only perform gradient-based decoder inversion for 4 frames. Here, we propose an efficient gradient-free decoder inversion for LDMs, which can be applied to diverse latent models. Theoretical convergence property of our proposed inversion has been investigated not only for the forward step method, but also for the inertial Krasnoselskii-Mann (KM) iterations under mild assumption on cocoercivity that is satisfied by recent LDMs. Our proposed gradient-free method with Adam optimizer and learning rate scheduling significantly reduced computation time and memory usage over prior gradient-based methods and enabled efficient computation in applications such as noise-space watermarking while achieving comparable error levels.
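
The fixed-point view behind gradient-free inversion can be shown on a scalar toy decoder: solve decoder(z) = x by averaged (Krasnoselskii-Mann) iteration on T(z) = z - (decoder(z) - x), which needs only forward evaluations of the decoder. The scalar decoder and step size are illustrative; the paper works with high-dimensional latents, inertial KM iterations, and Adam-style scheduling:

```python
import math

def decoder(z):
    """Toy scalar 'decoder' standing in for the LDM decoder.
    Its slope stays in [1, 1.5], so T below is a contraction."""
    return z + 0.5 * math.tanh(z)

def km_invert(x, t=0.5, iters=100):
    """Gradient-free decoder inversion via Krasnoselskii-Mann
    iteration: z <- (1-t) z + t T(z) with T(z) = z - (decoder(z) - x).
    Only forward calls to the decoder are needed, no gradients."""
    z = 0.0
    for _ in range(iters):
        Tz = z - (decoder(z) - x)
        z = (1 - t) * z + t * Tz  # averaged (KM) update
    return z

x = decoder(1.3)          # ground-truth latent is 1.3
z_hat = km_invert(x)
print(abs(z_hat - 1.3) < 1e-6)  # latent recovered without decoder gradients
```

Because no backward pass is stored, memory stays flat in the number of frames, which is the practical point of the method.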

[LG-64] State-free Reinforcement Learning

链接: https://arxiv.org/abs/2409.18439
作者: Mingyu Chen,Aldo Pacchiano,Xuezhou Zhang
关键词-EN: reachable state set, states information, state space, textit, information before interacting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we study the state-free RL problem, where the algorithm does not have the state information before interacting with the environment. Specifically, denoting the reachable state set by S^\Pi := \{ s \mid \max_{\pi \in \Pi} q^{P, \pi}(s) > 0 \}, we design an algorithm which requires no information on the state space S while having a regret that is completely independent of S and only depends on S^\Pi. We view this as a concrete first step towards parameter-free RL, with the goal of designing RL algorithms that require no hyper-parameter tuning.

[LG-65] Multi-agent Reinforcement Learning for Dynamic Dispatching in Material Handling Systems

链接: https://arxiv.org/abs/2409.18435
作者: Xian Yeow Lee,Haiyan Wang,Daisuke Katsumata,Takaharu Matsui,Chetan Gupta
关键词-EN: multi-agent reinforcement learning, material handling systems, diverse industries, material handling, multi-agent reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:This paper proposes a multi-agent reinforcement learning (MARL) approach to learn dynamic dispatching strategies, which is crucial for optimizing throughput in material handling systems across diverse industries. To benchmark our method, we developed a material handling environment that reflects the complexities of an actual system, such as various activities at different locations, physical constraints, and inherent uncertainties. To enhance exploration during learning, we propose a method to integrate domain knowledge in the form of existing dynamic dispatching heuristics. Our experimental results show that our method can outperform heuristics by up to 7.4 percent in terms of median throughput. Additionally, we analyze the effect of different architectures on MARL performance when training multiple agents with different functions. We also demonstrate that the MARL agents’ performance can be further improved by using the first iteration of MARL agents as heuristics to train a second iteration of MARL agents. This work demonstrates the potential of applying MARL to learn effective dynamic dispatching strategies that may be deployed in real-world systems to improve business outcomes.

[LG-66] Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization NEURIPS2024

链接: https://arxiv.org/abs/2409.18433
作者: Mucong Ding,Chenghao Deng,Jocelyn Choo,Zichu Wu,Aakriti Agrawal,Avi Schwarzschild,Tianyi Zhou,Tom Goldstein,John Langford,Anima Anandkumar,Furong Huang
关键词-EN: profile language models, fine-grained difficulty annotations, tasks from easy, easy to hard, hard is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:While generalization over tasks from easy to hard is crucial to profile language models (LLMs), the datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still blank. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with numerical difficulty scores. To systematically estimate problem difficulties, we collect abundant performance data on attempts to each problem by humans in the real world or LLMs on the prominent leaderboard. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at this https URL.
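
The flavor of rating-based difficulty estimation is easy to sketch with a plain Elo update; the benchmark uses the more sophisticated IRT and Glicko-2 systems, and the attempt records and constants below are invented:

```python
def elo_difficulty(attempts, k=32, init=1000.0):
    """Elo-style difficulty estimation from (solver_rating, solved)
    attempt records. Each solve is evidence the problem is easier
    than the current estimate; each failure, that it is harder."""
    diff = init
    for rating, solved in attempts:
        # probability the solver beats the problem at current estimates
        expected_solve = 1.0 / (1.0 + 10 ** ((diff - rating) / 400.0))
        diff -= k * ((1.0 if solved else 0.0) - expected_solve)
    return diff

easy = [(1000, True)] * 20    # every attempt by a 1000-rated solver succeeds
hard = [(1000, False)] * 20   # every attempt fails
print(elo_difficulty(easy) < elo_difficulty(hard))  # difficulty scores separate
```

Running many such attempts (from humans or LLM leaderboards, as in the paper) converges each problem to a numerical difficulty on a common scale, which is what makes the easy-to-hard sorting possible.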

[LG-67] Neural Collaborative Filtering to Detect Anomalies in Human Semantic Trajectories

链接: https://arxiv.org/abs/2409.18427
作者: Yueyang Liu,Lance Kennedy,Hossein Amiri,Andreas Züfle
关键词-EN: including security surveillance, trajectory anomaly detection, trajectory anomaly, anomaly detection, range of applications
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: Accepted for publication in the 1st ACM SIGSPATIAL International Workshop on Geospatial Anomaly Detection (GeoAnomalies’24)

点击查看摘要

Abstract:Human trajectory anomaly detection has become increasingly important across a wide range of applications, including security surveillance and public health. However, existing trajectory anomaly detection methods are primarily focused on vehicle-level traffic, while human-level trajectory anomaly detection remains under-explored. Since human trajectory data is often very sparse, machine learning methods have become the preferred approach for identifying complex patterns. However, concerns regarding potential biases and the robustness of these models have intensified the demand for more transparent and explainable alternatives. In response to these challenges, our research focuses on developing a lightweight anomaly detection model specifically designed to detect anomalies in human trajectories. We propose a Neural Collaborative Filtering approach to model and predict normal mobility. Our method is designed to model users’ daily patterns of life without requiring prior knowledge, thereby enhancing performance in scenarios where data is sparse or incomplete, such as in cold start situations. Our algorithm consists of two main modules. The first is the collaborative filtering module, which applies collaborative filtering to model normal mobility of individual humans to places of interest. The second is the neural module, responsible for interpreting the complex spatio-temporal relationships inherent in human trajectory data. To validate our approach, we conducted extensive experiments using simulated and real-world datasets comparing to numerous state-of-the-art trajectory anomaly detection approaches.
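
The collaborative filtering module can be approximated with plain matrix factorization over user-place visits; the paper pairs it with a neural module for spatio-temporal structure, which this hypothetical sketch omits:

```python
import numpy as np

def train_mf(visits, n_users, n_places, k=8, iters=500, lr=0.1, seed=0):
    """Learn user/place embeddings from (user, place, visited) triples by
    SGD on squared error. A low predicted score for an observed visit
    marks that visit as anomalous for the user."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    P = 0.1 * rng.standard_normal((n_places, k))
    for _ in range(iters):
        for u, p, r in visits:
            err = r - U[u] @ P[p]
            U[u] += lr * err * P[p]
            P[p] += lr * err * U[u]
    return U, P

# Toy pattern of life: user 0 frequents places 0 and 1, never place 2.
visits = [(0, 0, 1.0), (0, 1, 1.0), (0, 2, 0.0),
          (1, 0, 1.0), (1, 1, 0.0), (1, 2, 1.0)]
U, P = train_mf(visits, n_users=2, n_places=3)
anomaly_score = 1.0 - U[0] @ P[2]  # high => place 2 is unusual for user 0
```

Because the embeddings are shared across users, the model can score a cold-start place for one user from other users' visits, which mirrors the sparsity argument in the abstract.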

[LG-68] Dual Cone Gradient Descent for Training Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2409.18426
作者: Youngsik Hwang,Dong-Young Lim
关键词-EN: Physics-informed neural networks, solving partial differential, partial differential equations, Physics-informed neural, combined loss function
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have emerged as a prominent approach for solving partial differential equations (PDEs) by minimizing a combined loss function that incorporates both boundary loss and PDE residual loss. Despite their remarkable empirical performance in various scientific computing tasks, PINNs often fail to generate reasonable solutions, and such pathological behaviors remain difficult to explain and resolve. In this paper, we identify that PINNs can be adversely trained when gradients of each loss function exhibit a significant imbalance in their magnitudes and present a negative inner product value. To address these issues, we propose a novel optimization framework, Dual Cone Gradient Descent (DCGD), which adjusts the direction of the updated gradient to ensure it falls within a dual cone region. This region is defined as a set of vectors where the inner products with both the gradients of the PDE residual loss and the boundary loss are non-negative. Theoretically, we analyze the convergence properties of DCGD algorithms in a non-convex setting. On a variety of benchmark equations, we demonstrate that DCGD outperforms other optimization algorithms in terms of various evaluation metrics. In particular, DCGD achieves superior predictive accuracy and enhances the stability of training for failure modes of PINNs and complex PDEs, compared to existing optimally tuned models. Moreover, DCGD can be further improved by combining it with popular strategies for PINNs, including learning rate annealing and the Neural Tangent Kernel (NTK).
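
The core update can be sketched in a few lines: when the PDE-residual and boundary gradients conflict, project each onto the orthogonal complement of the other (a PCGrad-style construction) so the combined direction lands in the dual cone. The paper's exact DCGD rule may choose a different direction inside the cone; this only illustrates the dual-cone condition:

```python
import numpy as np

def dual_cone_update(g_pde, g_bc):
    """Return an update direction whose inner products with both loss
    gradients are non-negative, i.e. a direction inside the dual cone."""
    if np.dot(g_pde, g_bc) >= 0:
        # No conflict: the plain sum already lies in the dual cone.
        return g_pde + g_bc
    # Conflict: project each gradient onto the orthogonal complement
    # of the other before summing.
    g1 = g_pde - np.dot(g_pde, g_bc) / np.dot(g_bc, g_bc) * g_bc
    g2 = g_bc - np.dot(g_bc, g_pde) / np.dot(g_pde, g_pde) * g_pde
    return g1 + g2

g_pde = np.array([1.0, 0.0])
g_bc = np.array([-0.5, 1.0])   # conflicts with g_pde
d = dual_cone_update(g_pde, g_bc)
```

By Cauchy-Schwarz, each projected gradient has a non-negative inner product with both original gradients, so their sum is guaranteed to satisfy the dual-cone condition.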

[LG-69] A physics-driven sensor placement optimization methodology for temperature field reconstruction

链接: https://arxiv.org/abs/2409.18423
作者: Xu Liu,Wen Yao,Wei Peng,Zhuojia Fu,Zixue Xiang,Xiaoqian Chen
关键词-EN: Perceiving the global, sensor placement optimization, physical systems, grand challenge, design of physical
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Perceiving the global field from sparse sensors has been a grand challenge in the monitoring, analysis, and design of physical systems. In this context, sensor placement optimization is a crucial issue. Most existing works require large and sufficient data to construct data-based criteria, which are intractable in data-free scenarios without numerical and experimental data. To this end, we propose a novel physics-driven sensor placement optimization (PSPO) method for temperature field reconstruction using a physics-based criterion to optimize sensor locations. In our methodological framework, we first derive the theoretical upper and lower bounds of the reconstruction error under noise scenarios by analyzing the optimal solution, proving that error bounds correlate with the condition number determined by sensor locations. Furthermore, the condition number, as the physics-based criterion, is used to optimize sensor locations by the genetic algorithm. Finally, the best sensors are validated by reconstruction models, including non-invasive end-to-end models, non-invasive reduced-order models, and physics-informed models. Experimental results, on both a numerical case and an application case, demonstrate that the PSPO method significantly outperforms random and uniform selection methods, improving the reconstruction accuracy by nearly an order of magnitude. Moreover, the PSPO method can achieve comparable reconstruction accuracy to the existing data-driven placement optimization methods.
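
The physics-based criterion is easy to state in code: score a candidate sensor set by the condition number of the row-restricted measurement matrix. The sketch below swaps the paper's genetic algorithm for a greedy search so it stays short, and the basis matrix is random purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Reduced-order basis of the temperature field (random, illustrative):
# rows = candidate sensor locations, columns = field modes.
Phi = rng.standard_normal((50, 4))

def condition_number(sensors):
    """Physics-based criterion: condition number of the measurement
    matrix restricted to the chosen sensor rows."""
    return np.linalg.cond(Phi[list(sensors), :])

def greedy_place(n_sensors):
    """Greedy stand-in for the paper's genetic-algorithm search: grow the
    sensor set one location at a time, minimising the criterion."""
    chosen = []
    for _ in range(n_sensors):
        candidates = [i for i in range(Phi.shape[0]) if i not in chosen]
        best = min(candidates, key=lambda i: condition_number(chosen + [i]))
        chosen.append(best)
    return chosen

sensors = greedy_place(6)
```

A small condition number keeps the reconstruction error bounds tight under noise, which is exactly why the criterion needs no measured data, only the model basis.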

[LG-70] Robust Network Learning via Inverse Scale Variational Sparsification

链接: https://arxiv.org/abs/2409.18419
作者: Zhiling Zhou,Zirui Liu,Chengming Xu,Yanwei Fu,Xinwei Sun
关键词-EN: made significant strides, including natural corruptions, noise types, including natural, low-resolution artifacts
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages, 7 figures

点击查看摘要

Abstract:While neural networks have made significant strides in many AI tasks, they remain vulnerable to a range of noise types, including natural corruptions, adversarial noise, and low-resolution artifacts. Many existing approaches focus on enhancing robustness against specific noise types, limiting their adaptability to others. Previous studies have addressed general robustness by adopting a spectral perspective, which tends to blur crucial features like texture and object contours. Our proposed solution, however, introduces an inverse scale variational sparsification framework within a time-continuous inverse scale space formulation. This framework progressively learns finer-scale features by discerning variational differences between pixels, ultimately preserving only large-scale features in the smoothed image. Unlike frequency-based methods, our approach not only removes noise by smoothing small-scale features where corruptions often occur but also retains high-contrast details such as textures and object contours. Moreover, our framework offers simplicity and efficiency in implementation. By integrating this algorithm into neural network training, we guide the model to prioritize learning large-scale features. We show the efficacy of our approach through enhanced robustness against various noise types.

[LG-71] A3: Active Adversarial Alignment for Source-Free Domain Adaptation ICML

链接: https://arxiv.org/abs/2409.18418
作者: Chrisantus Eze,Christopher Crick
关键词-EN: Unsupervised domain adaptation, aims to transfer, source-free UDA, Unsupervised domain, transfer knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICMLA 2024

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent works have focused on source-free UDA, where only target data is available. This is challenging as models rely on noisy pseudo-labels and struggle with distribution shifts. We propose Active Adversarial Alignment (A3), a novel framework combining self-supervised learning, adversarial training, and active learning for robust source-free UDA. A3 actively samples informative and diverse data using an acquisition function for training. It adapts models via adversarial losses and consistency regularization, aligning distributions without source data access. A3 advances source-free UDA through its synergistic integration of active and adversarial learning for effective domain alignment and noise reduction.

[LG-72] VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback

链接: https://arxiv.org/abs/2409.18417
作者: Guoxi Zhang,Jiuding Duan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
*备注: 16 pages, 5 figures

点击查看摘要

[LG-73] Embed and Emulate: Contrastive representations for simulation-based inference

链接: https://arxiv.org/abs/2409.18402
作者: Ruoxi Jiang,Peter Y. Lu,Rebecca Willett
关键词-EN: engineering applications rely, applications rely heavily, Scientific modeling, fit physical models, real-world measurements
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Scientific modeling and engineering applications rely heavily on parameter estimation methods to fit physical models and calibrate numerical simulations using real-world measurements. In the absence of analytic statistical models with tractable likelihoods, modern simulation-based inference (SBI) methods first use a numerical simulator to generate a dataset of parameters and simulated outputs. This dataset is then used to approximate the likelihood and estimate the system parameters given observation data. Several SBI methods employ machine learning emulators to accelerate data generation and parameter estimation. However, applying these approaches to high-dimensional physical systems remains challenging due to the cost and complexity of training high-dimensional emulators. This paper introduces Embed and Emulate (EE): a new SBI method based on contrastive learning that efficiently handles high-dimensional data and complex, multimodal parameter posteriors. EE learns a low-dimensional latent embedding of the data (i.e., a summary statistic) and a corresponding fast emulator in the latent space, eliminating the need to run expensive simulations or a high dimensional emulator during inference. We illustrate the theoretical properties of the learned latent space through a synthetic experiment and demonstrate superior performance over existing methods in a realistic, non-identifiable parameter estimation task using the high-dimensional, chaotic Lorenz 96 system.

[LG-74] CurricuLLM: Automatic Task Curricula Design for Learning Complex Robot Skills using Large Language Models ICRA2025

链接: https://arxiv.org/abs/2409.18382
作者: Kanghyun Ryu,Qiayuan Liao,Zhongyu Li,Koushil Sreenath,Negar Mehr
关键词-EN: mechanism in reinforcement, facilitates the achievement, progressively increasing, translating natural language, Step
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:Curriculum learning is a training mechanism in reinforcement learning (RL) that facilitates the achievement of complex policies by progressively increasing the task difficulty during training. However, designing effective curricula for a specific task often requires extensive domain knowledge and human intervention, which limits its applicability across various domains. Our core idea is that large language models (LLMs), with their extensive training on diverse language data and ability to encapsulate world knowledge, present significant potential for efficiently breaking down tasks and decomposing skills across various robotics environments. Additionally, the demonstrated success of LLMs in translating natural language into executable code for RL agents strengthens their role in generating task curricula. In this work, we propose CurricuLLM, which leverages the high-level planning and programming capabilities of LLMs for curriculum design, thereby enhancing the efficient learning of complex target tasks. CurricuLLM consists of: (Step 1) Generating a sequence of subtasks that aid target task learning, in natural language form, (Step 2) Translating the natural language descriptions of subtasks into executable task code, including the reward code and goal distribution code, and (Step 3) Evaluating trained policies based on trajectory rollout and subtask description. We evaluate CurricuLLM in various robotics simulation environments, ranging from manipulation, navigation, and locomotion, to show that CurricuLLM can aid learning complex robot control tasks. In addition, we validate the humanoid locomotion policy learned through CurricuLLM in the real world. The code is provided in this https URL

[LG-75] Discovery and inversion of the viscoelastic wave equation in inhomogeneous media

链接: https://arxiv.org/abs/2409.18370
作者: Su Chen,Yi Ding,Hiroe Miyake,Xiaojun Li
关键词-EN: scientific machine learning, identifying partial differential, partial differential equations, differential equations accurately, machine learning
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:In scientific machine learning, the task of identifying partial differential equations accurately from sparse and noisy data poses a significant challenge. Current sparse regression methods may identify inaccurate equations on sparse and noisy datasets and are not suitable for varying coefficients. To address this issue, we propose a hybrid framework that combines two alternating direction optimization phases: discovery and embedding. The discovery phase employs current well-developed sparse regression techniques to preliminarily identify governing equations from observations. The embedding phase implements a recurrent convolutional neural network (RCNN), enabling efficient processes for the time-space iterations involved in the discretized form of the wave equation. The RCNN model further optimizes the imperfect sparse regression results to obtain more accurate functional terms and coefficients. Through alternating updates of the discovery and embedding phases, essential physical equations can be robustly identified from noisy and low-resolution measurements. To assess the performance of the proposed framework, numerical experiments are conducted on various scenarios involving the wave equation in elastic/viscoelastic and homogeneous/inhomogeneous media. The results demonstrate that the proposed method exhibits excellent robustness and accuracy, even when faced with high levels of noise and limited data availability in both spatial and temporal domains.

[LG-76] Defect Prediction with Content-based Features

链接: https://arxiv.org/abs/2409.18365
作者: Hung Viet Pham,Tung Thanh Nguyen
关键词-EN: Traditional defect prediction, Traditional defect, design or implementing, number of lines, source code
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional defect prediction approaches often use metrics that measure the complexity of the design or implementation code of a software system, such as the number of lines of code in a source file. In this paper, we explore a different approach based on the content of source code. Our key assumption is that the source code of a software system contains information about its technical aspects and those aspects might have different levels of defect-proneness. Thus, content-based features such as words, topics, data types, and package names extracted from a source code file could be used to predict its defects. We have performed an extensive empirical evaluation and found that: i) such content-based features have higher predictive power than code complexity metrics and ii) the use of feature selection, reduction, and combination further improves the prediction performance.
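
As a hypothetical illustration of content-based features, here is a tiny multinomial Naive Bayes over tokens extracted from source files; the token sets and the classifier choice are illustrative, not the paper's exact setup:

```python
import math
import re
from collections import Counter

def tokenize(source):
    # Content-based features: words and identifiers in the file.
    return re.findall(r"[A-Za-z_]\w+", source)

def train_nb(files, labels):
    """Train a Laplace-smoothed multinomial Naive Bayes classifier that
    predicts defect-prone (1) vs. clean (0) from file content."""
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for src, y in zip(files, labels):
        toks = tokenize(src)
        counts[y].update(toks)
        totals[y] += len(toks)
    vocab = set(counts[0]) | set(counts[1])

    def predict(src):
        scores = {}
        for y in (0, 1):
            s = 0.0
            for t in tokenize(src):
                # Laplace-smoothed log-likelihood of each token.
                s += math.log((counts[y][t] + 1) / (totals[y] + len(vocab)))
            scores[y] = s
        return max(scores, key=scores.get)

    return predict

# Illustrative training data: the tokens here are made up for the example.
predict = train_nb(["socket recv buffer overflow", "render draw color"], [1, 0])
```

Unlike a lines-of-code metric, the classifier keys on which technical aspects a file touches, which is the paper's core assumption.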

[LG-77] Multi-hypotheses Conditioned Point Cloud Diffusion for 3D Human Reconstruction from Occluded Images NEURIPS2024

链接: https://arxiv.org/abs/2409.18364
作者: Donghwan Kim,Tae-Kyun Kim
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures, accepted NeurIPS 2024

点击查看摘要

[LG-78] Generative AI for fast and accurate Statistical Computation of Fluids

链接: https://arxiv.org/abs/2409.18359
作者: Roberto Molinaro,Samuel Lanthaler,Bogdan Raonić,Tobias Rohner,Victor Armegioiu,Zhong Yi Wan,Fei Sha,Siddhartha Mishra,Leonardo Zepeda-Núñez
关键词-EN: robust statistical computation, turbulent fluid flows, three-dimensional turbulent fluid, fluid flows, task of fast
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
*备注: 71 pages, 30 figures

点击查看摘要

Abstract:We present a generative AI algorithm for addressing the challenging task of fast, accurate and robust statistical computation of three-dimensional turbulent fluid flows. Our algorithm, termed as GenCFD, is based on a conditional score-based diffusion model. Through extensive numerical experimentation with both incompressible and compressible fluid flows, we demonstrate that GenCFD provides very accurate approximation of statistical quantities of interest such as mean, variance, point pdfs, higher-order moments, while also generating high quality realistic samples of turbulent fluid flows and ensuring excellent spectral resolution. In contrast, ensembles of operator learning baselines which are trained to minimize mean (absolute) square errors regress to the mean flow. We present rigorous theoretical results uncovering the surprising mechanisms through which diffusion models accurately generate fluid flows. These mechanisms are illustrated with solvable toy models that exhibit the relevant features of turbulent fluid flows while being amenable to explicit analytical formulas.

[LG-79] FedDCL: a federated data collaboration learning as a hybrid-type privacy-preserving framework based on federated learning and data collaboration

链接: https://arxiv.org/abs/2409.18356
作者: Akira Imakura,Tetsuya Sakurai
关键词-EN: privacy-preserving integrated analysis, enables integrated analysis, sharing raw data, federated learning, integrated analysis
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 18 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Recently, federated learning has attracted much attention as a privacy-preserving integrated analysis that enables integrated analysis of data held by multiple institutions without sharing raw data. On the other hand, federated learning requires iterative communication across institutions and is therefore hard to implement in situations where continuous communication with the outside world is extremely difficult. In this study, we propose a federated data collaboration learning (FedDCL), which solves such communication issues by combining federated learning with a recently proposed non-model-sharing type of federated learning named data collaboration analysis. In the proposed FedDCL framework, each user institution independently constructs dimensionality-reduced intermediate representations and shares them with neighboring institutions on intra-group DC servers. On each intra-group DC server, intermediate representations are transformed to incorporable forms called collaboration representations. Federated learning is then conducted between intra-group DC servers. The proposed FedDCL framework does not require iterative communication by user institutions and can be implemented in situations where continuous communication with the outside world is extremely difficult. The experimental results show that the performance of the proposed FedDCL is comparable to that of existing federated learning methods.
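
The data collaboration step, sharing only dimensionality-reduced intermediate representations and aligning them on a common anchor dataset, can be sketched as a minimal linear version (the subsequent federated learning between intra-group DC servers is omitted, and the reduction maps are random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
anchor = rng.standard_normal((30, 10))   # shareable anchor data, known to all
W = [rng.standard_normal((10, 5)) for _ in range(3)]  # each party's private reduction map

# Each institution shares only its reduced anchor representation.
reduced_anchor = [anchor @ Wi for Wi in W]

# The intra-group DC server learns maps G_i that align every party's
# representation to a common space (here: party 0's space) by least squares.
target = reduced_anchor[0]
G = [np.linalg.lstsq(R, target, rcond=None)[0] for R in reduced_anchor]

# An institution turns new local data into its collaboration representation
# without ever revealing the raw data or its private map.
X_local = rng.standard_normal((8, 10))
collab = (X_local @ W[2]) @ G[2]
```

The anchor data is the only thing every party sees in full, which is what lets the server align representations without any raw-data exchange.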

[LG-80] Benchmarking Graph Conformal Prediction: Empirical Analysis Scalability and Theoretical Insights

链接: https://arxiv.org/abs/2409.18332
作者: Pranav Maneriker,Aditya T. Vadlamani,Anutam Srinivasan,Yuntian He,Ali Payani,Srinivasan Parthasarathy
关键词-EN: machine learning models, learning models, increasingly popular, popular for quantifying, machine learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal prediction has become increasingly popular for quantifying the uncertainty associated with machine learning models. Recent work in graph uncertainty quantification has built upon this approach for conformal graph prediction. The nascent nature of these explorations has led to conflicting choices for implementations, baselines, and method evaluation. In this work, we analyze the design choices made in the literature and discuss the tradeoffs associated with existing methods. Building on existing implementations, we introduce techniques to scale these methods to large-scale graph datasets without sacrificing performance. Our theoretical and empirical results justify our recommendations for future scholarship in graph conformal prediction.

[LG-81] DMC-VB: A Benchmark for Representation Learning for Control with Visual Distractors NEURIPS2024

链接: https://arxiv.org/abs/2409.18330
作者: Joseph Ortiz,Antoine Dedieu,Wolfgang Lehrach,Swaroop Guntupalli,Carter Wendelken,Ahmad Humayun,Guangyao Zhou,Sivaramakrishnan Swaminathan,Miguel Lázaro-Gredilla,Kevin Murphy
关键词-EN: scaling generalist agents, expensive online learning, behavioral cloning, powerful recipe, recipe for scaling
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024 Datasets and Benchmarks Track. Dataset available at: this https URL

点击查看摘要

Abstract:Learning from previously collected data via behavioral cloning or offline reinforcement learning (RL) is a powerful recipe for scaling generalist agents by avoiding the need for expensive online learning. Despite strong generalization in some respects, agents are often remarkably brittle to minor visual variations in control-irrelevant factors such as the background or camera viewpoint. In this paper, we present the DeepMind Control Visual Benchmark (DMC-VB), a dataset collected in the DeepMind Control Suite to evaluate the robustness of offline RL agents for solving continuous control tasks from visual input in the presence of visual distractors. In contrast to prior works, our dataset (a) combines locomotion and navigation tasks of varying difficulties, (b) includes static and dynamic visual variations, (c) considers data generated by policies with different skill levels, (d) systematically returns pairs of state and pixel observation, (e) is an order of magnitude larger, and (f) includes tasks with hidden goals. Accompanying our dataset, we propose three benchmarks to evaluate representation learning methods for pretraining, and carry out experiments on several recently proposed methods. First, we find that pretrained representations do not help policy learning on DMC-VB, and we highlight a large representation gap between policies learned on pixel observations and on states. Second, we demonstrate that when expert data is limited, policy learning can benefit from representations pretrained on (a) suboptimal data, and (b) tasks with stochastic hidden goals. Our dataset and benchmark code to train and evaluate agents are available at: this https URL.

[LG-82] Towards the Mitigation of Confirmation Bias in Semi-supervised Learning: a Debiased Training Perspective

链接: https://arxiv.org/abs/2409.18316
作者: Yu Wang,Yuxuan Yin,Peng Li
关键词-EN: commonly exhibits confirmation, models disproportionately favor, exhibits confirmation bias, Semi-supervised learning, commonly exhibits
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Semi-supervised learning (SSL) commonly exhibits confirmation bias, where models disproportionately favor certain classes, leading to errors in predicted pseudo labels that accumulate under a self-training paradigm. Unlike supervised settings, which benefit from a rich, static data distribution, SSL inherently lacks mechanisms to correct this self-reinforced bias, necessitating debiased interventions at each training step. Although the generation of debiased pseudo labels has been extensively studied, their effective utilization remains underexplored. Our analysis indicates that data from biased classes should have a reduced influence on parameter updates, while more attention should be given to underrepresented classes. To address these challenges, we introduce TaMatch, a unified framework for debiased training in SSL. TaMatch employs a scaling ratio derived from both a prior target distribution and the model’s learning status to estimate and correct bias at each training step. This ratio adjusts the raw predictions on unlabeled data to produce debiased pseudo labels. In the utilization phase, these labels are differently weighted according to their predicted class, enhancing training equity and minimizing class bias. Additionally, TaMatch dynamically adjusts the target distribution in response to the model’s learning progress, facilitating robust handling of practical scenarios where the prior distribution is unknown. Empirical evaluations show that TaMatch significantly outperforms existing state-of-the-art methods across a range of challenging image classification tasks, highlighting the critical importance of both the debiased generation and utilization of pseudo labels in SSL.
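
The scaling-ratio idea can be sketched directly: rescale the model's raw predictions on unlabeled data by the ratio of the target prior to the model's current predicted marginal, renormalise, and threshold. The names and threshold below are illustrative; TaMatch's estimator and per-class weighting are more involved:

```python
import numpy as np

def debias_pseudo_labels(probs, target_prior, model_marginal, tau=0.95):
    """Rescale raw predictions on unlabeled data by the ratio of the
    target class prior to the model's current predicted marginal,
    renormalise, then keep only confident pseudo labels."""
    ratio = target_prior / np.maximum(model_marginal, 1e-8)
    adjusted = probs * ratio[None, :]
    adjusted /= adjusted.sum(axis=1, keepdims=True)
    labels = adjusted.argmax(axis=1)
    mask = adjusted.max(axis=1) >= tau   # pseudo labels that survive thresholding
    return labels, mask

probs = np.array([[0.70, 0.30],
                  [0.55, 0.45]])
target_prior = np.array([0.5, 0.5])
model_marginal = np.array([0.8, 0.2])    # the model over-predicts class 0
labels, mask = debias_pseudo_labels(probs, target_prior, model_marginal, tau=0.5)
```

With a balanced target prior and a marginal skewed toward class 0, both samples here flip to class 1, showing how the ratio counteracts the model's self-reinforced preference.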

[LG-83] Realistic Evaluation of Model Merging for Compositional Generalization

链接: https://arxiv.org/abs/2409.18314
作者: Derek Tam,Yash Kant,Brian Lester,Igor Gilitschenski,Colin Raffel
关键词-EN: cheaply combine individual, combine individual models, attains better performance, cheaply combine, combine individual
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup to test new methods.

[LG-84] Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation

链接: https://arxiv.org/abs/2409.18313
作者: Quanting Xie,So Yeon Min,Tianyi Zhang,Aarav Bajaj,Ruslan Salakhutdinov,Matthew Johnson-Roberson,Yonatan Bisk
关键词-EN: searchable and actionable, robot might explore, retrieval augmented generation, cs.RO, knowledge
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Web: this https URL

点击查看摘要

Abstract:There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval augmented generation (RAG) has become the workhorse of large-scale non-parametric knowledge; however, existing techniques do not directly transfer to the embodied domain, which is multimodal, where data is highly correlated, and where perception requires abstraction. To address these challenges, we introduce Embodied-RAG, a framework that enhances the foundational model of an embodied agent with a non-parametric memory system capable of autonomously constructing hierarchical knowledge for both navigation and language generation. Embodied-RAG handles a full range of spatial and semantic resolutions across diverse environments and query types, whether for a specific object or a holistic description of ambiance. At its core, Embodied-RAG’s memory is structured as a semantic forest, storing language descriptions at varying levels of detail. This hierarchical organization allows the system to efficiently generate context-sensitive outputs across different robotic platforms. We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 200 explanation and navigation queries across 19 environments, highlighting its promise as a general-purpose non-parametric system for embodied agents.

[LG-85] Harnessing Wavelet Transformations for Generalizable Deepfake Forgery Detection

链接: https://arxiv.org/abs/2409.18301
作者: Lalith Bharadwaj Baru,Shilhora Akshay Patel,Rohit Boddeda
关键词-EN: digital image manipulation, significantly challenges existing, deep generative models, challenges existing deepfake, significantly challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The evolution of digital image manipulation, particularly with the advancement of deep generative models, significantly challenges existing deepfake detection methods, especially when the origin of the deepfake is obscure. To tackle the increasing complexity of these forgeries, we propose Wavelet-CLIP, a deepfake detection framework that integrates wavelet transforms with features derived from the ViT-L/14 architecture, pre-trained in the CLIP fashion. Wavelet-CLIP utilizes Wavelet Transforms to deeply analyze both spatial and frequency features from images, thus enhancing the model’s capability to detect sophisticated deepfakes. To verify the effectiveness of our approach, we conducted extensive evaluations against existing state-of-the-art methods for cross-dataset generalization and detection of unseen images generated by standard diffusion models. Our method showcases outstanding performance, achieving an average AUC of 0.749 for cross-data generalization and 0.893 for robustness against unseen deepfakes, outperforming all compared methods. The code can be reproduced from the repo: this https URL

[LG-86] SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

链接: https://arxiv.org/abs/2409.18300
作者: Ruiqi Xian,Xiyang Wu,Tianrui Guan,Xijun Wang,Boqing Gong,Dinesh Manocha
关键词-EN: Unmanned Aerial Vehicles, aerial footage captured, Aerial Vehicles, Unmanned Aerial, captured by Unmanned
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We introduce SOAR, a novel Self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone outperforms the best UAV action recognition models, recording a 9.7% and 21.4% boost in top-1 accuracy on the NEC-Drone and UAV-Human datasets, while delivering an inference speed of 18.7ms per video, making it 2x to 5x faster. Additionally, SOAR obtains comparable accuracy to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage.
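
The object-aware masking strategy can be sketched directly: sample a standard MAE-style random mask but guarantee that some object patches stay visible. The function name, masking ratio, and patch layout below are illustrative assumptions, not SOAR's exact sampling scheme:

```python
import numpy as np

def object_aware_mask(n_patches, object_idx, mask_ratio=0.75,
                      keep_object_frac=0.5, seed=0):
    """Sample a MAE-style random mask, but force a fraction of the
    object patches to remain visible throughout pretraining."""
    rng = np.random.default_rng(seed)
    object_idx = np.asarray(object_idx)
    n_keep = int(np.ceil(keep_object_frac * len(object_idx)))
    forced_visible = rng.choice(object_idx, size=n_keep, replace=False)
    candidates = np.setdiff1d(np.arange(n_patches), forced_visible)
    n_mask = min(int(mask_ratio * n_patches), len(candidates))
    masked = rng.choice(candidates, size=n_mask, replace=False)
    mask = np.zeros(n_patches, dtype=bool)
    mask[masked] = True   # True = patch is masked out
    return mask

# 4x4 patch grid; suppose a detector says patches 5, 6, 9, 10 contain the object.
mask = object_aware_mask(16, object_idx=[5, 6, 9, 10])
```

Excluding the forced-visible patches from the masking pool keeps the overall masking ratio while ensuring the encoder always sees part of the object, which is the bias the paper aims to remove.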

[LG-87] Causality-based Subject and Task Fingerprints using fMRI Time-series Data

链接: https://arxiv.org/abs/2409.18298
作者: Dachuan Song,Li Shen,Duy Duong-Tran,Xuan Wang
关键词-EN: system neuroscience causation, unravel complex relationships, causation models due, neuroscience causation models, multi-scale brain networks
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Recently, there has been a revived interest in system neuroscience causation models due to their unique capability to unravel complex relationships in multi-scale brain networks. In this paper, our goal is to verify the feasibility and effectiveness of using a causality-based approach for fMRI fingerprinting. Specifically, we propose an innovative method that utilizes the causal dynamics activities of the brain to identify the unique cognitive patterns of individuals (e.g., subject fingerprint) and fMRI tasks (e.g., task fingerprint). The key novelty of our approach stems from the development of a two-timescale linear state-space model to extract ‘spatio-temporal’ (aka causal) signatures from an individual’s fMRI time series data. To the best of our knowledge, we pioneer and subsequently quantify, in this paper, the concept of ‘causal fingerprint.’ Our method is well-separated from other fingerprint studies as we quantify fingerprints from a cause-and-effect perspective, which are then incorporated with a modal decomposition and projection method to perform subject identification and a GNN-based (Graph Neural Network) model to perform task identification. Finally, experimental results and comparisons with non-causality-based methods demonstrate the effectiveness of the proposed methods. We visualize the obtained causal signatures and discuss their biological relevance in light of the existing understanding of brain functionalities. Collectively, our work paves the way for further studies on causal fingerprints with potential applications in both healthy controls and neurodegenerative diseases.
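
Under strong simplifying assumptions, the 'causal signature' idea can be sketched as fitting a single-timescale linear state-space model x_{t+1} ≈ A x_t to a time series and using the recovered transition matrix A as the fingerprint. The paper's model is a richer two-timescale variant; everything below is illustrative.

```python
import numpy as np

def fit_linear_dynamics(X):
    """Least-squares fit of x_{t+1} ~ A x_t from a (T, d) time series.

    The recovered transition matrix A serves as a simple stand-in for a
    'causal signature' of the process that generated the series.
    """
    past, future = X[:-1], X[1:]
    # Solve future = past @ A.T in the least-squares sense
    A_T, *_ = np.linalg.lstsq(past, future, rcond=None)
    return A_T.T

# Simulate a noisy linear system and recover its dynamics
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
x, traj = rng.normal(size=2), []
for _ in range(300):
    traj.append(x)
    x = A_true @ x + 0.05 * rng.normal(size=2)
X = np.array(traj)
A_est = fit_linear_dynamics(X)
```

Two recordings from the same subject would then be expected to yield closer transition matrices than recordings from different subjects.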

[LG-88] Enhancing Lossy Compression Through Cross-Field Information for Scientific Applications

链接: https://arxiv.org/abs/2409.18295
作者: Youyuan Liu,Wenqi Jia,Taolue Yang,Miao Yin,Sian Jin
关键词-EN: effective methods, methods for reducing, reducing the size, data, multiple data fields
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 9 pages, 9 figures, accepted by DRBSD-10

点击查看摘要

Abstract:Lossy compression is one of the most effective methods for reducing the size of scientific data containing multiple data fields. It reduces information density through prediction or transformation techniques to compress the data. Previous approaches use local information from a single target field when predicting target data points, limiting their potential to achieve higher compression ratios. In this paper, we identified significant cross-field correlations within scientific datasets. We propose a novel hybrid prediction model that utilizes CNN to extract cross-field information and combine it with existing local field information. Our solution enhances the prediction accuracy of lossy compressors, leading to improved compression ratios without compromising data quality. We evaluate our solution on three scientific datasets, demonstrating its ability to improve compression ratios by up to 25% under specific error bounds. Additionally, our solution preserves more data details and reduces artifacts compared to baseline approaches.
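
The prediction-plus-quantization mechanism behind error-bounded lossy compressors can be sketched as follows. This is illustrative only: the paper's predictor is a CNN over cross-field data, replaced here by a hypothetical linear cross-field predictor. The better the predictor, the smaller the residual codes, and hence the higher the achievable compression ratio.

```python
import numpy as np

def compress_with_predictor(data, predictor, error_bound):
    """Prediction-based error-bounded quantization sketch.

    Each residual (true value minus prediction) is quantized to an
    integer multiple of 2*error_bound, which guarantees
    |decoded - data| <= error_bound pointwise.
    """
    pred = predictor(data)
    codes = np.round((data - pred) / (2 * error_bound)).astype(int)
    decoded = pred + codes * 2 * error_bound
    return codes, decoded

# Hypothetical cross-field setup: a strongly correlated auxiliary field,
# linearly rescaled, predicts the target field almost exactly.
aux = np.linspace(0.0, 1.0, 100)
target = 3.0 * aux + 0.001 * np.sin(50 * aux)
codes, decoded = compress_with_predictor(target, lambda d: 3.0 * aux, 1e-2)
```

With a good cross-field predictor nearly all codes collapse to zero; entropy coding of the codes is where the actual size reduction happens.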

[LG-89] Criticality and Safety Margins for Reinforcement Learning

链接: https://arxiv.org/abs/2409.18289
作者: Alexander Grushin,Walt Woods,Alvaro Velasquez,Simon Khan
关键词-EN: art reinforcement learning, reinforcement learning methods, encounter unsafe situations, art reinforcement, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 17 pages, 10 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:State of the art reinforcement learning methods sometimes encounter unsafe situations. Identifying when these situations occur is of interest both for post-hoc analysis and during deployment, where it might be advantageous to call out to a human overseer for help. Efforts to gauge the criticality of different points in time have been developed, but their accuracy is not well established due to a lack of ground truth, and they are not designed to be easily interpretable by end users. Therefore, we seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality. Safety margins make these interpretable, when defined as the number of random actions for which performance loss will not exceed some tolerance with high confidence. We demonstrate this approach in several environment-agent combinations; for an A3C agent in an Atari Beamrider environment, the lowest 5% of safety margins contain 47% of agent losses; i.e., supervising only 5% of decisions could potentially prevent roughly half of an agent’s errors. This criticality framework measures the potential impacts of bad decisions, even before those decisions are made, allowing for more effective debugging and oversight of autonomous agents.
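
True criticality, defined above as the expected drop in return when the agent takes n consecutive random actions before resuming its policy, can be estimated by Monte Carlo rollouts. The toy chain environment below is illustrative, not from the paper.

```python
import random

def rollout(env_step, policy, state, horizon, n_random=0, rng=None):
    """Total reward over `horizon` steps, taking `n_random` random
    actions first and then following the policy."""
    total = 0.0
    for t in range(horizon):
        if t < n_random:
            action = rng.choice([-1, +1])
        else:
            action = policy(state)
        state, reward = env_step(state, action)
        total += reward
    return total

def true_criticality(env_step, policy, state, horizon, n, trials, seed=0):
    """Expected drop in return when deviating for n random actions."""
    rng = random.Random(seed)
    base = rollout(env_step, policy, state, horizon)
    drops = [base - rollout(env_step, policy, state, horizon, n, rng)
             for _ in range(trials)]
    return sum(drops) / trials

# Toy chain: moving right (+1) earns reward 1, moving left earns 0.
def env_step(state, action):
    return state + action, 1.0 if action == 1 else 0.0

crit = true_criticality(env_step, lambda s: 1, 0, horizon=10, n=2, trials=500)
```

Here each of the two random actions forfeits reward with probability 0.5, so the estimate converges to a true criticality of about 1.0; a proxy criticality metric would approximate this quantity without the expensive rollouts.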

[LG-90] SLIDE: A machine-learning based method for forced dynamic response estimation of multibody systems

链接: https://arxiv.org/abs/2409.18272
作者: Peter Manzl,Alexander Humer,Qasim Khadim,Johannes Gerstmayr
关键词-EN: Initially-truncated Dynamic-response Estimator, SLiding-window Initially-truncated Dynamic-response, computational engineering, perpetual goal, speed and efficiency
类目: Machine Learning (cs.LG)
*备注: Paper currently in submission for journal publication

点击查看摘要

Abstract:In computational engineering, enhancing the simulation speed and efficiency is a perpetual goal. To fully take advantage of neural network techniques and hardware, we present the SLiding-window Initially-truncated Dynamic-response Estimator (SLIDE), a deep learning-based method designed to estimate output sequences of mechanical or multibody systems with primarily, but not exclusively, forced excitation. A key advantage of SLIDE is its ability to estimate the dynamic response of damped systems without requiring the full system state, making it particularly effective for flexible multibody systems. The method truncates the output window based on the decay of initial effects, such as damping, which is approximated by the complex eigenvalues of the system’s linearized equations. In addition, a second neural network is trained to provide an error estimation, further enhancing the method’s applicability. The method is applied to a diverse selection of systems, including the Duffing oscillator, a flexible slider-crank system, and an industrial 6R manipulator, mounted on a flexible socket. Our results demonstrate significant speedups over simulation of up to several million, substantially exceeding real-time performance.
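
The truncation rule can be sketched from the description above: the initial transient decays roughly like exp(Re(λ)·t) for the slowest (least negative real part) eigenvalue of the linearized system, so the number of output samples to discard follows from a decay tolerance. This is a hypothetical helper; the paper may use a different criterion.

```python
import numpy as np

def truncation_steps(eigvals, dt, tol=0.01):
    """Number of initial output samples to truncate, assuming the
    transient decays like exp(Re(lambda) * t) for the slowest
    eigenvalue of the linearized system."""
    slowest = max(ev.real for ev in eigvals)
    if slowest >= 0:
        raise ValueError("system must be asymptotically stable")
    t_settle = np.log(tol) / slowest  # time until transient < tol
    return int(np.ceil(t_settle / dt))

# Hypothetical damped-oscillator eigenvalues: -0.5 +/- 3j and -2.0
steps = truncation_steps([-0.5 + 3j, -0.5 - 3j, -2.0], dt=0.01)
```

Only outputs after this settling window need to be estimated by the network, which is what makes the full system state unnecessary.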

[LG-91] Using dynamic loss weighting to boost improvements in forecast stability

链接: https://arxiv.org/abs/2409.18267
作者: Daan Caljon,Jeff Vercauteren,Simon De Vos,Wouter Verbeke,Jente Van Belle
关键词-EN: Rolling origin forecast, specific period induced, Rolling origin, origin forecast instability, forecast instability refers
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Rolling origin forecast instability refers to variability in forecasts for a specific period induced by updating the forecast when new data points become available. Recently, an extension to the N-BEATS model for univariate time series point forecasting was proposed to include forecast stability as an additional optimization objective, next to accuracy. It was shown that more stable forecasts can be obtained without harming accuracy by minimizing a composite loss function that contains both a forecast error and a forecast instability component, with a static hyperparameter to control the impact of stability. In this paper, we empirically investigate whether further improvements in stability can be obtained without compromising accuracy by applying dynamic loss weighting algorithms, which change the loss weights during training. We show that some existing dynamic loss weighting methods achieve this objective. However, our proposed extension to the Random Weighting approach – Task-Aware Random Weighting – shows the best performance.
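
The composite objective can be sketched as a weighted sum of an accuracy term and an instability term, with a dynamic scheme drawing fresh loss weights every training step. Below is plain Random Weighting (a softmax over Gaussian draws); the paper's Task-Aware variant changes how the weights are drawn and is not reproduced here.

```python
import numpy as np

def composite_loss(err, instab, weights):
    """Weighted sum of a forecast-error term and an instability term."""
    return weights[0] * err + weights[1] * instab

def random_weights(rng, n_tasks=2):
    """Random Weighting: draw loss weights each step from a softmax
    over standard-normal samples, so weights are positive and sum to 1."""
    z = rng.normal(size=n_tasks)
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
w = random_weights(rng)
loss = composite_loss(2.0, 1.0, w)  # hypothetical per-batch loss values
```

Compared with a static hyperparameter, redrawing the weights each step varies which objective dominates each update while keeping both in play on average.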

[LG-92] Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning NEURIPS2024

链接: https://arxiv.org/abs/2409.18265
作者: Grzegorz Rypeść,Sebastian Cygert,Tomasz Trzciński,Bartłomiej Twardowski
关键词-EN: Class Incremental Learning, Exemplar-Free Class Incremental, Exemplar-Free Class, Incremental Learning, Class Incremental
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for NeurIPS 2024

点击查看摘要

Abstract:Exemplar-Free Class Incremental Learning (EFCIL) tackles the problem of training a model on a sequence of tasks without access to past data. Existing state-of-the-art methods represent classes as Gaussian distributions in the feature extractor’s latent space, enabling Bayes classification or training the classifier by replaying pseudo features. However, we identify two critical issues that compromise their efficacy when the feature extractor is updated on incremental tasks. First, they do not consider that classes’ covariance matrices change and must be adapted after each task. Second, they are susceptible to a task-recency bias caused by dimensionality collapse occurring during training. In this work, we propose AdaGauss – a novel method that adapts covariance matrices from task to task and mitigates the task-recency bias owing to the additional anti-collapse loss function. AdaGauss yields state-of-the-art results on popular EFCIL benchmarks and datasets when training from scratch or starting from a pre-trained backbone. The code is available at: this https URL.
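
Representing classes as Gaussians in latent space enables a simple Bayes classifier over log-likelihood scores, as sketched below. This is illustrative only; AdaGauss's actual contribution, adapting each class covariance after every task and adding an anti-collapse loss, is not shown.

```python
import numpy as np

def gaussian_scores(x, means, covs):
    """Log-likelihood of feature vector x under each class Gaussian
    (up to a shared constant), as used for Bayes classification."""
    scores = []
    for mu, cov in zip(means, covs):
        diff = x - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * (diff @ inv @ diff + logdet))
    return np.array(scores)

# Two classes with distinct means and identity covariances
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
pred = int(np.argmax(gaussian_scores(np.array([3.5, 4.2]), means, covs)))
```

The two failure modes identified in the abstract correspond to these stored covariances going stale after the feature extractor is updated, and to the latent space collapsing so that the covariances become ill-conditioned.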

[LG-93] DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking

链接: https://arxiv.org/abs/2409.18263
作者: Devrim Cavusoglu,Secil Sen,Ulas Sert
关键词-EN: Natural Language Processing, natural language inference, Natural Language, Recent advancements, impacted numerous sub-fields
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in Natural Language Processing (NLP) have impacted numerous sub-fields such as natural language generation, natural language inference, question answering, and more. However, in the field of question generation, the creation of distractors for multiple-choice questions (MCQ) remains a challenging task. In this work, we present a simple, generic framework for distractor generation using readily available Pre-trained Language Models (PLMs). Unlike previous methods, our framework relies solely on pre-trained language models and does not require additional training on specific datasets. Building upon previous research, we introduce a two-stage framework consisting of candidate generation and candidate selection. Our proposed distractor generation framework outperforms previous methods without the need for training or fine-tuning. Human evaluations confirm that our approach produces more effective and engaging distractors. The related codebase is publicly available at this https URL.
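
The two-stage framework can be skeletonized as below, with a stand-in candidate pool and plausibility scorer; in the actual framework both stages are driven by a pre-trained masked language model filling the answer span.

```python
def generate_distractors(answer, candidates, score, k=3):
    """Two-stage sketch: (1) candidate generation supplies phrases that
    fit the masked answer span; (2) candidate selection ranks them by a
    plausibility score and drops duplicates of the correct answer."""
    filtered = [c for c in candidates if c.lower() != answer.lower()]
    ranked = sorted(filtered, key=score, reverse=True)
    return ranked[:k]

# Stand-in candidate pool and scorer for a question whose answer is
# "Paris" (a real system would score candidates with a PLM).
pool = ["Paris", "Lyon", "Berlin", "Madrid", "paris"]
plausibility = {"Lyon": 0.9, "Berlin": 0.7, "Madrid": 0.6}
distractors = generate_distractors("Paris", pool,
                                   lambda c: plausibility.get(c, 0.0))
```

Because both stages only query a frozen PLM, no training or fine-tuning on a distractor dataset is required, which is the point made above.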

[LG-94] Development of an Edge Resilient ML Ensemble to Tolerate ICS Adversarial Attacks

链接: https://arxiv.org/abs/2409.18244
作者: Likai Yao,Qinxuan Shi,Zhanglong Yang,Sicong Shao,Salim Hariri
关键词-EN: data-driven applications systems, Deploying machine learning, dynamic data-driven applications, Deploying machine, applications systems
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by Dynamic Data Driven Applications Systems: International Conference, DDDAS, Springer. 2024

点击查看摘要

Abstract:Deploying machine learning (ML) in dynamic data-driven applications systems (DDDAS) can improve the security of industrial control systems (ICS). However, ML-based DDDAS are vulnerable to adversarial attacks because adversaries can alter the input data slightly so that the ML models predict a different result. In this paper, our goal is to build a resilient edge machine learning (reML) architecture that is designed to withstand adversarial attacks by performing Data Air Gap Transformation (DAGT) to anonymize data feature spaces using deep neural networks and randomize the ML models used for predictions. The reML is based on the Resilient DDDAS paradigm, Moving Target Defense (MTD) theory, and TinyML and is applied to combat adversarial attacks on ICS. Furthermore, the proposed approach is power-efficient and privacy-preserving and, therefore, can be deployed on power-constrained devices to enhance ICS security. This approach enables resilient ML inference at the edge by shifting the computation from the computing-intensive platforms to the resource-constrained edge devices. The incorporation of TinyML with TensorFlow Lite ensures efficient resource utilization and, consequently, makes reML suitable for deployment in various industrial control environments. Furthermore, the dynamic nature of reML, facilitated by the resilient DDDAS development environment, allows for continuous adaptation and improvement in response to emerging threats. Lastly, we evaluate our approach on an ICS dataset and demonstrate that reML provides a viable and effective solution for resilient ML inference at the edge devices.

[LG-95] Towards sub-millisecond latency real-time speech enhancement models on hearables

链接: https://arxiv.org/abs/2409.18239
作者: Artem Dementyev,Chandan K. A. Reddy,Scott Wisdom,Navin Chatlani,John R. Hershey,Richard F. Lyon
关键词-EN: Low latency models, speech enhancement applications, real-time speech enhancement, Low latency, critical for real-time
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Low latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech enhancement using a computationally efficient minimum-phase FIR filter, enabling sample-by-sample processing to achieve mean algorithmic latency of 0.32 ms to 1.25 ms. With a single microphone, we observe a mean SI-SDRi of 4.1 dB. The approach shows generalization with a DNSMOS increase of 0.2 on unseen audio recordings. We use a lightweight LSTM-based model of 644k parameters to generate FIR taps. Benchmarks show that our system can run on a low-power DSP with 388 MIPS at a mean end-to-end latency of 3.35 ms. We provide a comparison with baseline low-latency spectral masking techniques. We hope this work will enable a better understanding of latency and can be used to improve the comfort and usability of hearables.
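
Sample-by-sample FIR filtering is what keeps algorithmic latency in the sub-millisecond range: each input sample yields an output immediately, so latency is dominated by the filter's group delay (small for a minimum-phase design). The sketch below uses fixed moving-average taps as a stand-in; in the paper the taps are regenerated by a small LSTM.

```python
import numpy as np

class StreamingFIR:
    """Sample-by-sample FIR filter: each incoming sample immediately
    produces an output, unlike block-based spectral masking which must
    buffer a whole frame before processing."""
    def __init__(self, taps):
        self.taps = np.asarray(taps, dtype=float)
        self.state = np.zeros(len(taps))

    def process(self, sample):
        # Shift the delay line and insert the newest sample at the front
        self.state = np.roll(self.state, 1)
        self.state[0] = sample
        return float(self.taps @ self.state)

# 4-tap moving average as stand-in taps
fir = StreamingFIR([0.25, 0.25, 0.25, 0.25])
out = [fir.process(s) for s in [4.0, 4.0, 4.0, 4.0, 4.0]]
```

At a 16 kHz sample rate one sample is 0.0625 ms, so per-sample processing is compatible with the latency budget discussed above.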

[LG-96] Spatial Visibility and Temporal Dynamics: Revolutionizing Field of View Prediction in Adaptive Point Cloud Video Streaming

链接: https://arxiv.org/abs/2409.18236
作者: Chen Li,Tongyu Zong,Yueyu Hu,Yao Wang,Yong Liu
关键词-EN: reduces bandwidth requirement, transmitting visible points, adaptive streaming significantly, streaming significantly reduces, significantly reduces bandwidth
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Field-of-View (FoV) adaptive streaming significantly reduces bandwidth requirement of immersive point cloud video (PCV) by only transmitting visible points in a viewer’s FoV. The traditional approaches often focus on trajectory-based 6 degree-of-freedom (6DoF) FoV predictions. The predicted FoV is then used to calculate point visibility. Such approaches do not explicitly consider video content’s impact on viewer attention, and the conversion from FoV to point visibility is often error-prone and time-consuming. We reformulate the PCV FoV prediction problem from the cell visibility perspective, allowing for precise decision-making regarding the transmission of 3D data at the cell level based on the predicted visibility distribution. We develop a novel spatial visibility and object-aware graph model that leverages the historical 3D visibility data and incorporates spatial perception, neighboring cell correlation, and occlusion information to predict the cell visibility in the future. Our model significantly improves the long-term cell visibility prediction, reducing the prediction MSE loss by up to 50% compared to the state-of-the-art models while maintaining real-time performance (more than 30fps) for point cloud videos with over 1 million points.

[LG-97] Visual Concept Networks: A Graph-Based Approach to Detecting Anomalous Data in Deep Neural Networks

链接: https://arxiv.org/abs/2409.18235
作者: Debargha Ganguly,Debayan Gupta,Vipin Chaudhary
关键词-EN: Deep neural networks, Deep neural, struggle with robustness, increasingly deployed, robustness against anomalous
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs), while increasingly deployed in many applications, struggle with robustness against anomalous and out-of-distribution (OOD) data. Current OOD benchmarks often oversimplify, focusing on single-object tasks and not fully representing complex real-world anomalies. This paper introduces a new, straightforward method employing graph structures and topological features to effectively detect both far-OOD and near-OOD data. We convert images into networks of interconnected human understandable features or visual concepts. Through extensive testing on two novel tasks, including ablation studies with large vocabularies and diverse tasks, we demonstrate the method’s effectiveness. This approach enhances DNN resilience to OOD data and promises improved performance in various applications.

[LG-98] Revolutionizing Payload Inspection: A Self-Supervised Journey to Precision with Few Shots

链接: https://arxiv.org/abs/2409.18219
作者: Kyle Stein,Arash Mahyari,Guillermo Francia III,Eman El-Sheikh
关键词-EN: continue to expand, malware detection methods, Deep Packet Inspection, networks continue, malware detection
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As networks continue to expand and become more interconnected, the need for novel malware detection methods becomes more pronounced. Traditional security measures are increasingly inadequate against the sophistication of modern cyber attacks. Deep Packet Inspection (DPI) has been pivotal in enhancing network security, offering an in-depth analysis of network traffic that surpasses conventional monitoring techniques. DPI not only examines the metadata of network packets, but also dives into the actual content being carried within the packet payloads, providing a comprehensive view of the data flowing through networks. The integration of advanced deep learning techniques with DPI has introduced modern methodologies into malware detection. However, the challenge with the state-of-the-art supervised learning approaches is that they prevent the generalization to unseen attacks embedded in the payloads, prohibiting them from accurately detecting new attacks and transferring knowledge learned from previous attacks to the new attacks with small labeled sample sizes. This paper leverages the recent advancements in self-supervised learning and few-shot learning. Our proposed self-supervised approach trains a transformer to learn the embedding of the payloads from a vast amount of unlabeled datasets by masking portions of payloads, leading to a learnt representation that well generalizes to various downstream tasks. Once the representation is extracted from payloads, they are used to train a malware detection algorithm. The representation obtained from the transformer is then used to adapt the malware detector to novel types of attacks using few-shot learning approaches. Our experimental results across several datasets show the great success and generalization of the proposed approach to novel scenarios.

[LG-99] Learning to Drive via Asymmetric Self-Play ECCV2024

链接: https://arxiv.org/abs/2409.18218
作者: Chris Zhang,Sourav Biswas,Kelvin Wong,Kion Fallah,Lunjun Zhang,Dian Chen,Sergio Casas,Raquel Urtasun
关键词-EN: Large-scale data, crucial for learning, Large-scale, data, real data
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Large-scale data is crucial for learning realistic and capable driving policies. However, it can be impractical to rely on scaling datasets with real data alone. The majority of driving data is uninteresting, and deliberately collecting new long-tail scenarios is expensive and unsafe. We propose asymmetric self-play to scale beyond real data with additional challenging, solvable, and realistic synthetic scenarios. Our approach pairs a teacher that learns to generate scenarios it can solve but the student cannot, with a student that learns to solve them. When applied to traffic simulation, we learn realistic policies with significantly fewer collisions in both nominal and long-tail scenarios. Our policies further zero-shot transfer to generate training data for end-to-end autonomy, significantly outperforming state-of-the-art adversarial approaches, or using real data alone. For more information, visit this https URL .

[LG-100] MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark

链接: https://arxiv.org/abs/2409.18216
作者: Elliot L. Epstein,Kaisheng Yao,Jing Li,Xinyi Bai,Hamid Palangi
关键词-EN: Evaluating instruction, instructions, PIF, capabilities for multimodal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 24 pages, 16 figures

点击查看摘要

Abstract:Evaluating instruction following capabilities for multimodal, multi-turn dialogue is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters and we show LLM based judges are biased towards answers from the same model. We propose MMMT-IF, an image based multi-turn Q&A evaluation set with added global instructions between questions, constraining the answer format. This challenges models to retrieve instructions dispersed across long dialogues and reason under instruction constraints. All instructions are objectively verifiable through code execution. We introduce the Programmatic Instruction Following (PIF) metric to measure the fraction of the instructions that are correctly followed while performing a reasoning task. The PIF-N-K set of metrics further evaluates robustness by measuring the fraction of samples in a corpus where, for each sample, at least K out of N generated model responses achieve a PIF score of one. The PIF metric aligns with human instruction following ratings, showing 60 percent correlation. Experiments show Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet have a PIF metric that drops from 0.81 on average at turn 1 across the models, to 0.64 at turn 20. Across all turns, when each response is repeated 4 times (PIF-4-4), GPT-4o and Gemini successfully follow all instructions only 11% of the time. When all the instructions are also appended to the end of the model input context, the PIF metric improves by 22.3 points on average, showing that the challenge with the task lies not only in following the instructions, but also in retrieving the instructions spread out in the model context. We plan to open source the MMMT-IF dataset and metric computation code.
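
The two metrics are straightforward to compute once each instruction's satisfaction has been verified by code execution; a minimal sketch (variable names are illustrative):

```python
def pif(followed):
    """PIF: fraction of instructions correctly followed in one response.
    `followed` is a list of booleans, one per verifiable instruction."""
    return sum(followed) / len(followed)

def pif_n_k(per_sample_scores, k):
    """PIF-N-K: fraction of samples where at least K of the N generated
    responses achieve a PIF score of exactly one."""
    hits = [sum(1 for s in scores if s == 1.0) >= k
            for scores in per_sample_scores]
    return sum(hits) / len(hits)

# Two samples, four generated responses each (PIF per response)
scores = [
    [1.0, 1.0, 0.5, 1.0],  # 3 of 4 responses follow every instruction
    [1.0, 0.0, 0.0, 0.0],  # only 1 of 4
]
robust = pif_n_k(scores, k=3)
```

Requiring a PIF of exactly one in PIF-N-K is what makes it a robustness metric: partially followed instruction sets do not count.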

[LG-101] Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey

链接: https://arxiv.org/abs/2409.18214
作者: Yi Zhang,Zhen Chen,Chih-Hong Cheng,Wenjie Ruan,Xiaowei Huang,Dezong Zhao,David Flynn,Siddartha Khastgir,Xingyu Zhao
关键词-EN: Diffusion Models, garnered widespread attention, image generation, garnered widespread, widespread attention
类目: Machine Learning (cs.LG)
*备注: under review

点击查看摘要

Abstract:Text-to-Image (T2I) Diffusion Models (DMs) have garnered widespread attention for their impressive advancements in image generation. However, their growing popularity has raised ethical and social concerns related to key non-functional properties of trustworthiness, such as robustness, fairness, security, privacy, factuality, and explainability, similar to those in traditional deep learning (DL) tasks. Conventional approaches for studying trustworthiness in DL tasks often fall short due to the unique characteristics of T2I DMs, e.g., the multi-modal nature. Given the challenge, recent efforts have been made to develop new methods for investigating trustworthiness in T2I DMs via various means, including falsification, enhancement, verification & validation, and assessment. However, there is a notable lack of in-depth analysis concerning those non-functional properties and means. In this survey, we provide a timely and focused review of the literature on trustworthy T2I DMs, covering a concise-structured taxonomy from the perspectives of property, means, benchmarks and applications. Our review begins with an introduction to essential preliminaries of T2I DMs, and then we summarise key definitions/metrics specific to T2I tasks and analyse the means proposed in recent literature based on these definitions/metrics. Additionally, we review benchmarks and domain applications of T2I DMs. Finally, we highlight the gaps in current research, discuss the limitations of existing methods, and propose future research directions to advance the development of trustworthy T2I DMs. Furthermore, we keep up-to-date updates in this field to track the latest developments and maintain our GitHub repository at: this https URL

[LG-102] Bridging OOD Detection and Generalization: A Graph-Theoretic View NEURIPS2024

链接: https://arxiv.org/abs/2409.18205
作者: Han Wang,Yixuan Li
关键词-EN: modern machine learning, diverse data shifts, encounter diverse data, machine learning, models deployed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024. arXiv admin note: text overlap with arXiv:2310.06221 by other authors

点击查看摘要

Abstract:In the context of modern machine learning, models deployed in real-world scenarios often encounter diverse data shifts like covariate and semantic shifts, leading to challenges in both out-of-distribution (OOD) generalization and detection. Despite considerable attention to these issues separately, a unified framework for theoretical understanding and practical usage is lacking. To bridge the gap, we introduce a graph-theoretic framework to jointly tackle both OOD generalization and detection problems. By leveraging the graph formulation, data representations are obtained through the factorization of the graph’s adjacency matrix, enabling us to derive provable error quantifying OOD generalization and detection performance. Empirical results showcase competitive performance in comparison to existing methods, thereby validating our theoretical underpinnings. Code is publicly available at this https URL.
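
A minimal sketch of obtaining data representations by factorizing a graph adjacency matrix (illustrative only; the paper's formulation builds a specific graph over the data, which is not reproduced here): the top eigenvectors, scaled by the square root of their eigenvalues, give embeddings whose outer product approximates the graph.

```python
import numpy as np

def spectral_representations(adj, dim):
    """Representations from factorizing a symmetric adjacency matrix:
    top eigenvectors scaled by sqrt(eigenvalue), so that
    reps @ reps.T approximates the positive part of the graph."""
    vals, vecs = np.linalg.eigh(adj)
    order = np.argsort(vals)[::-1][:dim]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))

# Two disconnected pairs: representations separate the two groups
adj = np.zeros((4, 4))
adj[0, 1] = adj[1, 0] = 1.0
adj[2, 3] = adj[3, 2] = 1.0
reps = spectral_representations(adj, dim=2)
```

Nodes connected in the graph receive identical embeddings here, while unconnected groups occupy different directions, which is the kind of structure that supports reasoning jointly about OOD generalization and detection.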

[LG-103] AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking

链接: https://arxiv.org/abs/2409.18203
作者: Michelle S. Lam,Fred Hohman,Dominik Moritz,Jeffrey P. Bigham,Kenneth Holstein,Mary Beth Kery
关键词-EN: large language model, implicit reward model, large language, explicit constitution, implicit reward
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Whether a large language model policy is an explicit constitution or an implicit reward model, it is challenging to assess coverage over the unbounded set of real-world situations that a policy must contend with. We introduce an AI policy design process inspired by mapmaking, which has developed tactics for visualizing and iterating on maps even when full coverage is not possible. With Policy Projector, policy designers can survey the landscape of model input-output pairs, define custom regions (e.g., “violence”), and navigate these regions with rules that can be applied to LLM outputs (e.g., if output contains “violence” and “graphic details,” then rewrite without “graphic details”). Policy Projector supports interactive policy authoring using LLM classification and steering and a map visualization reflecting the policy designer’s work. In an evaluation with 12 AI safety experts, our system helps policy designers to address problematic model behaviors extending beyond an existing, comprehensive harm taxonomy.

[LG-104] Autonomous Network Defence using Reinforcement Learning

链接: https://arxiv.org/abs/2409.18197
作者: Myles Foley,Chris Hicks,Kate Highnam,Vasilios Mavroudis
关键词-EN: security arms race, network security arms, arms race, security arms, defender is significantly
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the network security arms race, the defender is significantly disadvantaged as they need to successfully detect and counter every malicious attack. In contrast, the attacker needs to succeed only once. To level the playing field, we investigate the effectiveness of autonomous agents in a realistic network defence scenario. We first outline the problem, provide the background on reinforcement learning and detail our proposed agent design. Using a network environment simulation, with 13 hosts spanning 3 subnets, we train a novel reinforcement learning agent and show that it can reliably defend continual attacks by two advanced persistent threat (APT) red agents: one with complete knowledge of the network layout and another which must discover resources through exploration but is more general.

[LG-105] Harmful Fine-tuning Attacks and Defenses for Large Language Models : A Survey

链接: https://arxiv.org/abs/2409.18169
作者: Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Ling Liu
关键词-EN: Recent research demonstrates, harmful data uploaded, business model exposes, Recent research, harmful fine-tuning attack
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns – fine-tuning over a few harmful data uploaded by the users can compromise the safety alignment of the model. The attack, known as harmful fine-tuning, has raised a broad research interest among the community. However, as the attack is still new, we observe from our miserable submission experience that there are general misunderstandings within the research community. We in this paper aim to clear some common concerns for the attack setting, and formally establish the research problem. Specifically, we first present the threat model of the problem, and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks/defenses/mechanical analysis of the problem. Finally, we outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which might be useful to refer to when reviewers in the peer review process question the realism of the experiment/attack/defense setting. A curated list of relevant papers is maintained and made accessible at: this https URL.

[LG-106] Jump Diffusion-Informed Neural Networks with Transfer Learning for Accurate American Option Pricing under Data Scarcity

链接: https://arxiv.org/abs/2409.18168
作者: Qiguo Sun,Hanyue Huang,XiBei Yang,Yuwei Zhang
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-107] Data-Prep-Kit: getting your data ready for LLM application development

链接: https://arxiv.org/abs/2409.18164
作者: David Wood,Boris Lublinsky,Alexy Roytman,Shivdeep Singh,Abdulhamid Adebayo,Revital Eres,Mohammad Nassar,Hima Patel,Yousaf Shah,Constantin Adam,Petros Zerfos,Nirmit Desai,Daiki Tsuzuku,Takuya Goto,Michele Dolfi,Saptha Surendran,Paramesvaran Selvam,Sungeun An,Yuan Chi Chang,Dhiraj Joshi,Hajar Emami-Gohari,Xuan-Hong Dang,Yan Koyfman,Shahrokh Daijavad
关键词-EN: Data Prep Kit, DPK, Data preparation, Prep Kit, Data
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Data preparation is the first and a very important step towards any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU Cores. DPK comes with a highly scalable, yet extensible set of modules that transform natural language and code data. If the user needs additional transforms, they can be easily developed using extensive DPK support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLM models or to fine-tune models with Retrieval-Augmented Generation (RAG).
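The abstract describes composing independent transform modules into a pipeline. Below is a hypothetical sketch of that pipelined-transform pattern in Python; the names (`Transform`, `Pipeline`, `Lowercase`, `DropEmpty`) are illustrative and are not DPK's actual API.

```python
# Hypothetical sketch of the pipelined-transform pattern described above.
# Class and method names are illustrative, not DPK's real interface.

class Transform:
    """A single data-preparation step mapping a list of records to a new list."""
    def apply(self, records):
        raise NotImplementedError

class Lowercase(Transform):
    def apply(self, records):
        return [r.lower() for r in records]

class DropEmpty(Transform):
    def apply(self, records):
        return [r for r in records if r.strip()]

class Pipeline:
    """Runs transforms in sequence; each step consumes the previous output."""
    def __init__(self, transforms):
        self.transforms = transforms

    def run(self, records):
        for t in self.transforms:
            records = t.apply(records)
        return records

pipeline = Pipeline([DropEmpty(), Lowercase()])
print(pipeline.run(["Hello World", "   ", "LLM Data"]))  # ['hello world', 'llm data']
```

Because each transform only depends on the record interface, steps can be used independently or reordered, which matches the modular design the paper describes.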

[LG-108] A Survey on Neural Architecture Search Based on Reinforcement Learning

链接: https://arxiv.org/abs/2409.18163
作者: Wenzhu Shao
关键词-EN: Neural Architecture Search, Architecture Search, Neural Architecture, feature extraction, extraction of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The automation of feature extraction in machine learning has been successfully realized by the explosive development of deep learning. However, the structures and hyperparameters of deep neural network architectures also make a huge difference in performance across different tasks. The process of exploring optimal structures and hyperparameters often involves a great deal of tedious human intervention. As a result, a legitimate question is to ask for the automation of searching for optimal network structures and hyperparameters. The automation of exploring optimal hyperparameters is handled by Hyperparameter Optimization. Neural Architecture Search aims to automatically find the best network structure for a given task. In this paper, we first introduce the overall development of Neural Architecture Search and then focus mainly on providing an overall and understandable survey of Neural Architecture Search works relevant to reinforcement learning, including improvements and variants motivated by the need to handle more complex structures and resource-constrained environments.

[LG-109] Most Influential Subset Selection: Challenges Promises and Beyond NEURIPS2024

链接: https://arxiv.org/abs/2409.18153
作者: Yuzheng Hu,Pingbang Hu,Han Zhao,Jiaqi W. Ma
关键词-EN: machine learning models, Influential Subset Selection, collective influence, attribute the behaviors, behaviors of machine
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:How can we attribute the behaviors of machine learning models to their training data? While the classic influence function sheds light on the impact of individual samples, it often fails to capture the more complex and pronounced collective influence of a set of samples. To tackle this challenge, we study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence. We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses. Our findings reveal that influence-based greedy heuristics, a dominant class of algorithms in MISS, can provably fail even in linear regression. We delineate the failure modes, including the errors of influence function and the non-additive structure of the collective influence. Conversely, we demonstrate that an adaptive version of these heuristics which applies them iteratively, can effectively capture the interactions among samples and thus partially address the issues. Experiments on real-world datasets corroborate these theoretical findings, and further demonstrate that the merit of adaptivity can extend to more complex scenarios such as classification tasks and non-linear neural networks. We conclude our analysis by emphasizing the inherent trade-off between performance and computational efficiency, questioning the use of additive metrics such as the linear datamodeling score, and offering a range of discussions.
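The abstract contrasts one-shot influence-based greedy heuristics with an adaptive variant that applies them iteratively. The following is an illustrative sketch only (not the paper's algorithm): an adaptive greedy heuristic for selecting an influential subset in linear regression, which retrains after each removal instead of summing one-shot influence scores.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): adaptively pick the k training
# points whose joint removal most changes a test prediction, retraining the
# linear model after every removal so that sample interactions are captured.

def fit_predict(X, y, x_test):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(x_test @ w)

def adaptive_greedy_miss(X, y, x_test, k):
    base = fit_predict(X, y, x_test)
    keep, removed = list(range(len(y))), []
    for _ in range(k):
        def shift(i):
            idx = [j for j in keep if j != i]
            return abs(fit_predict(X[idx], y[idx], x_test) - base)
        i_star = max(keep, key=shift)  # point whose removal moves the prediction most
        keep.remove(i_star)
        removed.append(i_star)
    return removed

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
subset = adaptive_greedy_miss(X, y, np.array([1.0, 1.0, 1.0]), k=3)
print(subset)
```

Exact retraining is affordable only for toy problems; the paper's point is that replacing it with additive one-shot influence scores can provably fail even in this linear setting.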

[LG-110] Reinforcement Learning for Finite Space Mean-Field Type Games

链接: https://arxiv.org/abs/2409.18152
作者: Kai Shao,Jiacheng Shen,Chijie An,Mathieu Laurière
关键词-EN: describe Nash equilibria, field type games, field type, continuum of cooperative, cooperative agents
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Mean field type games (MFTGs) describe Nash equilibria between large coalitions: each coalition consists of a continuum of cooperative agents who maximize the average reward of their coalition while interacting non-cooperatively with a finite number of other coalitions. Although the theory has been extensively developed, we are still lacking efficient and scalable computational methods. Here, we develop reinforcement learning methods for such games in a finite space setting with general dynamics and reward functions. We start by proving that the MFTG solution yields approximate Nash equilibria in finite-size coalition games. We then propose two algorithms. The first is based on quantization of the mean-field spaces and Nash Q-learning. We provide convergence and stability analysis. We then propose a deep reinforcement learning algorithm, which can scale to larger spaces. Numerical examples on 5 environments show the scalability and the efficiency of the proposed method.

[LG-111] Unconditional stability of a recurrent neural circuit implementing divisive normalization

链接: https://arxiv.org/abs/2409.18946
作者: Shivang Rawat,David J. Heeger,Stefano Martiniani
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

[LG-112] Simulating Dynamic Tumor Contrast Enhancement in Breast MRI using Conditional Generative Adversarial Networks

链接: https://arxiv.org/abs/2409.18872
作者: Richard Osuala,Smriti Joshi,Apostolia Tsirikoglou,Lidia Garrucho,Walter H.L. Pinaya,Daniel M. Lang,Julia A. Schnabel,Oliver Diaz,Karim Lekadir
关键词-EN: agent-based DCE-MRI acquisition, traditional contrast agent-based, promising non-invasive alternative, Scaled Aggregate Measure, paper presents
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a method for virtual contrast enhancement in breast MRI, offering a promising non-invasive alternative to traditional contrast agent-based DCE-MRI acquisition. Using a conditional generative adversarial network, we predict DCE-MRI images, including jointly-generated sequences of multiple corresponding DCE-MRI timepoints, from non-contrast-enhanced MRIs, enabling tumor localization and characterization without the associated health risks. Furthermore, we qualitatively and quantitatively evaluate the synthetic DCE-MRI images, proposing a multi-metric Scaled Aggregate Measure (SAMe), assessing their utility in a tumor segmentation downstream task, and conclude with an analysis of the temporal patterns in multi-sequence DCE-MRI generation. Our approach demonstrates promising results in generating realistic and useful DCE-MRI sequences, highlighting the potential of virtual contrast enhancement for improving breast cancer diagnosis and treatment, particularly for patients where contrast agent administration is contraindicated.

[LG-113] Positional Encoder Graph Quantile Neural Networks for Geographic Data

链接: https://arxiv.org/abs/2409.18865
作者: William E. R. de Amorim,Scott A. Sisson,T. Rodrigues,David J. Nott,Guilherme S. Rodrigues
关键词-EN: Positional Encoder Graph, Encoder Graph Neural, Graph Neural Networks, Graph Quantile Neural, Encoder Graph Quantile
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 17 main text pages, 4 figures

点击查看摘要

Abstract:Positional Encoder Graph Neural Networks (PE-GNNs) are a leading approach for modeling continuous spatial data. However, they often fail to produce calibrated predictive distributions, limiting their effectiveness for uncertainty quantification. We introduce the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel method that integrates PE-GNNs, Quantile Neural Networks, and recalibration techniques in a fully nonparametric framework, requiring minimal assumptions about the predictive distributions. We propose a new network architecture that, when combined with a quantile-based loss function, yields accurate and reliable probabilistic models without increasing computational complexity. Our approach provides a flexible, robust framework for conditional density estimation, applicable beyond spatial data contexts. We further introduce a structured method for incorporating a KNN predictor into the model while avoiding data leakage through the GNN layer operation. Experiments on benchmark datasets demonstrate that PE-GQNN significantly outperforms existing state-of-the-art methods in both predictive accuracy and uncertainty quantification.
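Quantile networks such as the PE-GQNN described above are trained with a quantile-based loss. As background, here is a minimal sketch of the standard pinball (quantile) loss; the paper's specific architecture and recalibration steps are not reproduced.

```python
import numpy as np

# Minimal sketch of the pinball (quantile) loss used to train quantile models:
# underestimates are penalized with weight tau, overestimates with weight 1 - tau,
# so minimizing it targets the tau-quantile of the conditional distribution.

def pinball_loss(y_true, y_pred, tau):
    diff = y_true - y_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1) * diff)))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(pinball_loss(y_true, y_pred, 0.9))  # ≈ 0.3167
```

Evaluating the loss at many levels tau (e.g. 0.05, 0.5, 0.95) yields a full predictive distribution, which is what enables the calibrated uncertainty quantification the abstract emphasizes.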

[LG-114] Classical Statistical (In-Sample) Intuitions Don't Generalize Well: A Note on Bias-Variance Tradeoffs Overfitting and Moving from Fixed to Random Designs

链接: https://arxiv.org/abs/2409.18842
作者: Alicia Curth
关键词-EN: statisticians feeling uneasy, classically trained statisticians, trained statisticians feeling, modern machine learning, statistical intuitions conveyed
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The sudden appearance of modern machine learning (ML) phenomena like double descent and benign overfitting may leave many classically trained statisticians feeling uneasy – these phenomena appear to go against the very core of statistical intuitions conveyed in any introductory class on learning from data. The historical lack of earlier observation of such phenomena is usually attributed to today’s reliance on more complex ML methods, overparameterization, interpolation and/or higher data dimensionality. In this note, we show that there is another reason why we observe behaviors today that appear at odds with intuitions taught in classical statistics textbooks, which is much simpler to understand yet rarely discussed explicitly. In particular, many intuitions originate in fixed design settings, in which in-sample prediction error (under resampling of noisy outcomes) is of interest, while modern ML evaluates its predictions in terms of generalization error, i.e. out-of-sample prediction error in random designs. Here, we highlight that this simple move from fixed to random designs has (perhaps surprisingly) far-reaching consequences on textbook intuitions relating to the bias-variance tradeoff, and comment on the resulting (im)possibility of observing double descent and benign overfitting in fixed versus random designs.
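The fixed-versus-random-design distinction in the note can be made concrete with a toy simulation, under simple assumptions: the same OLS estimator is scored on resampled noisy outcomes at the training inputs (fixed design, in-sample error) versus on freshly drawn inputs (random design, out-of-sample generalization error).

```python
import numpy as np

# Toy illustration of the fixed- vs random-design distinction for OLS.
# Fixed design: evaluate at the *same* inputs X with fresh noise.
# Random design: evaluate at newly drawn inputs.

rng = np.random.default_rng(0)
n, d, sigma = 50, 5, 0.5
w_true = rng.normal(size=d)

def ols_error(X_train, X_eval):
    y_train = X_train @ w_true + sigma * rng.normal(size=len(X_train))
    w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    y_eval = X_eval @ w_true + sigma * rng.normal(size=len(X_eval))
    return np.mean((X_eval @ w_hat - y_eval) ** 2)

X = rng.normal(size=(n, d))
err_fixed = np.mean([ols_error(X, X) for _ in range(200)])
err_random = np.mean([ols_error(X, rng.normal(size=(n, d))) for _ in range(200)])
print(err_fixed, err_random)  # the random-design error is typically slightly larger
```

The gap between the two quantities is exactly the kind of discrepancy the note argues classical in-sample intuitions fail to anticipate.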

[LG-115] Constructing Confidence Intervals for the Generalization Error – a Comprehensive Benchmark Study

链接: https://arxiv.org/abs/2409.18836
作者: Hannah Schulz-Kümpel,Sebastian Fischer,Thomas Nagler,Anne-Laure Boulesteix,Bernd Bischl,Roman Hornung
关键词-EN: measures predictive performance, confidence intervals, machine learning, predictive performance, crucial tool
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct the first large-scale study comparing CIs for the generalization error - empirically evaluating 13 different methods on a total of 18 tabular regression and classification problems, using four different inducers and a total of eight loss functions. We give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error and provide a concise review of all 13 methods in a unified framework. Finally, the CI methods are evaluated in terms of their relative coverage frequency, width, and runtime. Based on these findings, we are able to identify a subset of methods that we would recommend. We also publish the datasets as a benchmarking suite on OpenML and our code on GitHub to serve as a basis for further studies.
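As a concrete illustration of one of the many CI constructions such a benchmark compares, here is a minimal sketch of a naive t-interval over per-fold cross-validation losses; this simple method is known to be miscalibrated in general, which is precisely what large-scale benchmarks like the one above probe.

```python
import math
import statistics

# Naive CV confidence interval: treat the 10 per-fold losses as i.i.d. and
# build a t-interval around their mean (a common but often miscalibrated choice).

def cv_t_interval(fold_losses, t_crit=2.262):  # t_{0.975, df=9} for 10 folds
    m = statistics.mean(fold_losses)
    se = statistics.stdev(fold_losses) / math.sqrt(len(fold_losses))
    return m - t_crit * se, m, m + t_crit * se

losses = [0.21, 0.25, 0.19, 0.23, 0.22, 0.26, 0.20, 0.24, 0.22, 0.23]
lo, m, hi = cv_t_interval(losses)
print(f"{lo:.3f} < {m:.3f} < {hi:.3f}")
```

The i.i.d. assumption is violated because CV folds share training data, which is one source of the coverage problems the study measures empirically.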

[LG-116] Early diagnosis of Alzheimers disease from MRI images with deep learning model

链接: https://arxiv.org/abs/2409.18814
作者: Sajjad Aghasi Javid,Mahmood Mohassel Feghhi
关键词-EN: worldwide is Alzheimer, Alzheimer disease, Minority Oversampling Technique, Alzheimer, Synthetic Minority Oversampling
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, Presented at the 20-th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP) 21-22 February, 2024, Mazandaran University of Science and Technology, Babol, Iran

点击查看摘要

Abstract:It is acknowledged that the most common cause of dementia worldwide is Alzheimer’s disease (AD). This condition progresses in severity from mild to severe and interferes with people’s everyday routines. Early diagnosis plays a critical role in patient care and clinical trials. Convolutional neural networks (CNNs) are used to create a framework for identifying specific disease features from MRI scans. Classification of dementia involves approaches such as medical history review, neuropsychological tests, and magnetic resonance imaging (MRI). However, the image dataset obtained from Kaggle faces a significant issue of class imbalance, which requires equal distribution of samples from each class to address. In this article, to address this imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is utilized. Furthermore, a pre-trained convolutional neural network has been applied to the DEMNET dementia network to extract key features from AD images. The proposed model achieved an impressive accuracy of 98.67%.
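The article applies the standard SMOTE to balance the classes; as background, here is a minimal SMOTE-style oversampling sketch (a toy version, not the exact implementation used in the paper) that interpolates each synthetic sample between a random minority point and one of its nearest minority neighbors.

```python
import numpy as np

# Toy SMOTE-style oversampling: each synthetic point lies on the segment between
# a random minority sample and one of its k nearest minority neighbors.

def smote_oversample(X_min, n_new, k=3, rng=None):
    rng = rng or np.random.default_rng(0)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                       # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_oversample(X_min, n_new=6)
print(X_new.shape)  # (6, 2)
```

Because every synthetic point is a convex combination of two minority samples, the new points stay inside the minority class's convex hull rather than duplicating existing samples.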

[LG-117] Convergence of Diffusion Models Under the Manifold Hypothesis in High-Dimensions

链接: https://arxiv.org/abs/2409.18804
作者: Iskander Azangulov,George Deligiannidis,Judith Rousseau
关键词-EN: Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, generate synthetic data, high-dimensional data distributions, Denoising Diffusion
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Denoising Diffusion Probabilistic Models (DDPM) are powerful state-of-the-art methods used to generate synthetic data from high-dimensional data distributions and are widely used for image, audio and video generation as well as many more applications in science and beyond. The manifold hypothesis states that high-dimensional data often lie on lower-dimensional manifolds within the ambient space, and is widely believed to hold in provided examples. While recent results have provided invaluable insight into how diffusion models adapt to the manifold hypothesis, they do not capture the great empirical success of these models, making this a very fruitful research direction. In this work, we study DDPMs under the manifold hypothesis and prove that they achieve rates independent of the ambient dimension in terms of learning the score. In terms of sampling, we obtain rates independent of the ambient dimension w.r.t. the Kullback-Leibler divergence, and O(\sqrt{D}) w.r.t. the Wasserstein distance. We do this by developing a new framework connecting diffusion models to the well-studied theory of extrema of Gaussian Processes.

[LG-118] Geometric deep learning for galaxy-halo connection: a case study for galaxy intrinsic alignments

链接: https://arxiv.org/abs/2409.18761
作者: Yesukhei Jagvaral,Francois Lanusse,Rachel Mandelbaum
关键词-EN: Rubin Observatory LSST, Forthcoming cosmological imaging, Observatory LSST, Rubin Observatory, cosmological imaging surveys
类目: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures. submitted to MNRAS

点击查看摘要

Abstract:Forthcoming cosmological imaging surveys, such as the Rubin Observatory LSST, require large-scale simulations encompassing realistic galaxy populations for a variety of scientific applications. Of particular concern is the phenomenon of intrinsic alignments (IA), whereby galaxies orient themselves towards overdensities, potentially introducing significant systematic biases in weak gravitational lensing analyses if they are not properly modeled. Due to computational constraints, simulating the intricate details of galaxy formation and evolution relevant to IA across vast volumes is impractical. As an alternative, we propose a Deep Generative Model trained on the IllustrisTNG-100 simulation to sample 3D galaxy shapes and orientations to accurately reproduce intrinsic alignments along with correlated scalar features. We model the cosmic web as a set of graphs, each graph representing a halo with nodes representing the subhalos/galaxies. The architecture consists of an SO(3) \times \mathbb{R}^n diffusion generative model, for galaxy orientations and n scalars, implemented with E(3) equivariant Graph Neural Networks that explicitly respect the Euclidean symmetries of our Universe. The model is able to learn and predict features such as galaxy orientations that are statistically consistent with the reference simulation. Notably, our model demonstrates the ability to jointly model Euclidean-valued scalars (galaxy sizes, shapes, and colors) along with non-Euclidean valued SO(3) quantities (galaxy orientations) that are governed by highly complex galactic physics at non-linear scales.

[LG-119] MG-Net: Learn to Customize QAOA with Circuit Depth Awareness

链接: https://arxiv.org/abs/2409.18692
作者: Yang Qian,Xinbiao Wang,Yuxuan Du,Yong Luo,Dacheng Tao
关键词-EN: Approximate Optimization Algorithm, Quantum Approximate Optimization, combinatorial optimization challenges, tackling combinatorial optimization, variants exhibit immense
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 29 pages, 16 figures

点击查看摘要

Abstract:Quantum Approximate Optimization Algorithm (QAOA) and its variants exhibit immense potential in tackling combinatorial optimization challenges. However, their practical realization confronts a dilemma: the requisite circuit depth for satisfactory performance is problem-specific and often exceeds the maximum capability of current quantum devices. To address this dilemma, here we first analyze the convergence behavior of QAOA, uncovering the origins of this dilemma and elucidating the intricate relationship between the employed mixer Hamiltonian, the specific problem at hand, and the permissible maximum circuit depth. Harnessing this understanding, we introduce the Mixer Generator Network (MG-Net), a unified deep learning framework adept at dynamically formulating optimal mixer Hamiltonians tailored to distinct tasks and circuit depths. Systematic simulations, encompassing Ising models and weighted Max-Cut instances with up to 64 qubits, substantiate our theoretical findings, highlighting MG-Net’s superior performance in terms of both approximation ratio and efficiency.

[LG-120] Towards Integrating Epistemic Uncertainty Estimation into the Radiotherapy Workflow

链接: https://arxiv.org/abs/2409.18628
作者: Marvin Tom Teichmann,Manasi Datar,Lisa Kratzke,Fernando Vega,Florin C. Ghesu
关键词-EN: contouring target structures, ensuring treatment efficacy, epistemic uncertainty estimation, uncertainty estimation, OOD detection
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Keywords: Epistemic Uncertainty - Out-of-Distribution Detection - CT Segmentation - OAR contouring - Radiotherapy

点击查看摘要

Abstract:The precision of contouring target structures and organs-at-risk (OAR) in radiotherapy planning is crucial for ensuring treatment efficacy and patient safety. Recent advancements in deep learning (DL) have significantly improved OAR contouring performance, yet the reliability of these models, especially in the presence of out-of-distribution (OOD) scenarios, remains a concern in clinical settings. This application study explores the integration of epistemic uncertainty estimation within the OAR contouring workflow to enable OOD detection in clinically relevant scenarios, using specifically compiled data. Furthermore, we introduce an advanced statistical method for OOD detection to enhance the methodological framework of uncertainty estimation. Our empirical evaluation demonstrates that epistemic uncertainty estimation is effective in identifying instances where model predictions are unreliable and may require an expert review. Notably, our approach achieves an AUC-ROC of 0.95 for OOD detection, with a specificity of 0.95 and a sensitivity of 0.92 for implant cases, underscoring its efficacy. This study addresses significant gaps in the current research landscape, such as the lack of ground truth for uncertainty estimation and limited empirical evaluations. Additionally, it provides a clinically relevant application of epistemic uncertainty estimation in an FDA-approved and widely used clinical solution for OAR segmentation from Varian, a Siemens Healthineers company, highlighting its practical benefits.
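As a generic illustration of the idea behind epistemic uncertainty estimation (not the paper's specific method), disagreement between the members of a model ensemble yields high predictive entropy on out-of-distribution inputs, which can then be thresholded for OOD detection.

```python
import numpy as np

# Sketch of entropy-based epistemic uncertainty from a model ensemble:
# average the ensemble's softmax outputs and compute the entropy of the mean.
# High entropy indicates disagreement, a proxy for out-of-distribution inputs.

def predictive_entropy(member_probs):
    """member_probs: (n_members, n_classes) softmax outputs for one input."""
    p = member_probs.mean(axis=0)                  # ensemble-averaged prediction
    return float(-np.sum(p * np.log(p + 1e-12)))   # small eps avoids log(0)

in_dist = np.array([[0.95, 0.05], [0.90, 0.10], [0.97, 0.03]])  # members agree
ood     = np.array([[0.90, 0.10], [0.20, 0.80], [0.55, 0.45]])  # members disagree
print(predictive_entropy(in_dist) < predictive_entropy(ood))  # True
```

Metrics such as the AUC-ROC of 0.95 reported above are computed by scoring many inputs this way and asking how well the uncertainty score separates in-distribution from OOD cases.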

[LG-121] Robustness of AI-based weather forecasts in a changing climate

链接: https://arxiv.org/abs/2409.18529
作者: Thomas Rackow,Nikolay Koldunov,Christian Lessig,Irina Sandu,Mihai Alexe,Matthew Chantry,Mariana Clare,Jesper Dramsch,Florian Pappenberger,Xabier Pedruzo-Bagazgoitia,Steffen Tietsche,Thomas Jung
关键词-EN: machine learning models, made transformational progress, machine learning, Data-driven machine learning, learning models
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:Data-driven machine learning models for weather forecasting have made transformational progress in the last 1-2 years, with state-of-the-art ones now outperforming the best physics-based models for a wide range of skill scores. Given the strong links between weather and climate modelling, this raises the question whether machine learning models could also revolutionize climate science, for example by informing mitigation and adaptation to climate change or to generate larger ensembles for more robust uncertainty estimates. Here, we show that current state-of-the-art machine learning models trained for weather forecasting in present-day climate produce skillful forecasts across different climate states corresponding to pre-industrial, present-day, and future 2.9K warmer climates. This indicates that the dynamics shaping the weather on short timescales may not differ fundamentally in a changing climate. It also demonstrates out-of-distribution generalization capabilities of the machine learning models that are a critical prerequisite for climate applications. Nonetheless, two of the models show a global-mean cold bias in the forecasts for the future warmer climate state, i.e. they drift towards the colder present-day climate they have been trained for. A similar result is obtained for the pre-industrial case where two out of three models show a warming. We discuss possible remedies for these biases and analyze their spatial distribution, revealing complex warming and cooling patterns that are partly related to missing ocean-sea ice and land surface information in the training data. Despite these current limitations, our results suggest that data-driven machine learning models will provide powerful tools for climate science and transform established approaches by complementing conventional physics-based models.

[LG-122] Med-IC: Fusing a Single Layer Involution with Convolutions for Enhanced Medical Image Classification and Segmentation

链接: https://arxiv.org/abs/2409.18506
作者: Md. Farhadul Islam,Sarah Zabeen,Meem Arafat Manab,Mohammad Rakibul Hasan Mahin,Joyanta Jyoti Mondal,Md. Tanzim Reza,Md Zahidul Hasan,Munima Haque,Farig Sadeque,Jannatun Noor
关键词-EN: similar characteristics, resemble cells, majority of medical, medical images, cell region
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, 4 tables, preprint submitted to an Elsevier journal

点击查看摘要

Abstract:The majority of medical images, especially those that resemble cells, have similar characteristics. These images, which occur in a variety of shapes, often show abnormalities in the organ or cell region. The convolution operation possesses a restricted capability to extract visual patterns across several spatial regions of an image. The involution process, which is the inverse operation of convolution, complements this inherent lack of spatial information extraction present in convolutions. In this study, we investigate how applying a single layer of involution prior to a convolutional neural network (CNN) architecture can significantly improve classification and segmentation performance, with a comparatively negligible amount of weight parameters. The study additionally shows how excessive use of involution layers might result in inaccurate predictions in a particular type of medical image. According to our findings from experiments, the strategy of adding only a single involution layer before a CNN-based model outperforms most of the previous works.
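To make the convolution/involution contrast concrete, here is a deliberately simplified, single-channel involution sketch. It assumes the kernel at each pixel is generated from that pixel's own value by a tiny linear map; real involution layers generate channel-shared kernels from full feature vectors, so this is an illustration of the position-specific-kernel idea only.

```python
import numpy as np

# Simplified single-channel involution: unlike convolution's one shared kernel,
# the kernel here is generated per position from the input itself (toy linear
# generator: kernel = center_value * w_gen), then applied over a KxK neighborhood.

def involution2d(x, w_gen, K=3):
    """x: (H, W) feature map; w_gen: (K*K,) weights generating the kernel."""
    H, W = x.shape
    pad = K // 2
    xp = np.pad(x, pad)                            # zero-pad to keep the size
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            kernel = (x[i, j] * w_gen).reshape(K, K)  # position-specific kernel
            patch = xp[i:i + K, j:j + K]
            out[i, j] = np.sum(kernel * patch)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
y = involution2d(x, np.ones(9) / 9.0)
print(y.shape)  # (4, 4)
```

Because the kernel varies with position but is cheap to generate, a single such layer can inject spatial adaptivity before a standard CNN, which is the design the abstract investigates.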

[LG-123] WHOMP: Optimizing Randomized Controlled Trials via Wasserstein Homogeneity

链接: https://arxiv.org/abs/2409.18504
作者: Shizhou Xu,Thomas Strohmer
关键词-EN: Wasserstein Homogeneity Partition, maximize diversity, minimizing dissimilarity, Wasserstein Homogeneity, Homogeneity Partition
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 46 pages, 3 figures

点击查看摘要

Abstract:We investigate methods for partitioning datasets into subgroups that maximize diversity within each subgroup while minimizing dissimilarity across subgroups. We introduce a novel partitioning method called the Wasserstein Homogeneity Partition (WHOMP), which optimally minimizes type I and type II errors that often result from imbalanced group splitting or partitioning, commonly referred to as accidental bias, in comparative and controlled trials. We conduct an analytical comparison of WHOMP against existing partitioning methods, such as random subsampling, covariate-adaptive randomization, rerandomization, and anti-clustering, demonstrating its advantages. Moreover, we characterize the optimal solutions to the WHOMP problem and reveal an inherent trade-off between the stability of subgroup means and variances among these solutions. Based on our theoretical insights, we design algorithms that not only obtain these optimal solutions but also equip practitioners with tools to select the desired trade-off. Finally, we validate the effectiveness of WHOMP through numerical experiments, highlighting its superiority over traditional methods.

[LG-124] Scientific Machine Learning Seismology

链接: https://arxiv.org/abs/2409.18397
作者: Tomohisa Okazaki
关键词-EN: Scientific machine learning, Scientific machine, interdisciplinary research field, machine learning, theory to understand
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: English translation of the manuscript submitted to Zisin (Journal of the Seismological Society of Japan)

点击查看摘要

Abstract:Scientific machine learning (SciML) is an interdisciplinary research field that integrates machine learning, particularly deep learning, with physics theory to understand and predict complex natural phenomena. By incorporating physical knowledge, SciML reduces the dependency on observational data, which is often limited in the natural sciences. In this article, the fundamental concepts of SciML, its applications in seismology, and prospects are described. Specifically, two popular methods are mainly discussed: physics-informed neural networks (PINNs) and neural operators (NOs). PINNs can address both forward and inverse problems by incorporating governing laws into the loss functions. The use of PINNs is expanding into areas such as simultaneous solutions of differential equations, inference in underdetermined systems, and regularization based on physics. These research directions would broaden the scope of deep learning in natural sciences. NOs are models designed for operator learning, which deals with relationships between infinite-dimensional spaces. NOs show promise in modeling the time evolution of complex systems based on observational or simulation data. Since large amounts of data are often required, combining NOs with physics-informed learning holds significant potential. Finally, SciML is considered from a broader perspective beyond deep learning: statistical (or mathematical) frameworks that integrate observational data with physical principles to model natural phenomena. In seismology, mathematically rigorous Bayesian statistics has been developed over the past decades, whereas more flexible and scalable deep learning has only emerged recently. Both approaches can be considered as part of SciML in a broad sense. Theoretical and practical insights in both directions would advance SciML methodologies and thereby deepen our understanding of earthquake phenomena.

[LG-125] Adaptive Learning of the Latent Space of Wasserstein Generative Adversarial Networks

链接: https://arxiv.org/abs/2409.18374
作者: Yixuan Qiu,Qingyi Gao,Xiao Wang
关键词-EN: Generative models based, latent Wasserstein GAN, generative adversarial networks, Wasserstein GAN, intrinsic dimension
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Generative models based on latent variables, such as generative adversarial networks (GANs) and variational auto-encoders (VAEs), have attracted considerable interest due to their impressive performance in many fields. However, many data such as natural images usually do not populate the ambient Euclidean space but instead reside on a lower-dimensional manifold. Thus an inappropriate choice of the latent dimension fails to uncover the structure of the data, possibly resulting in mismatched latent representations and poor generative quality. To address these problems, we propose a novel framework called the latent Wasserstein GAN (LWGAN) that fuses the Wasserstein auto-encoder and the Wasserstein GAN so that the intrinsic dimension of the data manifold can be adaptively learned by a modified informative latent distribution. We prove that there exist an encoder network and a generator network such that the intrinsic dimension of the learned encoding distribution is equal to the dimension of the data manifold. We theoretically establish that our estimated intrinsic dimension is a consistent estimate of the true dimension of the data manifold. Meanwhile, we provide an upper bound on the generalization error of LWGAN, implying that we force the synthetic data distribution to be similar to the real data distribution from a population perspective. Comprehensive empirical experiments verify our framework and show that LWGAN is able to identify the correct intrinsic dimension under several scenarios, and simultaneously generate high-quality synthetic data by sampling from the learned latent distribution.

[LG-126] A model-constrained Discontinuous Galerkin Network (DGNet) for Compressible Euler Equations with Out-of-Distribution Generalization

链接: https://arxiv.org/abs/2409.18371
作者: Hai Van Nguyen(1),Jau-Uei Chen(1),William Cole Nockolds(2),Wesley Lao(2),Tan Bui-Thanh(1 and 2) ((1) Department of Aerospace Engineering and Engineering Mechanics, the University of Texas at Austin, Texas (2) The Oden Institute for Computational Engineering and Sciences, the University of Texas at Austin, Texas)
关键词-EN: Real-time accurate solutions, digital twin contexts, complex dynamical systems, large-scale complex dynamical, Real-time accurate
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Real-time accurate solutions of large-scale complex dynamical systems are critically needed for control, optimization, uncertainty quantification, and decision-making in practical engineering and science applications, particularly in digital twin contexts. In this work, we develop a model-constrained discontinuous Galerkin Network (DGNet) approach, an extension of our previous work [Model-constrained Tangent Slope Learning Approach for Dynamical Systems], for compressible Euler equations with out-of-distribution generalization. The core of DGNet is the synergy of several key strategies: (i) leveraging time integration schemes to capture temporal correlation and taking advantage of neural network speed for computation time reduction; (ii) employing a model-constrained approach to ensure the learned tangent slope satisfies governing equations; (iii) utilizing a GNN-inspired architecture where edges represent Riemann solver surrogate models and nodes represent volume integration correction surrogate models, enabling discontinuity-capturing capacity, aliasing error reduction, and mesh discretization generalizability; (iv) implementing the input normalization technique that allows surrogate models to generalize across different initial conditions, boundary conditions, and solution orders; and (v) incorporating a data randomization technique that not only implicitly promotes agreement between surrogate models and true numerical models up to second-order derivatives, ensuring long-term stability and prediction capacity, but also serves as a data generation engine during training, leading to enhanced generalization on unseen data. To validate the effectiveness, stability, and generalizability of our novel DGNet approach, we present comprehensive numerical results for 1D and 2D compressible Euler equation problems.

[LG-127] AQMLator – An Auto Quantum Machine Learning E-Platform

链接: https://arxiv.org/abs/2409.18338
作者: Tomasz Rybotycki,Piotr Gawron
关键词-EN: successful Machine Learning, Quantum Machine Learning, model implementation requires, Machine Learning, training procedure
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 15 pages, 3 figures, links to software in the text

点击查看摘要

Abstract:A successful Machine Learning (ML) model implementation requires three main components: a training dataset, a suitable model architecture, and a training procedure. Given a dataset and task, finding an appropriate model might be challenging. AutoML, a branch of ML, focuses on automatic architecture search – a meta-method that aims at removing the human from the ML system design process. The success of ML and the development of quantum computing (QC) in recent years have led to the birth of a fascinating new field called Quantum Machine Learning (QML) that, amongst others, incorporates quantum computers into ML models. In this paper we present AQMLator, an Auto Quantum Machine Learning platform that aims to automatically propose and train the quantum layers of an ML model with minimal input from the user. This way, data scientists can bypass the entry barrier for QC and use QML. AQMLator uses standard ML libraries, making it easy to introduce into existing ML pipelines.

[LG-128] A Framework for Standardizing Similarity Measures in a Rapidly Evolving Field

链接: https://arxiv.org/abs/2409.18333
作者: Nathan Cloos,Guangyu Robert Yang,Christopher J. Cueva
关键词-EN: Similarity measures, biological systems, Similarity, Centered Kernel Alignment, artificial and biological
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 11 pages, 9 figures

点击查看摘要

Abstract:Similarity measures are fundamental tools for quantifying the alignment between artificial and biological systems. However, the diversity of similarity measures and their varied naming and implementation conventions makes it challenging to compare across studies. To facilitate comparisons and make explicit the implementation choices underlying a given code package, we have created and are continuing to develop a Python repository that benchmarks and standardizes similarity measures. The goal of creating a consistent naming convention that uniquely and efficiently specifies a similarity measure is not trivial as, for example, even commonly used methods like Centered Kernel Alignment (CKA) have at least 12 different variations, and this number will likely continue to grow as the field evolves. For this reason, we do not advocate for a fixed, definitive naming convention. The landscape of similarity measures and best practices will continue to change and so we see our current repository, which incorporates approximately 100 different similarity measures from 14 packages, as providing a useful tool at this snapshot in time. To accommodate the evolution of the field we present a framework for developing, validating, and refining naming conventions with the goal of uniquely and efficiently specifying similarity measures, ultimately making it easier for the community to make comparisons across studies.
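As a concrete illustration of why naming conventions matter, one widely used variant of the measure mentioned above, linear CKA on centered feature matrices, can be written in a few lines. This is a generic sketch of that one variant, not the repository's implementation:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices of shape (n_samples, n_features). This is only one of the
    many CKA variants the abstract alludes to; packages differ in
    centering and normalization conventions."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
print(round(linear_cka(A, A), 6))        # identical representations -> 1.0
print(round(linear_cka(A, 3.0 * A), 6))  # invariant to isotropic scaling -> 1.0
```

The scaling invariance shown in the last line is exactly the kind of property that varies between variants (e.g., kernel CKA or debiased estimators behave differently), which is why a unique, efficient naming scheme is non-trivial.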

[LG-129] Local Prediction-Powered Inference

链接: https://arxiv.org/abs/2409.18321
作者: Yanwu Gu,Dong Xia
关键词-EN: assign higher weights, called local polynomial, points closer, infer a function, essential to assign
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:To infer a function value at a specific point x, it is essential to assign higher weights to the points closer to x; this is the idea behind local polynomial / multivariable regression. In many practical cases, a limited sample size may ruin this method, but such conditions can be improved by the Prediction-Powered Inference (PPI) technique. This paper introduces a specific algorithm for local multivariable regression using PPI, which can significantly reduce the variance of estimations without enlarging the error. The confidence intervals, bias correction, and coverage probabilities are analyzed, establishing the correctness and superiority of our algorithm. Numerical simulations and real-data experiments support these conclusions. Another contribution compared to PPI is the improved theoretical computational efficiency and explainability obtained by taking into account the dependency structure of the dependent variable.
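The base method the abstract builds on, local multivariable regression, weights samples by proximity to the query point and solves a weighted least-squares problem. The sketch below shows only that base step, not the paper's PPI augmentation; the Gaussian kernel and bandwidth are illustrative choices:

```python
import numpy as np

def local_linear_predict(x0, X, y, bandwidth=0.5):
    """Predict f(x0) by weighted least squares, assigning higher weight
    to samples closer to x0 via a Gaussian kernel (illustrative choice)."""
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    # Weighted least squares with an intercept column: scale rows by sqrt(w).
    A = np.hstack([np.ones((len(X), 1)), X])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
    return np.array([1.0, *x0]) @ beta

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]  # exactly linear, noise-free data
print(round(local_linear_predict(np.array([0.2, -0.3]), X, y), 3))  # 2.9
```

On noise-free linear data any weighting recovers the true coefficients; the PPI contribution of the paper is precisely to keep this estimator usable when the local sample size is too small.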

[LG-130] Deep-ER: Deep Learning ECCENTRIC Reconstruction for fast high-resolution neurometabolic imaging

链接: https://arxiv.org/abs/2409.18303
作者: Paul Weiser,Georg Langs,Wolfgang Bogner,Stanislav Motyka,Bernhard Strasser,Polina Golland,Nalini Singh,Jorg Dietrich,Erik Uhlmann,Tracy Batchelor,Daniel Cahill,Malte Hoffmann,Antoine Klauser,Ovidiu C. Andronesi
关键词-EN: Magnetic Resonance Spectroscopic, Resonance Spectroscopic Imaging, Magnetic Resonance, Resonance Spectroscopic, important pathological mechanism
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Introduction: Altered neurometabolism is an important pathological mechanism in many neurological diseases and brain cancer, which can be mapped non-invasively by Magnetic Resonance Spectroscopic Imaging (MRSI). Advanced MRSI using non-cartesian compressed-sense acquisition enables fast high-resolution metabolic imaging but has lengthy reconstruction times that limit throughput and require expert user interaction. Here, we present a robust and efficient Deep Learning reconstruction to obtain high-quality metabolic maps. Methods: Fast high-resolution whole-brain metabolic imaging was performed at 3.4 mm^3 isotropic resolution with acquisition times between 4:11-9:21 min:s using the ECCENTRIC pulse sequence on a 7T MRI scanner. Data were acquired in a high-resolution phantom and 27 human participants, including 22 healthy volunteers and 5 glioma patients. A deep neural network using recurring interlaced convolutional layers with joint dual-space feature representation was developed for deep learning ECCENTRIC reconstruction (Deep-ER). 21 subjects were used for training and 6 subjects for testing. Deep-ER performance was compared to conventional iterative Total Generalized Variation reconstruction using image and spectral quality metrics. Results: Deep-ER demonstrated 600-fold faster reconstruction than conventional methods, providing improved spatial-spectral quality and metabolite quantification with 12%-45% (P<0.05) higher signal-to-noise and 8%-50% (P<0.05) smaller Cramer-Rao lower bounds. Metabolic images clearly visualize glioma tumor heterogeneity and boundary. Conclusion: Deep-ER provides efficient and robust reconstruction for sparse-sampled MRSI. The accelerated acquisition-reconstruction MRSI is compatible with a high-throughput imaging workflow. It is expected that such improved performance will facilitate basic and clinical MRSI applications.

[LG-131] Predicting Muscle Thickness Deformation from Muscle Activation Patterns: A Dual-Attention Framework

链接: https://arxiv.org/abs/2409.18266
作者: Bangyu Lan,Kenan Niu
关键词-EN: diagnosing muscle-related diseases, Understanding the relationship, critical for diagnosing, diagnosing muscle-related, muscle-related diseases
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the relationship between muscle activation and thickness deformation is critical for diagnosing muscle-related diseases and monitoring muscle health. Although ultrasound can measure muscle thickness change during muscle movement, its application in portable devices is limited by wiring and data collection challenges. Surface electromyography (sEMG), on the other hand, records muscle bioelectrical signals as the muscle activation. This paper introduces a deep-learning approach that leverages sEMG signals for muscle thickness deformation prediction, eliminating the need for ultrasound measurement. Using a dual-attention framework combining self-attention and cross-attention mechanisms, this method predicts muscle deformation directly from sEMG data. Experimental results with six healthy subjects showed that the approach accurately predicts muscle excursion with an average precision of 0.923 \pm 0.900mm, indicating that this method can facilitate real-time portable muscle health monitoring, with potential applications in clinical diagnostics, sports science, and rehabilitation.

[LG-132] A Unified View on Learning Unnormalized Distributions via Noise-Contrastive Estimation

链接: https://arxiv.org/abs/2409.18209
作者: J. Jon Ryu,Abhin Shah,Gregory W. Wornell
关键词-EN: learning unnormalized distributions, unnormalized distributions, learning unnormalized, noise-contrastive estimation, paper studies
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 35 pages

点击查看摘要

Abstract:This paper studies a family of estimators based on noise-contrastive estimation (NCE) for learning unnormalized distributions. The main contribution of this work is to provide a unified perspective on various methods for learning unnormalized distributions, which have been independently proposed and studied in separate research communities, through the lens of NCE. This unified view offers new insights into existing estimators. Specifically, for exponential families, we establish the finite-sample convergence rates of the proposed estimators under a set of regularity assumptions, most of which are new.
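As background for the unified view above: in its basic form, NCE fits an unnormalized density by logistically discriminating data samples from noise samples, treating the log-normalizer as a free parameter. A minimal 1-D sketch under an illustrative Gaussian setup (the model, noise distribution, and parameterization here are our assumptions, not the paper's):

```python
import numpy as np

def nce_loss(theta, data, noise, log_pn):
    """Noise-contrastive estimation as logistic discrimination between
    data samples (label 1) and noise samples (label 0). The model is an
    unnormalized Gaussian, log p_theta(x) = -(x - mu)^2/2 + c, with the
    log-normalizer c learned as a free parameter (illustrative choice)."""
    mu, c = theta
    log_pm = lambda x: -(x - mu) ** 2 / 2.0 + c
    g_data = log_pm(data) - log_pn(data)     # classifier logit on data
    g_noise = log_pm(noise) - log_pn(noise)  # classifier logit on noise
    # -E_data[log sigma(G)] - E_noise[log(1 - sigma(G))], computed stably.
    return np.mean(np.logaddexp(0.0, -g_data)) + np.mean(np.logaddexp(0.0, g_noise))

rng = np.random.default_rng(2)
data = rng.normal(1.0, 1.0, size=5000)   # samples from the true model N(1, 1)
noise = rng.normal(0.0, 2.0, size=5000)  # wider Gaussian noise distribution
log_pn = lambda x: -x ** 2 / 8.0 - np.log(2.0 * np.sqrt(2.0 * np.pi))
good = nce_loss((1.0, -0.5 * np.log(2.0 * np.pi)), data, noise, log_pn)
bad = nce_loss((-3.0, 0.0), data, noise, log_pn)
print(good < bad)  # the true parameters achieve a lower loss -> True
```

The paper's contribution is to show that several independently proposed estimators for unnormalized models can all be read as instances of this discrimination objective.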

[LG-133] Loop-Diffusion: an equivariant diffusion model for designing and scoring protein loops

链接: https://arxiv.org/abs/2409.18201
作者: Kevin Borisiak,Gian Marco Visani,Armita Nourmohammad
关键词-EN: Predicting protein functional, designing novel therapeutics, Predicting protein, characteristics from structure, structure remains
类目: Biological Physics (physics.bio-ph); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Predicting protein functional characteristics from structure remains a central problem in protein science, with broad implications from understanding the mechanisms of disease to designing novel therapeutics. Unfortunately, current machine learning methods are limited by scarce and biased experimental data, and physics-based methods are either too slow to be useful, or too simplified to be accurate. In this work, we present Loop-Diffusion, an energy-based diffusion model which leverages a dataset of general protein loops from the entire protein universe to learn an energy function that generalizes to functional prediction tasks. We evaluate Loop-Diffusion’s performance on scoring TCR-pMHC interfaces and demonstrate state-of-the-art results in recognizing binding-enhancing mutations.

[LG-134] Decomposable Transformer Point Processes NEURIPS2024

链接: https://arxiv.org/abs/2409.18158
作者: Aristeidis Panos
关键词-EN: marked point processes, modeling marked point, thinning algorithm, standard paradigm, marked point
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: accepted at NeurIPS 2024

点击查看摘要

Abstract:The standard paradigm for modeling marked point processes is to parameterize the intensity function using an attention-based (Transformer-style) architecture. Despite the flexibility of these methods, their inference is based on the computationally intensive thinning algorithm. In this work, we propose a framework where the advantages of the attention-based architecture are maintained and the limitation of the thinning algorithm is circumvented. The framework depends on modeling the conditional distribution of inter-event times with a mixture of log-normals satisfying a Markov property and the conditional probability mass function for the marks with a Transformer-based architecture. The proposed method attains state-of-the-art performance in predicting the next event of a sequence given its history. The experiments also reveal the efficacy of methods that do not rely on the thinning algorithm during inference over those that do. Finally, we test our method on the challenging long-horizon prediction task and find that it outperforms a baseline developed specifically for tackling this task; importantly, inference requires just a fraction of the time compared to the thinning-based baseline.
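The key practical advantage of the log-normal mixture is that inter-event times can be sampled directly, with no thinning: pick a mixture component by weight, then exponentiate a Gaussian draw. In the model the mixture parameters would be produced by the network conditioned on history; in this sketch they are fixed, made-up values:

```python
import numpy as np

def sample_inter_event_times(weights, mus, sigmas, n, rng):
    """Draw n inter-event times from a mixture of log-normals:
    choose a component by its weight, draw a Gaussian in log-space,
    then exponentiate. No thinning or rejection step is needed."""
    comps = rng.choice(len(weights), size=n, p=weights)
    z = rng.normal(np.asarray(mus)[comps], np.asarray(sigmas)[comps])
    return np.exp(z)

rng = np.random.default_rng(3)
taus = sample_inter_event_times([0.7, 0.3], [-1.0, 1.0], [0.5, 0.5], 10000, rng)
print(bool(np.all(taus > 0)))  # log-normal support is (0, inf) -> True
```

Analytic moments are also available in closed form (each component has mean exp(mu + sigma^2/2)), which is what makes likelihood training and prediction cheap compared to thinning-based inference.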

[LG-135] A novel application of Shapley values for large multidimensional time-series data: Applying explainable AI to a DNA profile classification neural network

链接: https://arxiv.org/abs/2409.18156
作者: Lauren Elborough,Duncan Taylor,Melissa Humphries
关键词-EN: computationally challenging, Shapley, DNA, data, classification
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:The application of Shapley values to high-dimensional, time-series-like data is computationally challenging - and sometimes impossible. For N inputs the problem is 2^N hard. In image processing, clusters of pixels, referred to as superpixels, are used to streamline computations. This research presents an efficient solution for time-series-like data that adapts the idea of superpixels for Shapley value computation. Motivated by a forensic DNA classification example, the method is applied to multivariate time-series-like data whose features have been classified by a convolutional neural network (CNN). In DNA processing, it is important to identify alleles from the background noise created by DNA extraction and processing. A single DNA profile has 31,200 scan points to classify, and the classification decisions must be defensible in a court of law. This means that classification is routinely performed by human readers - a monumental and time-consuming process. The application of a CNN with fast computation of meaningful Shapley values provides a potential alternative to this classification. This research demonstrates the realistic, accurate and fast computation of Shapley values for this massive task.
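The superpixel idea transfers to time-series-like inputs by computing Shapley values over feature *groups*, reducing the cost from 2^N over individual scan points to 2^G over groups. A sketch with exact enumeration over a handful of hypothetical groups and a toy additive model (not the paper's CNN or grouping):

```python
import math
from itertools import permutations
import numpy as np

def group_shapley(f, x, baseline, groups):
    """Exact Shapley values over feature groups ("superpixels"):
    enumerate all orderings of the groups and average each group's
    marginal contribution when its features are switched from the
    baseline to their actual values. Cost scales with the number of
    groups G, not the number of raw features N."""
    G = len(groups)
    phi = np.zeros(G)
    for order in permutations(range(G)):
        z = baseline.copy()
        prev = f(z)
        for g in order:
            z[groups[g]] = x[groups[g]]  # reveal this group's features
            cur = f(z)
            phi[g] += cur - prev
            prev = cur
    return phi / math.factorial(G)

x = np.arange(6, dtype=float)
baseline = np.zeros(6)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(group_shapley(lambda z: z.sum(), x, baseline, groups))  # [1. 5. 9.]
```

For the additive toy model each group's Shapley value is exactly its summed contribution, and the values satisfy the efficiency property (they sum to f(x) - f(baseline)); with 31,200 scan points grouped into a few hundred segments, sampling-based versions of this computation become tractable.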

[LG-136] A shortest-path based clustering algorithm for joint human-machine analysis of complex datasets

链接: https://arxiv.org/abs/1812.11850
作者: Diego Ulisse Pizzagalli,Santiago Fernandez Gonzalez,Rolf Krause
关键词-EN: biomedical research, obtained by empirical, empirical studies, major application, application for biomedical
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering is a technique for the analysis of datasets obtained by empirical studies in several disciplines, with a major application in biomedical research. Essentially, clustering algorithms are executed by machines aiming at finding groups of related points in a dataset. However, the result of grouping depends on both metrics for point-to-point similarity and rules for point-to-group association. Indeed, inappropriate metrics and rules can lead to undesirable clustering artifacts. This is especially relevant for datasets where groups with heterogeneous structures co-exist. In this work, we propose an algorithm that achieves clustering by exploring the paths between points. This allows both evaluating the properties of the path (such as gaps and density variations) and expressing a preference for certain paths. Moreover, our algorithm supports the integration of existing knowledge about admissible and non-admissible clusters by training a path classifier. We demonstrate the accuracy of the proposed method on challenging datasets, including points from synthetic shapes in publicly available benchmarks and microscopy data.

信息检索

[IR-0] LML: Language Model Learning a Dataset for Data-Augmented Prediction

链接: https://arxiv.org/abs/2409.18957
作者: Praneeth Vadlapati
关键词-EN: Large Language Models, Language Model Learning, Large Language, Machine Learning Model, Explainable Machine Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: First version

点击查看摘要

Abstract:This paper introduces a new approach to using Large Language Models (LLMs) for classification tasks, which are typically handled using Machine Learning (ML) models. Unlike ML models that rely heavily on data cleaning and feature engineering, this method streamlines the process using LLMs. This paper proposes a new concept called “Language Model Learning (LML)” powered by a new method called “Data-Augmented Prediction (DAP)”. The classification is performed by LLMs using a method similar to humans manually exploring and understanding the data and deciding classifications using data as a reference. Training data is summarized and evaluated to determine the features that lead to the classification of each label the most. In the process of DAP, the system uses the data summary to automatically create a query, which is used to retrieve relevant rows from the dataset. A classification is generated by the LLM using data summary and relevant rows, ensuring satisfactory accuracy even with complex data. Usage of data summary and similar data in DAP ensures context-aware decision-making. The proposed method uses the words “Act as an Explainable Machine Learning Model” in the prompt to enhance the interpretability of the predictions by allowing users to review the logic behind each prediction. In some test cases, the system scored an accuracy above 90%, proving the effectiveness of the system and its potential to outperform conventional ML models in various scenarios. The code is available at this https URL
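The DAP flow described above (summarize the data, build a query, retrieve relevant rows, assemble a prompt) might be sketched roughly as follows. All helper names and the toy dataset are hypothetical, and the actual LLM call is omitted; only the "Act as an Explainable Machine Learning Model" phrase comes from the paper:

```python
from collections import Counter

def summarize(rows, label_key="label"):
    """Toy data summary: the most common value of each feature per label."""
    summary = {}
    for label in {r[label_key] for r in rows}:
        subset = [r for r in rows if r[label_key] == label]
        summary[label] = {
            k: Counter(r[k] for r in subset).most_common(1)[0][0]
            for k in subset[0] if k != label_key
        }
    return summary

def retrieve(rows, query_row, k=2):
    """Retrieve the k rows sharing the most feature values with the query."""
    overlap = lambda r: sum(r.get(f) == v for f, v in query_row.items())
    return sorted(rows, key=overlap, reverse=True)[:k]

def build_prompt(summary, relevant, query_row):
    """Assemble the classification prompt from summary and retrieved rows."""
    return (
        "Act as an Explainable Machine Learning Model.\n"
        f"Data summary per label: {summary}\n"
        f"Relevant rows: {relevant}\n"
        f"Classify this row and explain your reasoning: {query_row}"
    )

rows = [
    {"color": "red", "size": "small", "label": "apple"},
    {"color": "red", "size": "small", "label": "apple"},
    {"color": "yellow", "size": "long", "label": "banana"},
]
query = {"color": "red", "size": "small"}
prompt = build_prompt(summarize(rows), retrieve(rows, query), query)
print("Act as an Explainable Machine Learning Model" in prompt)  # True
```

In the actual system the prompt would be sent to an LLM, whose answer carries both the predicted label and the reasoning the user can review.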

[IR-1] Suicide Phenotyping from Clinical Notes in Safety-Net Psychiatric Hospital Using Multi-Label Classification with Pre-Trained Language Models

链接: https://arxiv.org/abs/2409.18878
作者: Zehan Li,Yan Hu,Scott Lane,Salih Selek,Lokesh Shahani,Rodrigo Machado-Vieira,Jair Soares,Hua Xu,Hongfang Liu,Ming Huang
关键词-EN: reducing operational burden, improving care quality, Accurate identification, high-acuity psychiatric settings, reducing operational
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: submitted to AMIA Informatics Summit 2025 as a conference paper

点击查看摘要

Abstract:Accurate identification and categorization of suicidal events can yield better suicide precautions, reduce operational burden, and improve care quality in high-acuity psychiatric settings. Pre-trained language models offer promise for identifying suicidality from unstructured clinical narratives. We evaluated the performance of four BERT-based models using two fine-tuning strategies (multiple single-label and single multi-label) for detecting coexisting suicidal events from 500 annotated psychiatric evaluation notes. The notes were labeled for suicidal ideation (SI), suicide attempts (SA), exposure to suicide (ES), and non-suicidal self-injury (NSSI). RoBERTa outperformed other models using binary relevance (acc=0.86, F1=0.78). MentalBERT (F1=0.74) also exceeded BioClinicalBERT (F1=0.72). RoBERTa fine-tuned with a single multi-label classifier further improved performance (acc=0.88, F1=0.81), highlighting that models pre-trained on domain-relevant data and the single multi-label classification strategy enhance efficiency and performance. Keywords: EHR-based Phenotyping; Natural Language Processing; Secondary Use of EHR Data; Suicide Classification; BERT-based Model; Psychiatry; Mental Health

[IR-2] Cross-Domain Keyword Extraction with Keyness Patterns

链接: https://arxiv.org/abs/2409.18724
作者: Dongmei Zhou,Xuri Tang
关键词-EN: subjectivity pose challenges, supervised keyword extraction, keyword extraction, annotation subjectivity pose, keyness patterns
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
*备注: 26 pages, 14 figures

点击查看摘要

Abstract:Domain dependence and annotation subjectivity pose challenges for supervised keyword extraction. Based on the premises that second-order keyness patterns exist at the community level and are learnable from annotated keyword extraction datasets, this paper proposes a supervised ranking approach to keyword extraction that ranks keywords with keyness patterns consisting of independent features (such as sublanguage domain and term length) and three categories of dependent features – heuristic features, specificity features, and representativity features. The approach uses two convolutional-neural-network based models to learn keyness patterns from keyword datasets and overcomes annotation subjectivity by training the two models with a bootstrap sampling strategy. Experiments demonstrate that the approach not only achieves state-of-the-art performance on ten keyword datasets in general supervised keyword extraction with an average top-10 F-measure of 0.316, but also robust cross-domain performance with an average top-10 F-measure of 0.346 on four datasets that are excluded from the training process. Such cross-domain robustness is attributed to the fact that community-level keyness patterns are limited in number and moderately independent of language domains, to the distinction between independent features and dependent features, and to the sampling training strategy that balances excess risk and lack of negative training data.

[IR-3] Scalable Cross-Entropy Loss for Sequential Recommendations with Large Item Catalogs RECSYS’24

链接: https://arxiv.org/abs/2409.18721
作者: Gleb Mezentsev,Danil Gusak,Ivan Oseledets,Evgeny Frolov
关键词-EN: Scalability issue plays, modern recommender systems, productionizing modern recommender, Scalability issue, recommender systems
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 11 pages, accepted for RecSys’24

点击查看摘要

Abstract:Scalability plays a crucial role in productionizing modern recommender systems. Even lightweight architectures may suffer from high computational overload due to intermediate calculations, limiting their practicality in real-world applications. Specifically, applying full Cross-Entropy (CE) loss often yields state-of-the-art performance in terms of recommendation quality. Still, it suffers from excessive GPU memory utilization when dealing with large item catalogs. This paper introduces a novel Scalable Cross-Entropy (SCE) loss function in the sequential learning setup. It approximates the CE loss for datasets with large-size catalogs, enhancing both time efficiency and memory usage without compromising recommendation quality. Unlike traditional negative sampling methods, our approach utilizes a selective GPU-efficient computation strategy, focusing on the most informative elements of the catalog, particularly those most likely to be false positives. This is achieved by approximating the softmax distribution over a subset of the model outputs through maximum inner product search. Experimental results on multiple datasets demonstrate the effectiveness of SCE in reducing peak memory usage by a factor of up to 100 compared to the alternatives, retaining or even exceeding their metric values. The proposed approach also opens new perspectives for large-scale developments in different domains, such as large language models.
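The core approximation can be sketched in a few lines: compute the cross-entropy using only the positive item plus the catalog items with the largest inner products (the likely false positives), instead of the full softmax over the catalog. This is an illustrative numpy sketch of the idea, not the paper's GPU implementation:

```python
import numpy as np

def scalable_ce(user_emb, item_embs, pos_id, k):
    """Approximate -log softmax(scores)[pos_id] over a large catalog using
    only the positive item plus the k items with the largest inner
    products (the most informative, likely-false-positive negatives)."""
    scores = item_embs @ user_emb              # maximum inner product scores
    top = np.argpartition(-scores, k - 1)[:k]  # indices of the k hardest items
    keep = np.union1d(top, [pos_id])           # always include the positive
    s = scores[keep]
    m = np.max(s)
    lse = m + np.log(np.sum(np.exp(s - m)))    # normalizer over the subset
    return lse - scores[pos_id]

rng = np.random.default_rng(4)
items = rng.normal(size=(1000, 16))
user = rng.normal(size=16)
full = scalable_ce(user, items, pos_id=7, k=1000)  # k = catalog size: exact CE
approx = scalable_ce(user, items, pos_id=7, k=50)
print(bool(approx <= full))  # subset normalizer lower-bounds the full one -> True
```

Because the omitted items contribute only small exponentials, the subset log-sum-exp closely tracks (and never exceeds) the full normalizer, which is why memory can shrink by orders of magnitude with little quality loss.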

[IR-4] Less is More: Towards Sustainability-Aware Persuasive Explanations in Recommender Systems RECSYS2024

链接: https://arxiv.org/abs/2409.18690
作者: Thi Ngoc Trang Tran,Seda Polat Erdeniz,Alexander Felfernig,Sebastian Lubos,Merfat El-Mansi,Viet-Man Le
关键词-EN: United Nations sustainable, Nations sustainable development, United Nations, Recommender systems play, sustainable development goals
类目: Information Retrieval (cs.IR)
*备注: The paper was accepted for publication and will be presented in the LBR track of RecSys 2024, 14.- 18. October 2024, Bari, Italy

点击查看摘要

Abstract:Recommender systems play an important role in supporting the achievement of the United Nations sustainable development goals (SDGs). In recommender systems, explanations can support different goals, such as increasing a user’s trust in a recommendation, persuading a user to purchase specific items, or increasing the understanding of the reasons behind a recommendation. In this paper, we discuss the concept of “sustainability-aware persuasive explanations” which we regard as a major concept to support the achievement of the mentioned SDGs. Such explanations are orthogonal to most existing explanation approaches since they focus on a “less is more” principle, which per se is not included in existing e-commerce platforms. Based on a user study in three item domains, we analyze the potential impacts of sustainability-aware persuasive explanations. The study results are promising regarding user acceptance and the potential impacts of such explanations.

[IR-5] Explainable Enrichment-Driven GrAph Reasoner (EDGAR) for Large Knowledge Graphs with Applications in Drug Repurposing

链接: https://arxiv.org/abs/2409.18659
作者: Olawumi Olasunkanmi,Evan Morris,Yaphet Kebede,Harlin Lee,Stanley Ahalt,Alexander Tropsha,Chris Bizon
关键词-EN: Knowledge graphs, represent connections, real-world entities, connections and relationships, relationships between real-world
类目: Information Theory (cs.IT); Information Retrieval (cs.IR)
*备注: 10 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Knowledge graphs (KGs) represent connections and relationships between real-world entities. We propose a link prediction framework for KGs named Enrichment-Driven GrAph Reasoner (EDGAR), which infers new edges by mining entity-local rules. This approach leverages enrichment analysis, a well-established statistical method used to identify mechanisms common to sets of differentially expressed genes. EDGAR’s inference results are inherently explainable and rankable, with p-values indicating the statistical significance of each enrichment-based rule. We demonstrate the framework’s effectiveness on a large-scale biomedical KG, ROBOKOP, focusing on drug repurposing for Alzheimer disease (AD) as a case study. Initially, we extracted 14 known drugs from the KG and identified 20 contextual biomarkers through enrichment analysis, revealing functional pathways relevant to shared drug efficacy for AD. Subsequently, using the top 1000 enrichment results, our system identified 1246 additional drug candidates for AD treatment. The top 10 candidates were validated using evidence from the medical literature. EDGAR is deployed within ROBOKOP, complete with a web user interface. This is the first study to apply enrichment analysis to large graph completion and drug repurposing.
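The enrichment analysis at EDGAR's core is the standard one-sided hypergeometric (Fisher) test of over-representation: how surprising is it to see k annotated items among n selected, given K annotated items in a universe of N? A stdlib sketch of the p-value computation (the numbers in the example are made up):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): the probability of
    drawing at least k annotated items in a sample of size n from a
    universe of N items of which K carry the annotation."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# All 5 selected nodes carry an annotation held by only 5 of 10 nodes:
print(enrichment_pvalue(N=10, K=5, n=5, k=5))  # 1/252 ~ 0.00397
```

Small p-values like this are exactly what EDGAR uses to rank its entity-local rules, making each inferred edge explainable by the statistical significance of the enrichment behind it.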

[IR-6] Corpus-informed Retrieval Augmented Generation of Clarifying Questions

链接: https://arxiv.org/abs/2409.18575
作者: Antonios Minas Krasakis,Andrew Yates,Evangelos Kanoulas
关键词-EN: Retrieval Augmented Language, informed clarifying questions, Augmented Language Models, generate corpus informed, corpus informed clarifying
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This study aims to develop models that generate corpus-informed clarifying questions for web search, in a way that ensures the questions align with the available information in the retrieval corpus. We demonstrate the effectiveness of Retrieval Augmented Language Models (RAG) in this process, emphasising their ability to (i) jointly model the user query and retrieval corpus to pinpoint the uncertainty and ask for clarifications end-to-end and (ii) model more evidence documents, which can be used towards increasing the breadth of the questions asked. However, we observe that in current datasets search intents are largely unsupported by the corpus, which is problematic both for training and evaluation. This causes question generation models to "hallucinate", i.e. suggest intents that are not in the corpus, which can have detrimental effects on performance. To address this, we propose dataset augmentation methods that align the ground truth clarifications with the retrieval corpus. Additionally, we explore techniques to enhance the relevance of the evidence pool during inference, but find that identifying ground truth intents within the corpus remains challenging. Our analysis suggests that this challenge is partly due to the bias of current datasets towards clarification taxonomies and calls for data that can support generating corpus-informed clarifications.

[IR-7] Decomposing the Jaccard Distance and the Jaccard Index in ABCDE

链接: https://arxiv.org/abs/2409.18522
作者: Stephan van Staden
关键词-EN: clustering, evaluating differences, clustering diff, Quality metrics, large clusterings
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:ABCDE is a sophisticated technique for evaluating differences between very large clusterings. Its main metric that characterizes the magnitude of the difference between two clusterings is the JaccardDistance, which is a true distance metric in the space of all clusterings of a fixed set of (weighted) items. The JaccardIndex is the complementary metric that characterizes the similarity of two clusterings. Its relationship with the JaccardDistance is simple: JaccardDistance + JaccardIndex = 1. This paper decomposes the JaccardDistance and the JaccardIndex further. In each case, the decomposition yields Impact and Quality metrics. The Impact metrics measure aspects of the magnitude of the clustering diff, while Quality metrics use human judgements to measure how much the clustering diff improves the quality of the clustering. The decompositions of this paper offer more and deeper insight into a clustering change. They also unlock new techniques for debugging and exploring the nature of the clustering diff. The new metrics are mathematically well-behaved and they are interrelated via simple equations. While the work can be seen as an alternative formal framework for ABCDE, we prefer to view it as complementary. It certainly offers a different perspective on the magnitude and the quality of a clustering change, and users can use whatever they want from each approach to gain more insight into a change.
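ABCDE's actual JaccardDistance is defined over weighted items; as an unweighted simplification for illustration, the two metrics can be computed from the sets of co-clustered item pairs of two clusterings (the toy clusterings below are invented):

```python
from itertools import combinations

def cocluster_pairs(clustering):
    """Set of unordered item pairs placed in the same cluster."""
    pairs = set()
    for cluster in clustering:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def jaccard_index(clustering_a, clustering_b):
    """Similarity of two clusterings via co-clustered pair overlap."""
    a, b = cocluster_pairs(clustering_a), cocluster_pairs(clustering_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

base = [{"x", "y", "z"}, {"u", "v"}]
expt = [{"x", "y"}, {"z", "u", "v"}]
ji = jaccard_index(base, expt)
# On this toy diff: JaccardIndex = 2/6, and JaccardDistance + JaccardIndex = 1.
assert ji == 2 / 6
assert (1 - ji) + ji == 1.0
```

The paper's contribution is to decompose each metric further into Impact and Quality terms; the sketch above only reproduces the undecomposed quantities.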

[IR-8] Do We Need Domain-Specific Embedding Models? An Empirical Investigation

链接: https://arxiv.org/abs/2409.18511
作者: Yixuan Tang,Yi Yang
关键词-EN: Text Embedding Benchmark, NLP applications, Massive Text Embedding, Embedding models, Embedding
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: this https URL

点击查看摘要

Abstract:Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advancements in Large Language Models (LLMs) have further enhanced the performance of embedding models, which are trained on massive amounts of text covering almost every domain. These models are often benchmarked on general-purpose datasets like Massive Text Embedding Benchmark (MTEB), where they demonstrate superior performance. However, a critical question arises: Is the development of domain-specific embedding models necessary when general-purpose models are trained on vast corpora that already include specialized domain texts? In this paper, we empirically investigate this question, choosing the finance domain as an example. We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to MTEB that consists of financial domain-specific text datasets. We evaluate the performance of seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop compared to their performance on MTEB. To account for the possibility that this drop is driven by FinMTEB’s higher complexity, we propose four measures to quantify dataset complexity and control for this factor in our analysis. Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns, even when trained on large general-purpose corpora. This study sheds light on the necessity of developing domain-specific embedding models in the LLM era, offering valuable insights for researchers and practitioners.

[IR-9] Efficient Top-k s-Biplexes Search over Large Bipartite Graphs

链接: https://arxiv.org/abs/2409.18473
作者: Zhenxiang Xu,Yiping Liu,Yi Zhou,Yimin Hao,Zhengren Wang
关键词-EN: subgraph is adjacent, opposite set, bipartite graph, bipartite graph analysis, subgraph
类目: Information Retrieval (cs.IR); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:In a bipartite graph, a subgraph is an s-biplex if each vertex of the subgraph is adjacent to all but at most s vertices on the opposite set. The enumeration of s-biplexes from a given graph is a fundamental problem in bipartite graph analysis. However, in real-world data engineering, finding all s-biplexes is neither necessary nor computationally affordable. A more realistic problem is to identify some of the largest s-biplexes from the large input graph. We formulate the problem as the top-k s-biplex search (TBS) problem, which aims to find the top-k maximal s-biplexes with the most vertices, where k is an input parameter. We prove that the TBS problem is NP-hard for any fixed k ≥ 1. Then, we propose a branching algorithm, named MVBP, that breaks the simple 2^n enumeration algorithm. Furthermore, from a practical perspective, we investigate three techniques to improve the performance of MVBP: 2-hop decomposition, single-side bounds, and progressive search. Complexity analysis shows that the improved algorithm, named FastMVBP, has a running time O^*(\gamma_s^{d_2}), where \gamma_s < 2, and d_2 is a parameter much smaller than the number of vertices in sparse real-world graphs, e.g. d_2 is only 67 in the AmazonRatings dataset, which has more than 3 million vertices. Finally, we conducted extensive experiments on eight real-world and synthetic datasets to demonstrate the empirical efficiency of the proposed algorithms. In particular, FastMVBP outperforms the benchmark algorithms by up to three orders of magnitude in several instances.
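The s-biplex condition itself is easy to state in code: every vertex may miss at most s neighbours on the opposite side. A minimal verification sketch (the toy graph is invented for illustration; the paper's algorithms search for such subgraphs rather than merely check them):

```python
def is_s_biplex(left, right, edges, s):
    """Check whether (left, right) induces an s-biplex in a bipartite
    graph: each vertex is adjacent to all but at most s vertices on
    the opposite side. `edges` is a set of (l, r) pairs."""
    for l in left:
        if sum((l, r) not in edges for r in right) > s:
            return False
    for r in right:
        if sum((l, r) not in edges for l in left) > s:
            return False
    return True

# Toy bipartite graph: a near-complete biclique missing one edge.
L, R = {"a", "b"}, {"1", "2", "3"}
E = {("a", "1"), ("a", "2"), ("a", "3"), ("b", "1"), ("b", "2")}
assert is_s_biplex(L, R, E, s=1)      # "b" misses only "3"
assert not is_s_biplex(L, R, E, s=0)  # s = 0 would require a full biclique
```

Note that a 0-biplex is exactly a biclique, which is why the enumeration space the paper prunes grows so quickly as s increases.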

[IR-10] Neural Collaborative Filtering to Detect Anomalies in Human Semantic Trajectories

链接: https://arxiv.org/abs/2409.18427
作者: Yueyang Liu,Lance Kennedy,Hossein Amiri,Andreas Züfle
关键词-EN: including security surveillance, trajectory anomaly detection, trajectory anomaly, anomaly detection, range of applications
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: Accepted for publication in the 1st ACM SIGSPATIAL International Workshop on Geospatial Anomaly Detection (GeoAnomalies’24)

点击查看摘要

Abstract:Human trajectory anomaly detection has become increasingly important across a wide range of applications, including security surveillance and public health. However, existing trajectory anomaly detection methods are primarily focused on vehicle-level traffic, while human-level trajectory anomaly detection remains under-explored. Since human trajectory data is often very sparse, machine learning methods have become the preferred approach for identifying complex patterns. However, concerns regarding potential biases and the robustness of these models have intensified the demand for more transparent and explainable alternatives. In response to these challenges, our research focuses on developing a lightweight anomaly detection model specifically designed to detect anomalies in human trajectories. We propose a Neural Collaborative Filtering approach to model and predict normal mobility. Our method is designed to model users’ daily patterns of life without requiring prior knowledge, thereby enhancing performance in scenarios where data is sparse or incomplete, such as in cold start situations. Our algorithm consists of two main modules. The first is the collaborative filtering module, which applies collaborative filtering to model normal mobility of individual humans to places of interest. The second is the neural module, responsible for interpreting the complex spatio-temporal relationships inherent in human trajectory data. To validate our approach, we conducted extensive experiments using simulated and real-world datasets, comparing our method to numerous state-of-the-art trajectory anomaly detection approaches.
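The paper's neural collaborative filtering model is not specified in the abstract; as a rough stand-in, a tiny matrix-factorisation sketch conveys the general idea of scoring (user, place) visits so that unseen, low-scoring visits look anomalous (all data, dimensions, and hyperparameters below are invented):

```python
import random

random.seed(0)

def train_mf(visits, n_users, n_places, dim=4, epochs=200, lr=0.05):
    """Tiny matrix-factorisation stand-in for neural CF: learn user and
    place vectors so that observed (user, place) visits score near 1
    and sampled negatives score near 0."""
    U = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_users)]
    P = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_places)]
    for _ in range(epochs):
        for u, p, label in visits:  # label: 1.0 = visited, 0.0 = negative
            pred = sum(a * b for a, b in zip(U[u], P[p]))
            err = label - pred
            for d in range(dim):
                # Simultaneous SGD update on the squared error.
                U[u][d], P[p][d] = (U[u][d] + lr * err * P[p][d],
                                    P[p][d] + lr * err * U[u][d])
    return U, P

def score(U, P, u, p):
    return sum(a * b for a, b in zip(U[u], P[p]))

# User 0 habitually visits place 0; place 1 is a sampled negative.
data = [(0, 0, 1.0), (0, 1, 0.0)]
U, P = train_mf(data, n_users=1, n_places=2)
assert score(U, P, 0, 0) > score(U, P, 0, 1)  # a visit to place 1 looks anomalous
```

Low predicted affinity for an observed visit is the anomaly signal; the paper's neural module additionally models spatio-temporal structure that this sketch omits.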

[IR-11] Generative Retrieval Meets Multi-Graded Relevance NEURIPS2024

链接: https://arxiv.org/abs/2409.18409
作者: Yubao Tang,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Wei Chen,Xueqi Cheng
关键词-EN: Generative retrieval, relevance, retrieval, Generative retrieval represents, relevant
类目: Information Retrieval (cs.IR)
*备注: Accepted by NeurIPS 2024 (Spotlight)

点击查看摘要

Abstract:Generative retrieval represents a novel approach to information retrieval. It uses an encoder-decoder architecture to directly produce relevant document identifiers (docids) for queries. While this method offers benefits, current approaches are limited to scenarios with binary relevance data, overlooking the potential for documents to have multi-graded relevance. Extending generative retrieval to accommodate multi-graded relevance poses challenges, including the need to reconcile likelihood probabilities for docid pairs and the possibility of multiple relevant documents sharing the same identifier. To address these challenges, we introduce a framework called GRaded Generative Retrieval (GR^2). GR^2 focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training. First, we create identifiers that are both semantically relevant and sufficiently distinct to represent individual documents effectively. This is achieved by jointly optimizing the relevance and distinctness of docids through a combination of docid generation and autoencoder models. Second, we incorporate information about the relationship between relevance grades to guide the training process. We use a constrained contrastive training strategy to bring the representations of queries and the identifiers of their relevant documents closer together, based on their respective relevance grades. Extensive experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR^2.
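The abstract does not spell out the exact training objective; the following toy grade-weighted softmax loss merely illustrates what pulling queries toward higher-grade docids could look like (the function name, weighting scheme, and all numbers are assumptions, not the paper's loss):

```python
import math

def graded_contrastive_loss(q, docs, grades, tau=0.1):
    """Softmax-contrastive loss in which each docid's log-probability is
    weighted by its relevance grade, so higher-grade docids are pulled
    more strongly toward the query embedding `q`."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    logits = [dot(q, d) / tau for d in docs]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    total_grade = sum(grades)
    loss = 0.0
    for l, g in zip(logits, grades):
        loss -= (g / total_grade) * math.log(math.exp(l - m) / denom)
    return loss

# A query aligned with its high-grade docid incurs a lower loss than
# a query aligned with the low-grade docid instead.
assert graded_contrastive_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [2.0, 1.0]) < \
       graded_contrastive_loss([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], [2.0, 1.0])
```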

[IR-12] Tracking Software Security Topics

链接: https://arxiv.org/abs/2409.18351
作者: Phong Minh Vu,Tung Thanh Nguyen
关键词-EN: incidents occur everyday, Software security, Software security incidents, security incidents occur, software security reports
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Software security incidents occur everyday and thousands of software security reports are announced each month. Thus, it is difficult for software security researchers, engineers, and other stakeholders to follow software security topics of their interests in real-time. In this paper, we propose, SOSK, a novel tool for this problem. SOSK allows a user to import a collection of software security reports. It pre-processes and extracts the most important keywords from the textual description of the reports. Based on the similarity of embedding vectors of keywords, SOSK can expand and/or refine a keyword set from a much smaller set of user-provided keywords. Thus, SOSK allows users to define any topic of their interests and retrieve security reports relevant to that topic effectively. Our preliminary evaluation shows that SOSK can expand keywords and retrieve reports relevant to user requests.
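SOSK's keyword expansion via embedding similarity can be sketched as follows (the 2-d "embeddings", vocabulary, and threshold below are invented for illustration; a real system would use learned vectors over security-report text):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def expand_keywords(seed, vocab_embeddings, threshold=0.8):
    """Return the seed keywords plus every vocabulary word whose
    embedding is close (cosine >= threshold) to some seed keyword."""
    expanded = set(seed)
    for word, vec in vocab_embeddings.items():
        if word in expanded:
            continue
        if any(cosine(vec, vocab_embeddings[s]) >= threshold for s in seed):
            expanded.add(word)
    return expanded

# Toy 2-d "embeddings": "overflow" sits near "buffer"; "patch" does not.
emb = {"buffer": [1.0, 0.1], "overflow": [0.9, 0.2], "patch": [0.0, 1.0]}
assert expand_keywords({"buffer"}, emb, threshold=0.9) == {"buffer", "overflow"}
```

The expanded set can then drive retrieval of reports relevant to the user-defined topic, which is the usage pattern the abstract describes.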

[IR-13] Evaluation of Cluster Id Assignment Schemes with ABCDE

链接: https://arxiv.org/abs/2409.18254
作者: Stephan van Staden
关键词-EN: cluster, assignment scheme labels, assignment, assignment schemes, clustering
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:A cluster id assignment scheme labels each cluster of a clustering with a distinct id. The goal of id assignment is semantic id stability, which means that, whenever possible, a cluster for the same underlying concept as that of a historical cluster should ideally receive the same id as the historical cluster. Semantic id stability allows the users of a clustering to refer to a concept’s cluster with an id that is stable across clusterings/time. This paper treats the problem of evaluating the relative merits of id assignment schemes. In particular, it considers a historical clustering with id assignments, and a new clustering with ids assigned by a baseline and an experiment. It produces metrics that characterize both the magnitude and the quality of the id assignment diffs between the baseline and the experiment. That happens by transforming the problem of cluster id assignment into a problem of cluster membership, and evaluating it with ABCDE. ABCDE is a sophisticated and scalable technique for evaluating differences in cluster membership in real-world applications, where billions of items are grouped into millions of clusters, and some items are more important than others. The paper also describes several generalizations to the basic evaluation setup for id assignment schemes. For example, it is fairly straightforward to evaluate changes that simultaneously mutate cluster memberships and cluster ids. The ideas are generously illustrated with examples.
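A cluster id assignment scheme of the kind being evaluated can be illustrated with a greedy maximum-overlap heuristic: each new cluster inherits the id of the historical cluster it shares the most items with, and genuinely new clusters get fresh ids. This is a generic sketch with invented data, not the paper's scheme:

```python
def assign_ids(historical, new_clusters):
    """Greedy id assignment for semantic id stability.
    `historical` maps id -> set of items; returns id -> new cluster."""
    assignment = {}
    used = set()
    next_fresh = max(historical, default=-1) + 1
    for cluster in sorted(new_clusters, key=len, reverse=True):
        best_id, best_overlap = None, 0
        for hid, members in historical.items():
            overlap = len(cluster & members)
            if hid not in used and overlap > best_overlap:
                best_id, best_overlap = hid, overlap
        if best_id is None:  # no historical concept matches: mint a new id
            best_id, next_fresh = next_fresh, next_fresh + 1
        used.add(best_id)
        assignment[best_id] = cluster
    return assignment

hist = {0: {"a", "b", "c"}, 1: {"x", "y"}}
new = [{"a", "b", "c", "d"}, {"x", "z"}, {"q"}]
ids = assign_ids(hist, new)
assert ids[0] == {"a", "b", "c", "d"}  # same concept keeps its historical id
assert ids[1] == {"x", "z"}
assert ids[2] == {"q"}                 # a genuinely new cluster gets a fresh id
```

The paper's evaluation question is then: given two such schemes (baseline and experiment), how large and how good is the diff between their id assignments, which it answers by reducing id assignment to cluster membership and applying ABCDE.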
