This post presents the latest paper list retrieved automatically from arXiv.org on 2024-11-07, organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-11-07)
387 new papers in today's update, including:
- Natural Language Processing: 54 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 97 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 76 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 118 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? EMNLP2024
[Quick Read]: This paper asks whether continued pretraining of general-purpose large language models (LLMs) and vision-language models (VLMs) on biomedical corpora, i.e. domain-adaptive pretraining (DAPT), actually yields significant gains on medical question answering. The key of the analysis is to compare each medical model head-to-head against its corresponding base model, optimize prompts for each model separately, and account for statistical uncertainty. Under this protocol, medical models fail to consistently beat their base models in the zero-/few-shot prompting regime: in the 3-shot setting, medical LLMs outperform their base models in only 12.1% of cases, tie in 49.8%, and are worse in the remaining 38.2%. This challenges the common claim in the literature that DAPT is effective in the medical domain, and the authors recommend more rigorous evaluation practices for future studies.
Link: https://arxiv.org/abs/2411.04118
Authors: Daniel P. Jeong, Saurabh Garg, Zachary C. Lipton, Michael Oberst
Keywords-EN: adapting general-purpose large, general-purpose large language, recent works seek, develop foundation models, foundation models specifically
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Main Conference as Long Paper (Oral)
Abstract:Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public “medical” LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
[NLP-1] Self-Consistency Preference Optimization
[Quick Read]: This paper targets the difficulty self-alignment techniques have with complex reasoning tasks, where assigning correct rewards is hard. The key idea is Self-Consistency Preference Optimization (ScPO), which extends the inference-time notion of self-consistency into training: the model is iteratively trained to prefer consistent answers over inconsistent ones on unsupervised new problems. ScPO yields large gains on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training on gold answers or preferences, and combining ScPO with standard supervised learning improves results further. On ZebraLogic, an ScPO-finetuned Llama-3 8B outperforms Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
Link: https://arxiv.org/abs/2411.04109
Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu
Keywords-EN: growing research area, rapidly growing research, human annotation, research area, rapidly growing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 3 figures
Abstract:Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
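To make the ScPO recipe above concrete, here is a minimal sketch (not the authors' code) of how consistency-based preference pairs could be built on unlabeled problems: sample several answers per question, treat majority-vote frequency as the consistency signal, and emit (chosen, rejected) pairs for a DPO-style preference optimizer. The `sample_answers` callable is a hypothetical stand-in for LLM sampling.

```python
from collections import Counter

def build_scpo_pairs(questions, sample_answers, k=8):
    """For each unlabeled question, sample k answers and build a
    (chosen, rejected) preference pair: the most consistent (most
    frequent) answer is preferred over the least consistent one."""
    pairs = []
    for q in questions:
        answers = [sample_answers(q) for _ in range(k)]
        counts = Counter(answers)
        chosen, chosen_votes = counts.most_common(1)[0]
        rejected, rejected_votes = counts.most_common()[-1]
        if chosen != rejected:
            # The vote margin can serve as a confidence weight for training.
            pairs.append({
                "question": q,
                "chosen": chosen,
                "rejected": rejected,
                "margin": (chosen_votes - rejected_votes) / k,
            })
    return pairs

# Toy usage with a deterministic stand-in for an LLM sampler.
if __name__ == "__main__":
    import random
    random.seed(0)
    fake_llm = lambda q: random.choice(["42", "42", "42", "41"])
    print(build_scpo_pairs(["What is 6*7?"], fake_llm))
```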
[NLP-2] How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis
[Quick Read]: This paper asks how to understand the internal mechanisms by which large language models (LLMs) perform complex logical reasoning. The key is a synthetic propositional logic problem that serves as a concrete test-bed for training and evaluation. Training a small three-layer transformer on it from scratch, the authors identify "planning" and "reasoning" circuits that require cooperation between attention blocks to implement the desired logic. Using activation patching on Mistral 7B, they further characterize the internal components critical to solving the logic problem. Overall, the work systematically uncovers novel aspects of both small and large transformers and continues the study of how they plan and reason.
Link: https://arxiv.org/abs/2411.04105
Authors: Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Rina Panigrahy
Keywords-EN: shown amazing performance, shown amazing, amazing performance, performance on tasks, tasks that require
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network’s ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve, but we can train a small transformer to achieve perfect accuracy. Building on our set-up, we then pursue an understanding of precisely how a three-layer transformer, trained from scratch, solves this problem. We are able to identify certain “planning” and “reasoning” circuits in the network that necessitate cooperation between the attention blocks to implement the desired logic. To expand our findings, we then study a larger model, Mistral 7B. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systemically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason.
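Activation patching, the main tool the paper applies to Mistral 7B, is easy to illustrate on a toy model. The sketch below is a generic illustration (not the paper's setup): cache a layer's activation from a clean input and splice it into a corrupted run via PyTorch forward hooks; comparing the patched output against the clean and corrupted baselines localizes which components carry the behavior.

```python
import torch
import torch.nn as nn

def patch_activation(model, layer, clean_input, corrupted_input):
    """Run `corrupted_input` through `model`, but overwrite `layer`'s
    output with the activation cached from `clean_input` (activation
    patching). Returns the patched model output."""
    cache = {}

    def save_hook(module, inp, out):
        cache["act"] = out.detach()

    def patch_hook(module, inp, out):
        return cache["act"]  # returning a value replaces the output

    h = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)                 # cache the clean activation
    h.remove()

    h = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched = model(corrupted_input)   # corrupted run, clean patch
    h.remove()
    return patched

# Toy model: compare the patched output against the clean and
# corrupted baselines to see how much behavior the layer carries.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
out = patch_activation(model, model[0], torch.randn(1, 4), torch.randn(1, 4))
```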
[NLP-3] Summarization of Opinionated Political Documents with Varied Perspectives
[Quick Read]: Motivated by rising partisan hostility and polarization around presidential elections, this paper argues that models able to generate accurate summaries of diverse perspectives can help reduce polarization by exposing users to alternative viewpoints. The key contribution is a new dataset and task: independently summarizing each political perspective in a set of passages from opinionated news articles, together with an evaluation framework covering several dimensions of perspective-summary quality. Ten models of varying sizes and architectures are benchmarked with both automatic and human evaluation. While recent models such as GPT-4o perform well on the task, all models struggle to generate summaries faithful to the intended perspective.
Link: https://arxiv.org/abs/2411.04093
Authors: Nicholas Deas, Kathleen McKeown
Keywords-EN: Global partisan hostility, Global partisan, presidential elections, partisan hostility, heightened around presidential
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Global partisan hostility and polarization has increased, and this polarization is heightened around presidential elections. Models capable of generating accurate summaries of diverse perspectives can help reduce such polarization by exposing users to alternative perspectives. In this work, we introduce a novel dataset and task for independently summarizing each political perspective in a set of passages from opinionated news articles. For this task, we propose a framework for evaluating different dimensions of perspective summary performance. We benchmark 10 models of varying sizes and architectures through both automatic and human evaluation. While recent models like GPT-4o perform well on this task, we find that all models struggle to generate summaries faithful to the intended perspective. Our analysis of summaries focuses on how extraction behavior depends on the features of the input documents.
[NLP-4] A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement
[Quick Read]: This paper tackles the ambiguity in content moderation caused by annotation disagreement. The key is a new moderation framework that treats disagreement as signal rather than noise: multitask learning with toxicity classification as the primary task and annotation disagreement as an auxiliary task, plus uncertainty estimation via Conformal Prediction to handle both the ambiguity of comment annotations and the model's inherent uncertainty when predicting toxicity. The framework also lets moderators adjust thresholds on annotation disagreement, flexibly deciding when ambiguity should trigger further review. Experiments show the joint approach improves performance, calibration, and uncertainty estimation while being more parameter-efficient and streamlining the review process compared with single-task methods.
Link: https://arxiv.org/abs/2411.04090
Authors: Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz
Keywords-EN: content moderation, content moderation framework, moderation typically combines
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 35 pages, 1 figure
Abstract:Content moderation typically combines the efforts of human moderators and machine learning models. However, these systems often rely on data where significant disagreement occurs during moderation, reflecting the subjective nature of toxicity judgments. Rather than dismissing this disagreement as noise, we interpret it as a valuable signal that highlights the inherent ambiguity of the content, an insight missed when only the majority label is considered. In this work, we introduce a novel content moderation framework that emphasizes the importance of capturing annotation disagreement. Our approach uses multitask learning, where toxicity classification serves as the primary task and annotation disagreement is addressed as an auxiliary task. Additionally, we leverage uncertainty estimation techniques, specifically Conformal Prediction, to account for both the ambiguity in comment annotations and the model's inherent uncertainty in predicting toxicity. The framework also allows moderators to adjust thresholds for annotation disagreement, offering flexibility in determining when ambiguity should trigger a review. We demonstrate that our joint approach enhances model performance, calibration, and uncertainty estimation, while offering greater parameter efficiency and improving the review process in comparison to single-task methods.
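For readers unfamiliar with Conformal Prediction, the uncertainty tool named above, here is a minimal split-conformal sketch for a toxicity classifier (an illustration under standard assumptions, not the paper's implementation): calibrate a nonconformity threshold on held-out data, then emit prediction sets; a set containing both classes flags an ambiguous comment that could be routed to human review.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal prediction: compute the nonconformity quantile
    on a held-out calibration set. cal_probs: (n, C) softmax outputs."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return q

def prediction_set(probs, qhat):
    """Classes whose nonconformity clears the threshold; a set like
    {toxic, non-toxic} marks an ambiguous comment for review."""
    return np.where(1.0 - probs <= qhat)[0]

# Toy usage: 2-class (non-toxic / toxic) calibration data.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet([2, 2], size=200)
cal_labels = rng.integers(0, 2, size=200)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(np.array([0.55, 0.45]), qhat))
```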
[NLP-5] M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
[Quick Read]: Existing benchmarks for evaluating foundation models focus on single-document, text-only tasks and fail to capture the complexity of research workflows, which involve interpreting non-textual data and gathering information across multiple documents. The key contribution is M3SciQA, a multi-modal, multi-document scientific question answering benchmark: 1,452 expert-annotated questions over 70 NLP paper clusters, where each cluster is a primary paper plus all the documents it cites, mirroring the workflow of understanding a single paper through multi-modal, multi-document evidence. A comprehensive evaluation of 18 foundation models shows they still significantly underperform human experts in multi-modal information retrieval and in reasoning across multiple scientific documents; the paper also discusses what these findings imply for applying foundation models to multi-modal scientific literature analysis.
Link: https://arxiv.org/abs/2411.04075
Authors: Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan
Keywords-EN: Existing benchmarks, text-only tasks, focus on single-document, foundation models, evaluating foundation models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis.
[NLP-6] Beemo: Benchmark of Expert-edited Machine-generated Outputs
[Quick Read]: Existing machine-generated text (MGT) benchmarks lack multi-author scenarios, in particular the common case where a user edits an LLM response for natural flow, coherence, and factual correctness. The key contribution is Beemo, the Benchmark of Expert-edited Machine-generated Outputs: 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for use cases ranging from creative writing to summarization, plus 13.1k machine-generated and LLM-edited texts enabling MGT-detection evaluation across edit types. Documenting the creation protocol and benchmarking 33 detector configurations, the authors find that expert editing evades MGT detection, whereas LLM-edited texts are unlikely to be recognized as human-written.
Link: https://arxiv.org/abs/2411.04032
Authors: Ekaterina Artemova, Jason Lucas, Saranya Venkatraman, Jooyoung Lee, Sergei Tilga, Adaku Uchendu, Vladislav Mikhailov
Keywords-EN: large language models, blurred text authorship, language models, rapid proliferation, proliferation of large
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship in various domains. However, most existing MGT benchmarks include single-author texts (human-written and machine-generated). This conventional design fails to capture more practical multi-author scenarios, where the user refines the LLM response for natural flow, coherence, and factual correctness. Our paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for various use cases, ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, allowing for diverse MGT detection evaluation across various edit types. We document Beemo’s creation protocol and present the results of benchmarking 33 configurations of MGT detectors in different experimental setups. We find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Beemo and all materials are publicly available.
[NLP-7] Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages
[Quick Read]: This paper addresses word-level language identification (LI) for the code-mixed text common on social media in multilingual India, especially among young users. The key is a prompt-based method using GPT-3.5 Turbo to classify each word of mixed text into its language category, testing whether a large language model can do this correctly. Results show the Kannada model consistently outperforms the Tamil one on most metrics, with higher accuracy and reliability in identifying and categorizing Kannada instances, while the Tamil model shows moderate performance and needs improvement in precision and recall.
Link: https://arxiv.org/abs/2411.04025
Authors: Aniket Deroy, Subhankar Maity
Keywords-EN: natural language processing, machine translation, Language Identification, sentiment analysis, information retrieval
Subjects: Computation and Language (cs.CL)
Comments: Accepted at FIRE 2024 (Track: Word-level Language Identification in Dravidian Languages)
Abstract:Language Identification (LI) is crucial for various natural language processing tasks, serving as a foundational step in applications such as sentiment analysis, machine translation, and information retrieval. In multilingual societies like India, particularly among the youth engaging on social media, text often exhibits code-mixing, blending local languages with English at different linguistic levels. This phenomenon presents formidable challenges for LI systems, especially when languages intermingle within single words. Dravidian languages, prevalent in southern India, possess rich morphological structures yet suffer from under-representation in digital platforms, leading to the adoption of Roman or hybrid scripts for communication. This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages. In this work, we leveraged GPT-3.5 Turbo to understand whether the large language models is able to correctly classify words into correct categories. Our findings show that the Kannada model consistently outperformed the Tamil model across most metrics, indicating a higher accuracy and reliability in identifying and categorizing Kannada language instances. In contrast, the Tamil model showed moderate performance, particularly needing improvement in precision and recall.
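A schematic version of such a prompt-based word-level LI query might look as follows. The prompt wording, label set, and use of the OpenAI chat API are our assumptions for illustration, not the paper's exact setup (an OPENAI_API_KEY must be set for the call to run).

```python
# Sketch of a prompt-based word-level LI query; the prompt text and
# label set are illustrative assumptions, not the authors' exact ones.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_words(sentence: str) -> str:
    prompt = (
        "You are a word-level language identifier for code-mixed text.\n"
        "For each word in the sentence below, output `word<TAB>label`,\n"
        "where label is one of: Kannada, Tamil, English, Mixed, Other.\n\n"
        f"Sentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(classify_words("nanna favourite movie idu"))
```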
[NLP-8] WorryWords: Norms of Anxiety Association for over 44k English Words
[Quick Read]: This paper addresses open questions about how anxiety relates to language, including how anxiety manifests in words and the rate at which children acquire anxiety-related vocabulary. The key contribution is WorryWords, the first large-scale repository of manually derived word-anxiety associations, covering over 44,450 English words, with associations shown to be highly reliable. WorryWords enables studies of how anxiety relates to other emotion constructs and of the age at which children acquire anxiety words, and the authors show that with WorryWords alone one can accurately track the change of anxiety in streams of text, enabling a wide range of research in psychology, NLP, public health, and the social sciences.
Link: https://arxiv.org/abs/2411.03966
Authors: Saif M. Mohammad
Keywords-EN: potential negative outcome, beneficial human emotion, negative outcome, anticipatory unease, potential negative
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Anxiety, the anticipatory unease about a potential negative outcome, is a common and beneficial human emotion. However, there is still much that is not known, such as how anxiety relates to our body and how it manifests in language. This is especially pertinent given the increasing impact of anxiety-related disorders. In this work, we introduce WorryWords, the first large-scale repository of manually derived word–anxiety associations for over 44,450 English words. We show that the anxiety associations are highly reliable. We use WorryWords to study the relationship between anxiety and other emotion constructs, as well as the rate at which children acquire anxiety words with age. Finally, we show that using WorryWords alone, one can accurately track the change of anxiety in streams of text. The lexicon enables a wide variety of anxiety-related research in psychology, NLP, public health, and social sciences. WorryWords (and its translations to over 100 languages) is freely available. this http URL
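Tracking anxiety in a text stream with a word-anxiety lexicon, as described above, reduces to a sliding-window average over lexicon scores. A minimal sketch (toy lexicon; real WorryWords entries cover 44k+ words):

```python
def anxiety_curve(tokens, lexicon, window=50, step=10):
    """Track anxiety over a token stream with a word-anxiety lexicon
    (word -> association score): average the scores of in-lexicon
    words inside a sliding window."""
    curve = []
    for start in range(0, max(1, len(tokens) - window + 1), step):
        chunk = tokens[start:start + window]
        scores = [lexicon[w] for w in chunk if w in lexicon]
        curve.append(sum(scores) / len(scores) if scores else 0.0)
    return curve

# Toy lexicon with positive = more anxiety-associated.
lex = {"worry": 0.9, "deadline": 0.7, "calm": -0.6, "beach": -0.4}
text = "the deadline is close and i worry but the beach keeps me calm".split()
print(anxiety_curve(text, lex, window=6, step=3))
```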
[NLP-9] What Really is Commonsense Knowledge?
[Quick Read]: This paper addresses the problem that commonsense reasoning benchmarks contain many instances that do not actually concern commonsense knowledge, which undermines the measurement of models' true commonsense reasoning ability. The key is to survey existing definitions of commonsense knowledge, ground them in three frameworks for defining concepts, and consolidate them into a multi-framework unified definition (the consolidated definition), which is then used for annotation and experiments on CommonsenseQA and CommonsenseQA 2.0. The results confirm that a large portion of instances in both datasets are not about commonsense knowledge, and that large language models perform worse on the commonsense-knowledge instances.
Link: https://arxiv.org/abs/2411.03964
Authors: Quyet V. Do, Junze Li, Tung-Duong Vuong, Zhaowei Wang, Yangqiu Song, Xiaojuan Ma
Keywords-EN: Natural Language Processing, developed in Natural, Language Processing, Natural Language, commonsense knowledge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code and data will be released together with the next version of the paper
Abstract:Commonsense datasets have been well developed in Natural Language Processing, mainly through crowdsource human annotation. However, there are debates on the genuineness of commonsense reasoning benchmarks. In specific, a significant portion of instances in some commonsense benchmarks do not concern commonsense knowledge. That problem would undermine the measurement of the true commonsense reasoning ability of evaluated models. It is also suggested that the problem originated from a blurry concept of commonsense knowledge, as distinguished from other types of knowledge. To demystify all of the above claims, in this study, we survey existing definitions of commonsense knowledge, ground into the three frameworks for defining concepts, and consolidate them into a multi-framework unified definition of commonsense knowledge (so-called consolidated definition). We then use the consolidated definition for annotations and experiments on the CommonsenseQA and CommonsenseQA 2.0 datasets to examine the above claims. Our study shows that there exists a large portion of non-commonsense-knowledge instances in the two datasets, and a large performance gap on these two subsets where Large Language Models (LLMs) perform worse on commonsense-knowledge instances.
[NLP-10] How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching?
[Quick Read]: This paper addresses how the lack of standardization in text preprocessing pipelines leads to diverse mapping results in ontology matching (OM) on the Semantic Web, studying the effect of the pipeline on OM tasks at the syntactic level. The key findings across 8 OAEI track repositories and 49 alignments: (1) Tokenisation and Normalisation are currently more effective than Stop Words Removal and Stemming/Lemmatisation; (2) the choice between Lemmatisation and Stemming is task-specific, and the authors recommend standalone Lemmatisation or Stemming with post-hoc corrections; (3) the Porter and Snowball stemmers outperform the Lancaster stemmer; and (4) a novel context-based pipeline repair approach significantly improves matching correctness and overall matching performance. The paper also discusses the role of text preprocessing pipelines in the era of large language models (LLMs).
Link: https://arxiv.org/abs/2411.03962
Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
Keywords-EN: Stop Words Removal, Words Removal, Stop Words, text preprocessing pipeline, generic text preprocessing
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 26 figures, 4 tables
Abstract:The generic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many ontology matching (OM) systems. However, the lack of standardisation in text preprocessing creates diversity in mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on OM tasks at syntactic levels. Our experiments on 8 Ontology Alignment Evaluation Initiative (OAEI) track repositories with 49 distinct alignments indicate: (1) Tokenisation and Normalisation are currently more effective than Stop Words Removal and Stemming/Lemmatisation; and (2) The selection of Lemmatisation and Stemming is task-specific. We recommend standalone Lemmatisation or Stemming with post-hoc corrections. We find that (3) Porter Stemmer and Snowball Stemmer perform better than Lancaster Stemmer; and that (4) Part-of-Speech (POS) Tagging does not help Lemmatisation. To repair less effective Stop Words Removal and Stemming/Lemmatisation used in OM tasks, we propose a novel context-based pipeline repair approach that significantly improves matching correctness and overall matching performance. We also discuss the use of text preprocessing pipeline in the new era of large language models (LLMs).
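The generic pipeline under study maps naturally onto standard toolkits. Below is a minimal sketch using NLTK (one possible toolkit, not the authors' code), wiring together the four stages and the paper's recommended stemmer choices:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

# One-time downloads: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("wordnet")

def preprocess(label: str, reducer: str = "porter"):
    tokens = nltk.word_tokenize(label)                     # 1. tokenisation
    tokens = [t.lower() for t in tokens if t.isalnum()]    # 2. normalisation
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]         # 3. stop words removal
    if reducer == "porter":                                # 4a. stemming
        reduce = PorterStemmer().stem
    elif reducer == "snowball":
        reduce = SnowballStemmer("english").stem
    else:                                                  # 4b. lemmatisation
        reduce = WordNetLemmatizer().lemmatize
    return [reduce(t) for t in tokens]

print(preprocess("has an electronic mail address", reducer="snowball"))
```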
[NLP-11] Interactions Across Blocks in Post-Training Quantization of Large Language Models
[Quick Read]: This paper addresses the quantization error introduced by local optimization strategies in weight-only post-training quantization of large language models, where each layer or block is quantized in isolation. The key is two multi-block fine-tuning strategies: the first jointly optimizes multiple quantized blocks to capture weight correlations across blocks; the second incorporates knowledge of subsequent blocks by minimizing downstream pre-activation error rather than focusing only on the block being quantized. The effectiveness of both strategies turns out to depend on the specific network model: some models see no impact while others benefit significantly.
Link: https://arxiv.org/abs/2411.03934
Authors: Khasmamad Shabanovi, Lukas Wiest, Vladimir Golkov, Daniel Cremers, Thomas Pfeil
Keywords-EN: Post-training quantization, widely employed, employed to reduce, reduce the computational, computational demands
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Post-training quantization is widely employed to reduce the computational demands of neural networks. Typically, individual substructures, such as layers or blocks of layers, are quantized with the objective of minimizing quantization errors in their pre-activations by fine-tuning the corresponding weights. Deriving this local objective from the global objective of minimizing task loss involves two key simplifications: assuming substructures are mutually independent and ignoring the knowledge of subsequent substructures as well as the task loss. In this work, we assess the effects of these simplifications on weight-only quantization of large language models. We introduce two multi-block fine-tuning strategies and compare them against the baseline of fine-tuning single transformer blocks. The first captures correlations of weights across blocks by jointly optimizing multiple quantized blocks. The second incorporates knowledge of subsequent blocks by minimizing the error in downstream pre-activations rather than focusing solely on the quantized block. Our findings indicate that the effectiveness of these methods depends on the specific network model, with no impact on some models but demonstrating significant benefits for others.
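The first strategy, jointly optimizing a window of quantized blocks against the full-precision path, can be sketched as a single loss term. The code below is a schematic illustration with stand-in linear blocks, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def joint_block_loss(float_blocks, quant_blocks, x):
    """Sketch of joint multi-block PTQ fine-tuning: run the same input
    through a window of full-precision blocks and through the
    corresponding quantized blocks, and match the pre-activations at
    the end of the window, so cross-block weight interactions are
    captured (instead of calibrating each block in isolation)."""
    with torch.no_grad():
        ref = x
        for blk in float_blocks:   # frozen full-precision reference path
            ref = blk(ref)
    out = x
    for blk in quant_blocks:       # these carry the trainable quantized weights
        out = blk(out)
    return F.mse_loss(out, ref)

# Toy usage: a window of two "transformer blocks" (linear stand-ins).
fp = [torch.nn.Linear(16, 16) for _ in range(2)]
q = [torch.nn.Linear(16, 16) for _ in range(2)]
loss = joint_block_loss(fp, q, torch.randn(4, 16))
loss.backward()
```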
[NLP-12] Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
[Quick Read]: This paper examines how evaluation data contamination affects benchmark scores of large language models (LLMs). The key is a new analysis method, ConTAM, which assesses contamination metrics by whether models actually benefit from the examples those metrics mark as contaminated. A large-scale survey of existing and novel n-gram-based contamination metrics across 13 benchmarks and 7 models from 2 families shows that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales. Considering only the longest contaminated substring gives a better signal than the union of all contaminated substrings, model- and benchmark-specific threshold analysis greatly increases specificity, and poor hyperparameter choices (e.g., large n, or discarding matches that are infrequent in the pre-training data) produce many false negatives. ConTAM thus grounds contamination metrics empirically in downstream effects and yields concrete recommendations for future work.
Link: https://arxiv.org/abs/2411.03923
Authors: Aaditya K. Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, Dieuwke Hupkes
Keywords-EN: evaluation data contamination, Hampering the interpretation, data contamination, evaluation data, contamination
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Hampering the interpretation of benchmark scores, evaluation data contamination has become a growing concern in the evaluation of LLMs, and an active area of research studies its effects. While evaluation data contamination is easily understood intuitively, it is surprisingly difficult to define precisely which samples should be considered contaminated and, consequently, how it impacts benchmark scores. We propose that these questions should be addressed together and that contamination metrics can be assessed based on whether models benefit from the examples they mark contaminated. We propose a novel analysis method called ConTAM, and show with a large scale survey of existing and novel n-gram based contamination metrics across 13 benchmarks and 7 models from 2 different families that ConTAM can be used to better understand evaluation data contamination and its effects. We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales. We also find that considering only the longest contaminated substring provides a better signal than considering a union of all contaminated substrings, and that doing model and benchmark specific threshold analysis greatly increases the specificity of the results. Lastly, we investigate the impact of hyperparameter choices, finding that, among other things, both using larger values of n and disregarding matches that are infrequent in the pre-training data lead to many false negatives. With ConTAM, we provide a method to empirically ground evaluation data contamination metrics in downstream effects. With our exploration, we shed light on how evaluation data contamination can impact LLMs and provide insight into the considerations important when doing contamination analysis. We end our paper by discussing these in more detail and providing concrete suggestions for future work.
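An n-gram contamination metric of the kind surveyed here is compact to sketch. The function below (our illustration, not ConTAM itself) reports the longest run of n-grams in an evaluation sample that also occur in the pre-training data, reflecting the paper's finding that the longest contaminated span is a better signal than the union of all matches:

```python
def longest_contaminated_span(sample_tokens, pretrain_ngrams, n=8):
    """Length (in tokens) of the longest contiguous span of the
    evaluation sample whose overlapping n-grams all appear in the
    pre-training data."""
    best = cur = 0
    for i in range(len(sample_tokens) - n + 1):
        gram = tuple(sample_tokens[i:i + n])
        cur = cur + 1 if gram in pretrain_ngrams else 0
        best = max(best, cur)
    return (best + n - 1) if best else 0

# Toy usage: index the pre-training corpus as a set of 4-grams.
corpus = "the cat sat on the mat and looked around".split()
pretrain = {tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3)}
sample = "yesterday the cat sat on the mat quietly".split()
print(longest_contaminated_span(sample, pretrain, n=4))  # 6 tokens
```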
[NLP-13] RAGulator: Lightweight Out-of-Context Detectors for Grounded Text Generation
[Quick Read]: For enterprises adopting retrieval-augmented generation (RAG), this paper targets real-time detection of LLM outputs that are semantically out-of-context with respect to the retrieved documents. The key is to train lightweight models to discriminate such out-of-context LLM-generated text from retrieved text, constructing training data cheaply by preprocessing a combination of summarisation and semantic textual similarity datasets. DeBERTa turns out to be the best-performing model in this pipeline; it is also fast and needs no extra text preprocessing or feature engineering, which matters for on-premise deployment where speed and resource limits are important considerations.
Link: https://arxiv.org/abs/2411.03920
Authors: Ian Poey, Jiajun Liu, Qishuai Zhong, Adrien Chenailler
Keywords-EN: adopt RAG applications, safely adopt RAG, RAG applications, Real-time detection, adopt RAG
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Real-time detection of out-of-context LLM outputs is crucial for enterprises looking to safely adopt RAG applications. In this work, we train lightweight models to discriminate LLM-generated text that is semantically out-of-context from retrieved text documents. We preprocess a combination of summarisation and semantic textual similarity datasets to construct training data using minimal resources. We find that DeBERTa is not only the best-performing model under this pipeline, but it is also fast and does not require additional text preprocessing or feature engineering. While emerging work demonstrates that generative LLMs can also be fine-tuned and used in complex data pipelines to achieve state-of-the-art performance, we note that speed and resource limits are important considerations for on-premise deployment.
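The detector setup described above amounts to a sequence-pair classifier. A minimal sketch with Hugging Face Transformers follows; the checkpoint name, label convention, and scoring of (generated, retrieved) pairs are our assumptions, and the classification head is randomly initialized until fine-tuned on the constructed data:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)  # 0 = grounded, 1 = out-of-context

def ooc_score(generated: str, retrieved: str) -> float:
    """Probability that the generated sentence is semantically
    out-of-context with respect to the retrieved passage."""
    inputs = tok(generated, retrieved, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(ooc_score("Revenue doubled in 2023.", "The report covers 2021 figures."))
```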
[NLP-14] Lexicalization Is All You Need: Examining the Impact of Lexical Knowledge in a Compositional QALD System
[Quick Read]: This paper addresses the lexical gap between natural language questions and SPARQL queries in Question Answering over Linked Data (QALD). The key is lexicalization: explicit knowledge about the potential interpretations of a word with respect to a given vocabulary. The authors present a compositional QA system that leverages explicit lexical knowledge compositionally to infer the meaning of a question as a SPARQL query, improving the micro F1 score by up to 35.8% over the best system on QALD-9, which highlights the importance and potential of lexicalization and compositionality. In contrast, LLMs show only marginal gains from lexical knowledge, indicating limited ability to interpret a question compositionally from the meaning of its parts.
Link: https://arxiv.org/abs/2411.03906
Authors: David Maria Schmidt, Mohammad Fazleh Elahi, Philipp Cimiano
Keywords-EN: Linked Data, Answering over Linked, Question Answering, lexical knowledge, explicit lexical knowledge
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 24th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands
Abstract:In this paper, we examine the impact of lexicalization on Question Answering over Linked Data (QALD). It is well known that one of the key challenges in interpreting natural language questions with respect to SPARQL lies in bridging the lexical gap, that is mapping the words in the query to the correct vocabulary elements. We argue in this paper that lexicalization, that is explicit knowledge about the potential interpretations of a word with respect to the given vocabulary, significantly eases the task and increases the performance of QA systems. Towards this goal, we present a compositional QA system that can leverage explicit lexical knowledge in a compositional manner to infer the meaning of a question in terms of a SPARQL query. We show that such a system, given lexical knowledge, has a performance well beyond current QA systems, achieving up to a 35.8% increase in the micro F_1 score compared to the best QA system on QALD-9. This shows the importance and potential of including explicit lexical knowledge. In contrast, we show that LLMs have limited abilities to exploit lexical knowledge, with only marginal improvements compared to a version without lexical knowledge. This shows that LLMs have no ability to compositionally interpret a question on the basis of the meaning of its parts, a key feature of compositional approaches. Taken together, our work shows new avenues for QALD research, emphasizing the importance of lexicalization and compositionality.
[NLP-15] Computational Analysis of Gender Depiction in the Comedias of Calderon de la Barca
[Quick Read]: This paper develops quantitative methods to study gender depiction in the comedias (non-religious plays) of Pedro Calderón de la Barca, a prolific 17th-century Spanish playwright. The key is to combine a gender classifier with model explainability (attribution) methods over a corpus of more than 100 plays, identifying which text features most influence the model's decision to classify speech as 'male' or 'female', and thereby surfacing the most gendered elements of dialogue in a human-accessible way. Female and male characters turn out to be portrayed differently and can be identified at practically useful accuracy (up to f=0.83); the analysis reveals semantic aspects of gender portrayal, and the model even gives relatively accurate scene-by-scene predictions of cross-dressing characters.
Link: https://arxiv.org/abs/2411.03895
Authors: Allison Keith, Antonio Rojas Castro, Sebastian Padó
Keywords-EN: explore culturally based, based gender norms, culturally based gender, explore culturally, culturally based
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In theatre, playwrights use the portrayal of characters to explore culturally based gender norms. In this paper, we develop quantitative methods to study gender depiction in the non-religious works (comedias) of Pedro Calderón de la Barca, a prolific Spanish 17th century author. We gather insights from a corpus of more than 100 plays by using a gender classifier and applying model explainability (attribution) methods to determine which text features are most influential in the model’s decision to classify speech as ‘male’ or ‘female’, indicating the most gendered elements of dialogue in Calderón’s comedias in a human accessible manner. We find that female and male characters are portrayed differently and can be identified by the gender prediction model at practically useful accuracies (up to f=0.83). Analysis reveals semantic aspects of gender portrayal, and demonstrates that the model is even useful in providing a relatively accurate scene-by-scene prediction of cross-dressing characters.
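Attribution over a text classifier, as used here, can be illustrated with a simple occlusion scheme (one common attribution method; the paper's exact choice may differ): score each token by how much the target-class probability drops when the token is removed. `predict_proba` is a hypothetical wrapper around the trained gender classifier, with a toy stand-in below:

```python
def occlusion_attribution(tokens, predict_proba, target="female"):
    """Occlusion-style attribution: the importance of each token is
    the drop in the target-class probability when that token is
    removed from the input."""
    base = predict_proba(tokens)[target]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append((tokens[i], base - predict_proba(reduced)[target]))
    return sorted(scores, key=lambda p: -p[1])

# Toy stand-in for the trained classifier, for illustration only.
CUES = {"reina": 0.3, "esposa": 0.25, "honor": -0.2}
def toy_proba(tokens):
    p = min(max(0.5 + sum(CUES.get(t, 0.0) for t in tokens), 0.0), 1.0)
    return {"female": p, "male": 1 - p}

print(occlusion_attribution("la reina habla del honor".split(), toy_proba))
```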
[NLP-16] Multi3Hate: Multimodal Multilingual and Multicultural Hate Speech Detection with Vision-Language Models
[Quick Read]: This paper addresses hate speech detection in multimodal, multilingual settings, where the perception of hate varies across cultures. The key contribution is Multi3Hate, the first multimodal and multilingual parallel hate speech dataset annotated by a multicultural set of annotators: 300 parallel meme samples across 5 languages (English, German, Spanish, Hindi, and Mandarin). Cultural background significantly affects multimodal hate speech annotation: average pairwise agreement between countries is only 74%, significantly lower than that of randomly selected annotator groups, with the lowest pairwise agreement (only 67%, between the USA and India) attributable to cultural factors. Zero-shot experiments with 5 large vision-language models (VLMs) show these models align more closely with US annotations than with those of other cultures, even when memes and prompts are presented in the dominant language of the other culture.
Link: https://arxiv.org/abs/2411.03888
Authors: Minh Duc Bui, Katharina von der Wense, Anne Lauscher
Keywords-EN: offensive or upsetting, Warning, Hate speech, multimodal hate speech, Hate speech moderation
Subjects: Computation and Language (cs.CL)
Comments: Preprint
Abstract:Warning: this paper contains content that may be offensive or upsetting. Hate speech moderation on global platforms poses unique challenges due to the multimodal and multilingual nature of content, along with the varying cultural perceptions. How well do current vision-language models (VLMs) navigate these nuances? To investigate this, we create the first multimodal and multilingual parallel hate speech dataset, annotated by a multicultural set of annotators, called Multi3Hate. It contains 300 parallel meme samples across 5 languages: English, German, Spanish, Hindi, and Mandarin. We demonstrate that cultural background significantly affects multimodal hate speech annotation in our dataset. The average pairwise agreement among countries is just 74%, significantly lower than that of randomly selected annotator groups. Our qualitative analysis indicates that the lowest pairwise label agreement, only 67% between the USA and India, can be attributed to cultural factors. We then conduct experiments with 5 large VLMs in a zero-shot setting, finding that these models align more closely with annotations from the US than with those from other cultures, even when the memes and prompts are presented in the dominant language of the other culture. Code and dataset are available at this https URL.
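The 74% figure above is an average pairwise agreement, which is straightforward to compute from per-country labels. A toy sketch:

```python
from itertools import combinations

def mean_pairwise_agreement(labels_by_country):
    """Average pairwise agreement between country-level labels over
    the same samples. `labels_by_country` maps country -> list of
    0/1 labels."""
    pairs = list(combinations(labels_by_country, 2))
    total = 0.0
    for a, b in pairs:
        la, lb = labels_by_country[a], labels_by_country[b]
        total += sum(x == y for x, y in zip(la, lb)) / len(la)
    return total / len(pairs)

labels = {"US": [1, 0, 1, 1], "DE": [1, 0, 0, 1], "IN": [0, 0, 1, 1]}
print(mean_pairwise_agreement(labels))  # about 0.67 on this toy example
```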
[NLP-17] Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
[Quick Read]: This paper targets the limited nonlinearity of activations such as ReLU in existing transformer architectures. The key is PolyCom, a new category of polynomial composition activations designed to optimize transformer dynamics and increase expressive power. Theoretically, networks with PolyCom achieve the optimal approximation rate, meaning they need minimal parameters to approximate general smooth functions in Sobolev spaces. Empirically, replacing conventional activations with PolyCom in LLM pre-training (dense and sparse architectures) lets models capture higher-order interactions in the data, with significant gains in accuracy and convergence speed.
Link: https://arxiv.org/abs/2411.03884
Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
Keywords-EN: powerful fitting capabilities, found extensive applications, fitting capabilities, domains due, powerful fitting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Transformers have found extensive applications across various domains due to the powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the optimal approximation rate, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at this https URL.
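The paper defines the exact PolyCom forms; as a generic illustration only, the module below implements one plausible polynomial-composition activation, a learnable combination of powers of a base activation, which can be dropped into a transformer MLP in place of ReLU:

```python
import torch
import torch.nn as nn

class PolyActivation(nn.Module):
    """Illustrative polynomial-composition activation (our assumed
    form, not necessarily the paper's exact one): a learnable
    combination sum_i a_i * rho(x)**i of powers of a base activation."""
    def __init__(self, order: int = 3, base=torch.relu):
        super().__init__()
        self.base = base
        self.coeffs = nn.Parameter(torch.ones(order + 1) / (order + 1))

    def forward(self, x):
        y = self.base(x)
        out = torch.zeros_like(y)
        for i, a in enumerate(self.coeffs):
            out = out + a * y.pow(i)
        return out

# Drop-in replacement inside a transformer MLP block.
mlp = nn.Sequential(nn.Linear(64, 256), PolyActivation(), nn.Linear(256, 64))
print(mlp(torch.randn(2, 64)).shape)
```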
[NLP-18] MEG: Medical Knowledge-Augmented Large Language Models for Question Answering
[Quick Read]: This paper addresses the weakness of large language models (LLMs) on specialized medical knowledge, in particular their difficulty inducing how concepts relate, together with the high cost of training medical LLMs. The key is MEG, a parameter-efficient approach for medical knowledge-augmented LLMs: a lightweight mapping network integrates graph embeddings into the LLM, letting it leverage external knowledge cost-effectively. On four popular medical multiple-choice datasets, MEG gains an average of +10.2% accuracy over the Mistral-Instruct baseline and +6.7% over specialized models such as BioMistral, with results also reported for Llama-3, and performance remains robust to the choice of graph encoder.
Link: https://arxiv.org/abs/2411.03883
Authors: Laura Cabello, Carmen Martin-Turrero, Uchenna Akujuobi, Anders Søgaard, Carlos Bobed
Keywords-EN: natural language understanding, language understanding task, Question answering, question answering systems, context and unstated
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Question answering is a natural language understanding task that involves reasoning over both explicit context and unstated, relevant domain knowledge. Large language models (LLMs), which underpin most contemporary question answering systems, struggle to induce how concepts relate in specialized domains such as medicine. Existing medical LLMs are also costly to train. In this work, we present MEG, a parameter-efficient approach for medical knowledge-augmented LLMs. MEG uses a lightweight mapping network to integrate graph embeddings into the LLM, enabling it to leverage external knowledge in a cost-effective way. We evaluate our method on four popular medical multiple-choice datasets and show that LLMs greatly benefit from the factual grounding provided by knowledge graph embeddings. MEG attains an average of +10.2% accuracy over the Mistral-Instruct baseline, and +6.7% over specialized models like BioMistral. We also show results based on Llama-3. Finally, we show that MEG’s performance remains robust to the choice of graph encoder.
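MEG's central component, a lightweight mapping network from graph embeddings into the LLM, can be sketched as a small MLP producing soft tokens. All dimensions and the two-layer shape below are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class GraphToLLMMapper(nn.Module):
    """Sketch of a mapping network in the spirit of MEG: project
    knowledge-graph entity embeddings into the LLM's token embedding
    space so they can be prepended as soft tokens."""
    def __init__(self, kg_dim=256, llm_dim=4096, n_tokens=1):
        super().__init__()
        self.n_tokens = n_tokens
        self.net = nn.Sequential(
            nn.Linear(kg_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim * n_tokens))

    def forward(self, kg_emb):                 # (batch, kg_dim)
        out = self.net(kg_emb)                 # (batch, llm_dim * n_tokens)
        return out.view(kg_emb.size(0), self.n_tokens, -1)

soft_tokens = GraphToLLMMapper()(torch.randn(2, 256))
print(soft_tokens.shape)  # (2, 1, 4096): prepend to the LLM input embeddings
```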
[NLP-19] Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward ICASSP2025
[Quick Read]: This paper evaluates how robust LLM-based automatic speech recognition (ASR) is across scenarios and speech conditions, in particular under domain shifts and speech perturbations. The key is a series of ablation experiments on SLAM-ASR, a recent and widely adopted architecture that trains a linear connector between a speech foundation encoder and an LLM. The main findings: SLAM-ASR performs poorly in cross-domain evaluation, and in-domain speech perturbations such as speed changes or additive noise significantly hurt performance. These findings offer critical guidance for fine-tuning and configuring robust LLM-based ASR models tailored to data characteristics and computational resources.
Link: https://arxiv.org/abs/2411.03866
Authors: Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke
Keywords-EN: strong ASR capabilities, achieve strong ASR, speech foundation encoders, large language models, research has demonstrated
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025 SALMA Workshop
Abstract:Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and different speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that the SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations within in-domain data, such as changes in speed or the presence of additive noise, can significantly impact performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
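The linear connector at the heart of SLAM-ASR is small enough to sketch in full. The frame-stacking factor and dimensions below are illustrative assumptions, not the exact published configuration:

```python
import torch
import torch.nn as nn

class SpeechLLMConnector(nn.Module):
    """Sketch of a SLAM-ASR-style connector: downsample the speech
    encoder's frame sequence by stacking adjacent frames, then project
    linearly into the LLM embedding space."""
    def __init__(self, enc_dim=1024, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, feats):                 # (batch, frames, enc_dim)
        b, t, d = feats.shape
        t = (t // self.stack) * self.stack    # drop ragged tail frames
        feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(feats)               # (batch, t/stack, llm_dim)

emb = SpeechLLMConnector()(torch.randn(1, 50, 1024))
print(emb.shape)  # (1, 12, 4096): fed to the LLM alongside the text prompt
```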
[NLP-20] MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba
[Quick Read]: This paper addresses parameter-efficient fine-tuning (PEFT) for Mamba, a State Space Model (SSM)-based alternative to Transformers for which efficient downstream adaptation remains unexplored. The key: an exploratory analysis of existing Transformer-oriented PEFT methods applied to Mamba, modifications that better align them with the Mamba architecture, and new Mamba-specific PEFT methods exploiting its distinctive structure. Experiments indicate PEFT works even better for Mamba than for Transformers, and the paper shows how to combine multiple PEFT methods effectively, providing a framework that outperforms previous work.
Link: https://arxiv.org/abs/2411.03855
Authors: Masakazu Yoshimura, Teruaki Hayashi, Yota Maeda
Keywords-EN: ecosystem of Transformer-based, building large models, Transformer-based models, extensive data, PEFT methods
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:An ecosystem of Transformer-based models has been established by building large models with extensive data. Parameter-efficient fine-tuning (PEFT) is a crucial technology for deploying these models to downstream tasks with minimal cost while achieving effective performance. Recently, Mamba, a State Space Model (SSM)-based model, has attracted attention as a potential alternative to Transformers. While many large-scale Mamba-based models have been proposed, efficiently adapting pre-trained Mamba-based models to downstream tasks remains unexplored. In this paper, we conduct an exploratory analysis of PEFT methods for Mamba. We investigate the effectiveness of existing PEFT methods for Transformers when applied to Mamba. We also modify these methods to better align with the Mamba architecture. Additionally, we propose new Mamba-specific PEFT methods that leverage the distinctive structure of Mamba. Our experiments indicate that PEFT performs more effectively for Mamba than Transformers. Lastly, we demonstrate how to effectively combine multiple PEFT methods and provide a framework that outperforms previous works. To ensure reproducibility, we will release the code after publication.
[NLP-21] From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning
[Quick Read]: This paper tackles the sparse-reward problem LLM agents face on complex interactive tasks: existing datasets provide only a final scalar reward per multi-step reasoning chain, which can make policy learning ineffective and inefficient. The key is StepAgent, which optimizes the agent's reinforcement learning with step-wise rewards. Following a novice-to-expert perspective, it compares expert and agent actions to automatically generate intermediate rewards for fine-grained optimization, and adds implicit-reward and inverse reinforcement learning techniques to support agent reflection and policy adjustment. Theoretical analysis shows the agent's action distribution can converge toward the expert's over multiple training cycles, and experiments across various datasets show StepAgent outperforms existing baselines.
Link: https://arxiv.org/abs/2411.03817
Authors: Zhirui Deng, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen, Ruibin Xiong, Mang Wang, Weipeng Chen
Keywords-EN: large language models, autonomous agent systems, language models, outstanding capabilities, capabilities of large
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments:
Abstract:The outstanding capabilities of large language models (LLMs) render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward the reinforcement learning strategy to further enhance agents’ ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi-step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step-wise reward to optimize the agent’s reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.
[NLP-22] MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
[Quick Read]: This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks in multi-round dialogue, a setting that prior single-round jailbreak work largely overlooks even though multi-round interaction is how humans typically extract information from LLMs. The key is a novel multi-round dialogue jailbreaking agent that emphasizes stealthiness in identifying and mitigating the potential threats LLMs pose to human values: a risk decomposition strategy distributes risk across multiple rounds of queries, and psychological strategies are used to strengthen the attack. Extensive experiments show the method surpasses other attack methods and achieves a state-of-the-art attack success rate; the code and dataset are to be released for future research.
Link: https://arxiv.org/abs/2411.03814
Authors: Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, YueFeng Chen, Chongwen Wang, Jialing Tao, Hang Su, Jun Zhu, Hui Xue
Keywords-EN: Large Language Models, Large Language, Language Models, demonstrate outstanding performance, demonstrate outstanding
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:
Abstract:Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities, but they have also been shown to be prone to illegal or unethical reactions when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous works mainly focus on jailbreak in single-round dialogue, overlooking the potential jailbreak risks in multi-round dialogues, which are a vital way humans interact with and extract information from LLMs. Some studies have increasingly concentrated on the risks associated with jailbreak in multi-round dialogues. These efforts typically involve the use of manually crafted templates or prompt engineering techniques. However, due to the inherent complexity of multi-round dialogues, their jailbreak performance is limited. To solve this problem, we propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values posed by LLMs. We propose a risk decomposition strategy that distributes risks across multiple rounds of queries and utilizes psychological strategies to enhance attack strength. Extensive experiments show that our proposed method surpasses other attack methods and achieves state-of-the-art attack success rate. We will make the corresponding code and dataset available for future research. The code will be released soon.
[NLP-23] The natural stability of autonomous morphology
[Quick Read]: This paper asks why autonomous morphology (e.g., inflection class systems and paradigmatic distribution patterns) is so widespread and diachronically resilient in natural language despite imposing learning costs. The key is a diachronic dynamic of attraction and repulsion between morphomic categories that emerges spontaneously from a simple paradigm cell filling process. Using computational evolutionary models, the central innovation is to highlight the role of 'dissociative evidence', evidence for inflectional distinctiveness available to a rational reasoner during analogical inference, which creates a repulsion dynamic preventing morphomic classes from collapsing together entirely, i.e. undergoing complete levelling. The paper also exposes the limits of conditional entropy as a predictability measure in changing systems, and argues that autonomous morphology is the natural (emergent) consequence of a natural (rational) process of inference applied to inflectional systems.
Link: https://arxiv.org/abs/2411.03811
Authors: Erich Round, Louise Esher, Sacha Beniamine
Keywords-EN: paradigmatic distribution patterns, Autonomous morphology, inflection class systems, distribution patterns, inflection class
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication by the journal Morphology
Abstract:Autonomous morphology, such as inflection class systems and paradigmatic distribution patterns, is widespread and diachronically resilient in natural language. Why this should be so has remained unclear given that autonomous morphology imposes learning costs, offers no clear benefit relative to its absence and could easily be removed by the analogical forces which are constantly reshaping it. Here we propose an explanation for the resilience of autonomous morphology, in terms of a diachronic dynamic of attraction and repulsion between morphomic categories, which emerges spontaneously from a simple paradigm cell filling process. Employing computational evolutionary models, our key innovation is to bring to light the role of 'dissociative evidence', i.e., evidence for inflectional distinctiveness which a rational reasoner will have access to during analogical inference. Dissociative evidence creates a repulsion dynamic which prevents morphomic classes from collapsing together entirely, i.e., undergoing complete levelling. As we probe alternative models, we reveal the limits of conditional entropy as a measure for predictability in systems that are undergoing change. Finally, we demonstrate that autonomous morphology, far from being 'unnatural' (e.g., Aronoff 1994), is rather the natural (emergent) consequence of a natural (rational) process of inference applied to inflectional systems.
摘要:自主形态学,如屈折类系统和范例分布模式,在自然语言中广泛存在且具有历时韧性。鉴于自主形态学增加了学习成本,相对于其缺失并未提供明显优势,并且容易被不断重塑其的类比力量所消除,为何其仍然存在一直未明。在此,我们提出了一种解释,即自主形态学的韧性源于形态类别之间历时动态的吸引与排斥,这种动态自发地从简单的范例单元填充过程中产生。通过计算进化模型,我们的关键创新在于揭示了“分离证据”的作用,即在类比推理过程中理性推理者可获得的屈折独特性证据。分离证据产生了一种排斥动态,防止形态类别完全合并,即完全趋同。在探究替代模型时,我们揭示了条件熵作为变化系统中可预测性度量的局限性。最后,我们证明自主形态学远非“不自然”(如 Aronoff 1994 年所述),而是屈折系统中自然(理性)推理过程的自然(涌现)结果。
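摘要中将条件熵作为屈折系统可预测性的度量加以讨论。下面给出一个最小的 Python 示例,演示如何从(虚构的)词形范例数据中估计 H(B|A),即已知范例单元 A 的实现时单元 B 的剩余不确定性;示例数据与名称均为假设,仅用于说明该度量本身的计算方式,并非论文的原始模型。

```python
from collections import Counter
from math import log2

# 虚构的范例数据:每个词位给出两个范例单元的实现,
# 例如 (第一人称单数词尾, 第二人称单数词尾)
lexemes = [
    ("-o", "-as"), ("-o", "-as"), ("-o", "-es"),
    ("-io", "-is"), ("-io", "-is"), ("-o", "-es"),
]

def conditional_entropy(pairs):
    """由计数估计 H(B|A) = sum_a p(a) * H(B | A=a)。"""
    joint = Counter(pairs)                 # (a, b) 联合计数
    marg_a = Counter(a for a, _ in pairs)  # a 的边缘计数
    n = len(pairs)
    h = 0.0
    for (a, b), c_ab in joint.items():
        p_ab = c_ab / n
        p_b_given_a = c_ab / marg_a[a]
        h -= p_ab * log2(p_b_given_a)
    return h

print(f"H(2SG | 1SG) = {conditional_entropy(lexemes):.3f} bits")
```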
[NLP-24] Understanding the Effects of Human-written Paraphrases in LLM-generated Text Detection
【速读】: 该论文试图解决的问题是如何在包含人类撰写和LLM生成的文本及其改写版本的情况下,评估和提升现有LLM生成文本检测模型的性能。解决方案的关键在于设计了一种新的数据收集策略,构建了首个包含人类撰写文本及其改写版本、LLM生成文本及其改写版本的Human LLM Paraphrase Collection (HLPC)数据集。通过使用这一数据集,研究了人类撰写文本的改写对现有LLM生成文本检测模型(如OpenAI RoBERTa和watermark detectors)性能的影响,结果表明,引入人类撰写的改写文本显著影响了LLM生成文本检测模型的性能,特别是在True Positive Rate (TPR)和False Positive Rate (FPR)的平衡上,尽管可能会牺牲部分AUROC和准确性。
链接: https://arxiv.org/abs/2411.03806
作者: Hiu Ting Lau,Arkaitz Zubiaga
关键词-EN: Natural Language Generation, Natural Language, Language Generation, large language models, large language
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Natural Language Generation has been rapidly developing with the advent of large language models (LLMs). While their usage has sparked significant attention from the general public, it is important for readers to be aware when a piece of text is LLM-generated. This has brought about the need for building models that enable automated LLM-generated text detection, with the aim of mitigating potential negative outcomes of such content. Existing LLM-generated detectors show competitive performances in telling apart LLM-generated and human-written text, but this performance is likely to deteriorate when paraphrased texts are considered. In this study, we devise a new data collection strategy to collect Human LLM Paraphrase Collection (HLPC), a first-of-its-kind dataset that incorporates human-written texts and paraphrases, as well as LLM-generated texts and paraphrases. With the aim of understanding the effects of human-written paraphrases on the performance of state-of-the-art LLM-generated text detectors OpenAI RoBERTa and watermark detectors, we perform classification experiments that incorporate human-written paraphrases, watermarked and non-watermarked LLM-generated documents from GPT and OPT, and LLM-generated paraphrases from DIPPER and BART. The results show that the inclusion of human-written paraphrases has a significant impact on LLM-generated text detector performance, improving TPR@1%FPR with a possible trade-off in AUROC and accuracy.
摘要:随着大语言模型(LLMs)的出现,自然语言生成技术得到了迅速发展。尽管这些模型的应用引起了公众的广泛关注,但读者在阅读由 LLM 生成的文本时,应具备识别能力。这促使了构建能够自动检测 LLM 生成文本的模型,以减轻此类内容可能带来的负面影响。现有的 LLM 生成文本检测器在区分 LLM 生成文本和人类撰写文本方面表现出较强的竞争力,但当考虑改写文本时,其性能可能会下降。在本研究中,我们设计了一种新的数据收集策略,以收集人类 LLM 改写集合(HLPC),这是一个首创的数据集,包含了人类撰写的文本及其改写版本,以及 LLM 生成的文本及其改写版本。为了理解人类撰写的改写文本对最先进的 LLM 生成文本检测器 OpenAI RoBERTa 和水印检测器性能的影响,我们进行了分类实验,这些实验纳入了人类撰写的改写文本、来自 GPT 和 OPT 的带水印和不带水印的 LLM 生成文档,以及来自 DIPPER 和 BART 的 LLM 生成改写文本。结果显示,人类撰写的改写文本的加入对 LLM 生成文本检测器的性能有显著影响,促进了 TPR@1%FPR,但可能以 AUROC 和准确性为代价。
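文中报告的 TPR@1%FPR 指标指在假阳性率固定为 1% 时的真阳性率。下面给出一个示意性的 Python 计算草图;检测分数与标签均为人工生成的假设数据,并约定标签 1 表示 LLM 生成文本。

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """固定假阳性率下的真阳性率:阈值取负类分数的 (1 - target_fpr) 分位数,
    再统计正类(标签 1 = LLM 生成)中高于该阈值的比例。"""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    threshold = np.quantile(scores[labels == 0], 1.0 - target_fpr)
    return float((scores[labels == 1] > threshold).mean())

# 人工生成的假设分数:负类(人类文本)与正类(LLM 文本)各 500 条
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
labels = np.array([0] * 500 + [1] * 500)
print(f"TPR@1%FPR = {tpr_at_fpr(scores, labels):.3f}")
```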
[NLP-25] A Comparative Study of Recent Large Language Models on Generating Hospital Discharge Summaries for Lung Cancer Patients
【速读】: 该论文试图解决临床实践中生成出院总结(discharge summaries)这一耗时且关键的任务,旨在通过利用大型语言模型(LLMs)来减轻手动总结的负担,提高工作流程效率,并支持医疗环境中的决策制定。解决方案的关键在于评估和比较多种LLMs(如GPT-3.5、GPT-4、GPT-4o、LLaMA 3 8b)在生成出院总结方面的性能,通过token-level分析(如BLEU、ROUGE-1、ROUGE-2、ROUGE-L)和语义相似度评分来衡量模型生成的总结与医生编写的标准之间的匹配度。研究发现,GPT-4o和经过微调的LLaMA 3在token-level评估指标和语义相似度方面表现优异,而LLaMA 3在不同输入长度下均能生成简洁且相关的总结,显示出其在临床环境中的稳定性和有效性。
链接: https://arxiv.org/abs/2411.03805
作者: Yiming Li,Fang Li,Kirk Roberts,Licong Cui,Cui Tao,Hua Xu
关键词-EN: conveying pertinent patient, pertinent patient information, Generating discharge summaries, essential for conveying, crucial yet time-consuming
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Generating discharge summaries is a crucial yet time-consuming task in clinical practice, essential for conveying pertinent patient information and facilitating continuity of care. Recent advancements in large language models (LLMs) have significantly enhanced their capability in understanding and summarizing complex medical texts. This research aims to explore how LLMs can alleviate the burden of manual summarization, streamline workflow efficiencies, and support informed decision-making in healthcare settings. Clinical notes from a cohort of 1,099 lung cancer patients were utilized, with a subset of 50 patients for testing purposes, and 102 patients used for model fine-tuning. This study evaluates the performance of multiple LLMs, including GPT-3.5, GPT-4, GPT-4o, and LLaMA 3 8b, in generating discharge summaries. Evaluation metrics included token-level analysis (BLEU, ROUGE-1, ROUGE-2, ROUGE-L) and semantic similarity scores between model-generated summaries and physician-written gold standards. LLaMA 3 8b was further tested on clinical notes of varying lengths to examine the stability of its performance. The study found notable variations in summarization capabilities among LLMs. GPT-4o and fine-tuned LLaMA 3 demonstrated superior token-level evaluation metrics, while LLaMA 3 consistently produced concise summaries across different input lengths. Semantic similarity scores indicated GPT-4o and LLaMA 3 as leading models in capturing clinical relevance. This study contributes insights into the efficacy of LLMs for generating discharge summaries, highlighting LLaMA 3’s robust performance in maintaining clarity and relevance across varying clinical contexts. These findings underscore the potential of automated summarization tools to enhance documentation precision and efficiency, ultimately improving patient care and operational capability in healthcare settings.
摘要:生成出院总结是临床实践中一项至关重要的任务,尽管耗时,但对于传达相关患者信息和促进护理连续性至关重要。大语言模型(LLMs)的最新进展显著提升了其理解和总结复杂医学文本的能力。本研究旨在探讨LLMs如何减轻手动总结的负担,提高工作流程效率,并支持医疗环境中的决策制定。研究使用了1,099名肺癌患者的临床笔记,其中50名患者用于测试,102名患者用于模型微调。本研究评估了多个LLMs(包括GPT-3.5、GPT-4、GPT-4o和LLaMA 3 8b)生成出院总结的性能。评估指标包括Token级别的分析(BLEU、ROUGE-1、ROUGE-2、ROUGE-L)以及模型生成总结与医生编写的黄金标准之间的语义相似度得分。LLaMA 3 8b进一步在不同长度的临床笔记上进行了测试,以检验其性能的稳定性。研究发现,LLMs在总结能力上存在显著差异。GPT-4o和微调后的LLaMA 3在Token级别评估指标上表现优异,而LLaMA 3在不同输入长度下始终能生成简洁的总结。语义相似度得分表明,GPT-4o和LLaMA 3在捕捉临床相关性方面处于领先地位。本研究为LLMs生成出院总结的有效性提供了见解,突显了LLaMA 3在不同临床情境下保持清晰度和相关性的稳健表现。这些发现强调了自动化总结工具在提高文档精度和效率方面的潜力,最终改善患者护理和医疗环境中的运营能力。
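作为对文中 Token 级指标与语义相似度评估思路的示意,下面给出一个极简的 Python 草图:手写的 ROUGE-1 F1(仅统计一元词重叠)和基于词频向量的余弦相似度;后者只是对论文中基于嵌入的语义得分的粗略替代,示例文本亦为虚构。

```python
from collections import Counter
import math

def rouge1_f1(candidate: str, reference: str) -> float:
    """简化版 ROUGE-1 F1:按空格分词统计一元词重叠。"""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def cosine_tf(candidate: str, reference: str) -> float:
    """词频向量余弦相似度,作为嵌入式语义得分的粗略替代。"""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    dot = sum(c[w] * r[w] for w in c)
    norm = math.sqrt(sum(v * v for v in c.values())) * \
           math.sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

gold = "patient admitted with stage II lung cancer completed two cycles of chemotherapy"
model = "the patient with stage II lung cancer was admitted and received two chemotherapy cycles"
print(f"ROUGE-1 F1: {rouge1_f1(model, gold):.3f}, cosine(TF): {cosine_tf(model, gold):.3f}")
```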
[NLP-26] No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages EMNLP24
【速读】: 该论文试图解决视觉与语言研究中多语言多样性不足的问题,特别是情感描述和跨文化表达的缺乏。解决方案的关键在于提出了ArtELingo-28,这是一个涵盖28种语言的视觉语言基准,包含约200,000个注释,每张图像有140个注释。与传统的视觉研究不同,ArtELingo-28强调语言和文化间的多样性,挑战在于构建能够为图像分配情感描述的机器学习系统。论文还提出了三种新颖的实验条件:零样本学习(Zero-Shot)、少样本学习(Few-Shot)和一对多零样本学习(One-vs-All Zero-Shot),并发现跨语言迁移在文化相关语言间更为成功。
链接: https://arxiv.org/abs/2411.03769
作者: Youssef Mohamed,Runjia Li,Ibrahim Said Ahmad,Kilichbek Haydarov,Philip Torr,Kenneth Ward Church,Mohamed Elhoseiny
关键词-EN: made considerable progress, Chinese and Arabic, made considerable, considerable progress, COCO captions focused
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 9 pages, Accepted at EMNLP 24, for more details see this http URL
点击查看摘要
Abstract:Research in vision and language has made considerable progress thanks to benchmarks such as COCO. COCO captions focused on unambiguous facts in English; ArtEmis introduced subjective emotions and ArtELingo introduced some multilinguality (Chinese and Arabic). However we believe there should be more multilinguality. Hence, we present ArtELingo-28, a vision-language benchmark that spans 28 languages and encompasses approximately 200,000 annotations (140 annotations per image). Traditionally, vision research focused on unambiguous class labels, whereas ArtELingo-28 emphasizes diversity of opinions over languages and cultures. The challenge is to build machine learning systems that assign emotional captions to images. Baseline results will be presented for three novel conditions: Zero-Shot, Few-Shot and One-vs-All Zero-Shot. We find that cross-lingual transfer is more successful for culturally-related languages. Data and code are provided at this http URL.
摘要:视觉与语言研究领域取得了显著进展,这得益于诸如 COCO 等基准测试。COCO 的描述主要集中在英语中的明确事实;ArtEmis 引入了主观情感,而 ArtELingo 则引入了一些多语言特性(中文和阿拉伯语)。然而,我们认为应该有更多的多语言支持。因此,我们推出了 ArtELingo-28,这是一个涵盖 28 种语言的视觉语言基准测试,包含约 200,000 条注释(每张图片约 140 条注释)。传统上,视觉研究侧重于明确的类别标签,而 ArtELingo-28 则强调跨语言和文化的意见多样性。挑战在于构建能够为图像分配情感描述的机器学习系统。我们将展示三种新条件下的基线结果:零样本、少样本和一对多零样本。我们发现,跨语言迁移在文化相关的语言之间更为成功。数据和代码可在以下网址获取:http URL。
[NLP-27] Number Cookbook: Number Understanding of Language Models and How to Improve It
【速读】: 该论文试图解决大型语言模型(LLMs)在基本数值理解和处理(Numerical Understanding and Processing, NUPA)方面的不足问题。解决方案的关键在于引入一个全面的基准测试,涵盖四种常见的数值表示和17种不同的数值任务,共计41种组合。通过该基准测试,论文发现当前的LLMs在许多任务中表现不佳。为了研究这一问题,论文训练了使用现有和潜在技术增强NUPA的小模型,并评估了这些技术的效果。此外,论文还对实际规模的LLMs进行了微调,发现尽管简单的微调可以显著提高某些任务的NUPA,但专门设计用于增强NUPA的技术在微调预训练模型时效果不佳。论文还探讨了思维链技术对NUPA的影响,为理解和改进LLMs的NUPA能力迈出了初步的一步。
链接: https://arxiv.org/abs/2411.03766
作者: Haotong Yang,Yi Hu,Shijia Kang,Zhouchen Lin,Muhan Zhang
关键词-EN: Large language models, making surprising mistakes, Large language, basic numerical understanding, complex reasoning tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as 9.11 > 9.9). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work paid little attention to it or only discussed several restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and the rules of these tasks are very simple and clear. Through the benchmark, we find that current LLMs fail frequently in many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as special tokenizers, PEs, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can improve NUPA a lot on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work takes a preliminary step towards understanding and improving NUPA of LLMs. Our benchmark and code are released at this https URL.
摘要:大语言模型(LLMs)在解决越来越多的复杂推理任务的同时,在基础的数值理解和处理(如 9.11 与 9.9 的比较)上却出现了令人惊讶的错误。后一种能力对于解决复杂的算术和数学问题至关重要,并且是大多数推理任务的基础,但以往的研究对此关注甚少,或者仅讨论了几个受限的任务(如整数加法)。本文全面研究了 LLMs 的数值理解和处理能力(NUPA)。首先,我们引入了一个涵盖四种常见数值表示和四大类中 17 种不同数值任务的基准,总计形成了 41 个有意义的组合。这些任务源自中小学教育课程,几乎涵盖了所有日常数值理解和处理的场景,且这些任务的规则非常简单明了。通过该基准,我们发现当前的 LLMs 在许多任务中频繁失败。为了研究这个问题,我们训练了使用现有和潜在技术(如特殊 Tokenizer、PE 和数字格式)增强 NUPA 的小模型,并全面评估了它们在我们测试平台上的有效性。我们还对实际规模的 LLMs 进行了我们提出的 NUPA 任务的微调,发现:1)简单的微调可以在许多但并非所有任务上显著提升 NUPA;2)令人惊讶的是,旨在增强 NUPA 的技术对微调预训练模型证明无效。我们进一步探讨了思维链技术对 NUPA 的影响。我们的工作为理解和提升 LLMs 的 NUPA 迈出了初步的一步。我们的基准和代码已在以下链接发布。
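下面是一个示意性的小型测试生成器,按摘要中 "9.11 > 9.9" 一类的小数比较错误构造测试项并给出标准答案;其中 call_llm 是假设的模型调用接口,并非该基准的原始实现。

```python
import random
from decimal import Decimal

def make_comparison_items(n=5, seed=0):
    """生成 '9.11 vs 9.9' 式的小数比较题及标准答案。"""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        whole = rng.randint(1, 20)
        a = f"{whole}.{rng.randint(1, 99):02d}"  # 两位小数
        b = f"{whole}.{rng.randint(1, 9)}"       # 一位小数
        if Decimal(a) > Decimal(b):
            gold = ">"
        elif Decimal(a) < Decimal(b):
            gold = "<"
        else:
            gold = "="
        items.append((a, b, gold))
    return items

for a, b, gold in make_comparison_items():
    prompt = f"Which is larger: {a} or {b}? Answer with >, < or =."
    # answer = call_llm(prompt)           # call_llm 为假设的模型调用接口
    # correct = answer.strip() == gold    # 按精确匹配计分
    print(prompt, "| gold:", gold)
```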
[NLP-28] The Root Shapes the Fruit: On the Persistence of Gender-Exclusive Harms in Aligned Language Models NEURIPS
【速读】: 该论文试图解决的问题是:自然语言助手在通过与人类偏好对齐来避免有害输出的过程中,是否可能无意中延续或放大其基础模型中已有的有害偏见,特别是针对性别多样性群体的偏见。解决方案的关键在于:1) 对领先的偏好微调大型语言模型(LLMs)进行全面的偏见评估模式调查,揭示性别多样性代表性方面的关键差距;2) 系统地评估12个模型在直接偏好优化(DPO)阶段的性别多样性偏见,发现现有偏见基准未能检测到的危害;3) 提出一个灵活的框架,用于测量隐含奖励信号中的有害偏见,适用于其他社会背景。研究结果表明,DPO对齐模型对监督微调(SFT)特别敏感,并可能放大基础模型中的两种现实世界性别多样性危害:污名化和非性别肯定性语言。最终,论文建议采用社区导向的偏见评估框架,以更有效地识别和解决LLMs中未被充分代表群体的危害。
链接: https://arxiv.org/abs/2411.03700
作者: Anaelia Ovalle,Krunoslav Lehman Pavasovic,Louis Martin,Luke Zettlemoyer,Eric Michael Smith,Adina Williams,Levent Sagun
关键词-EN: avoiding harmful outputs, Natural-language assistants, largely achieved, assistants are designed, designed to provide
类目: Computation and Language (cs.CL)
备注: Accepted to 2024 Neurips Queer in AI Workshop
点击查看摘要
Abstract:Natural-language assistants are designed to provide users with helpful responses while avoiding harmful outputs, largely achieved through alignment to human preferences. Yet there is limited understanding of whether alignment techniques may inadvertently perpetuate or even amplify harmful biases inherited from their pre-aligned base models. This issue is compounded by the choice of bias evaluation benchmarks in popular preference-finetuned models, which predominantly focus on dominant social categories, such as binary gender, thereby limiting insights into biases affecting underrepresented groups. Towards addressing this gap, we center transgender, nonbinary, and other gender-diverse identities to investigate how alignment procedures interact with pre-existing gender-diverse bias in LLMs. Our key contributions include: 1) a comprehensive survey of bias evaluation modalities across leading preference-finetuned LLMs, highlighting critical gaps in gender-diverse representation, 2) systematic evaluation of gender-diverse biases across 12 models spanning Direct Preference Optimization (DPO) stages, uncovering harms popular bias benchmarks fail to detect, and 3) a flexible framework for measuring harmful biases in implicit reward signals applicable to other social contexts. Our findings reveal that DPO-aligned models are particularly sensitive to supervised finetuning (SFT), and can amplify two forms of real-world gender-diverse harms from their base models: stigmatization and gender non-affirmative language. We conclude with recommendations tailored to DPO and broader alignment practices, advocating for the adoption of community-informed bias evaluation frameworks to more effectively identify and address underrepresented harms in LLMs.
摘要:自然语言助手旨在为用户提供有益的回应,同时避免有害输出,这主要通过与人类偏好对齐来实现。然而,目前对于对齐技术是否可能无意中延续甚至放大其预对齐基础模型中固有的有害偏见,理解尚不充分。这一问题因流行偏好微调模型中偏见评估基准的选择而加剧,这些基准主要关注主导社会类别,如二元性别,从而限制了对影响少数群体偏见的洞察。为了解决这一差距,我们将跨性别者、非二元性别者及其他性别多样性身份置于中心,研究对齐过程如何与大语言模型(LLMs)中预先存在的性别多样性偏见相互作用。我们的主要贡献包括:1) 对领先偏好微调LLMs中偏见评估模式的全面调查,突显性别多样性代表性方面的关键差距;2) 对12个模型在直接偏好优化(DPO)阶段进行性别多样性偏见的系统评估,揭示了流行偏见基准未能检测到的危害;3) 一个灵活的框架,用于测量适用于其他社会背景的隐性奖励信号中的有害偏见。我们的研究结果表明,DPO对齐模型对监督微调(SFT)特别敏感,并可能从其基础模型中放大两种形式的现实世界性别多样性危害:污名化和非性别肯定性语言。我们最后提出了针对DPO及更广泛对齐实践的建议,主张采用社区导向的偏见评估框架,以更有效地识别和解决LLMs中少数群体的危害。
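论文提出在隐式奖励信号中测量有害偏见。DPO 的隐式奖励通常写作 r(x,y) = β·log(π(y|x)/π_ref(y|x))(源自 DPO 原论文)。下面的草图假设已分别得到策略模型与参考模型对两个补全的对数似然,计算性别肯定补全与非肯定补全之间的隐式奖励差;β 的取值与示例数值均为假设,仅用于说明测量思路。

```python
BETA = 0.1  # DPO 温度系数,取值为假设

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = BETA) -> float:
    """DPO 隐式奖励:r(x, y) = beta * log( pi(y|x) / pi_ref(y|x) )。
    输入为两模型对同一补全 y 的(求和后的)Token 对数似然。"""
    return beta * (logp_policy - logp_ref)

def reward_margin(affirming, non_affirming):
    """同一提示下,性别肯定补全与非肯定补全的隐式奖励差;
    若在数据集上系统性为负,即为该框架试图捕捉的偏见信号。"""
    return implicit_reward(*affirming) - implicit_reward(*non_affirming)

# 假设的 (logp_policy, logp_ref) 数值,仅用于演示
margin = reward_margin(affirming=(-42.0, -45.0), non_affirming=(-40.0, -41.5))
print(f"implicit reward margin: {margin:+.3f}")
```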
[NLP-29] QUILL: Quotation Generation Enhancement of Large Language Models
【速读】: 该论文试图解决大型语言模型(LLMs)在引文生成任务中的不足,即模型在提供事实性引文时容易产生幻觉,或在生成超出人类预期的引文时表现不佳。解决方案的关键在于建立了一个全面的自动评估系统,该系统包含五个评估标准及其相应的自动度量方法。此外,论文构建了一个包含多达32,022条引文的双语知识库,并通过设计的引文特定度量方法对从知识库中检索到的引文进行重新排序,以提高引文生成的质量。实验结果表明,这些度量方法与人类偏好高度相关,有效缩小了现有LLMs在引文生成方面的差距。
链接: https://arxiv.org/abs/2411.03675
作者: Jin Xiao,Bowei Zhang,Qianyu He,Jiaqing Liang,Feng Wei,Jinglei Chen,Zujie Liang,Deqing Yang,Yanghua Xiao
关键词-EN: Large language models, excellent writing assistants, Large language, language models, writing assistants
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures
点击查看摘要
Abstract:While Large language models (LLMs) have become excellent writing assistants, they still struggle with quotation generation. This is because they either hallucinate when providing factual quotations or fail to provide quotes that exceed human expectations. To bridge the gap, we systematically study how to evaluate and improve LLMs’ performance in quotation generation tasks. We first establish a holistic and automatic evaluation system for quotation generation task, which consists of five criteria each with corresponding automatic metric. To improve the LLMs’ quotation generation abilities, we construct a bilingual knowledge base that is broad in scope and rich in dimensions, containing up to 32,022 quotes. Moreover, guided by our criteria, we further design a quotation-specific metric to rerank the retrieved quotations from the knowledge base. Extensive experiments show that our metrics strongly correlate with human preferences. Existing LLMs struggle to generate desired quotes, but our quotation knowledge base and reranking metric help narrow this gap. Our dataset and code are publicly available at this https URL.
摘要:尽管大语言模型 (LLMs) 已成为出色的写作助手,但在引述生成方面仍面临挑战。这是因为它们在提供事实引述时容易产生幻觉,或者无法提供超出人类预期的引述。为了弥补这一差距,我们系统地研究了如何评估和提升大语言模型在引述生成任务中的表现。首先,我们建立了一个全面的自动化评估系统,用于引述生成任务,该系统包含五个标准,每个标准都有相应的自动化指标。为了提升大语言模型的引述生成能力,我们构建了一个广泛且多维的双语知识库,其中包含多达32,022条引述。此外,在我们的标准指导下,我们进一步设计了一个专门针对引述的指标,用于对从知识库中检索到的引述进行重新排序。大量实验表明,我们的指标与人类偏好高度相关。现有的大语言模型在生成理想引述方面存在困难,但我们的引述知识库和重新排序指标有助于缩小这一差距。我们的数据集和代码已在以下链接公开:https URL。
[NLP-30] Evaluating Moral Beliefs across LLM s through a Pluralistic Framework
【速读】: 该论文试图解决如何评估大型语言模型(Large Language Models, LLMs)的道德信念(moral beliefs)的问题。解决方案的关键在于引入了一个创新的三模块框架,该框架包括:1) 构建一个包含472个道德选择场景的中文数据集,这些场景源自道德词汇;2) 通过模型在这些场景中的决策过程,揭示其道德原则偏好;3) 通过道德辩论,考察模型对其道德选择的坚定性。研究结果表明,英语语言模型如ChatGPT和Gemini表现出强烈的个人主义道德信念,而中文模型如Ernie和ChatGLM则倾向于集体主义道德信念,并揭示了所有模型中存在的性别偏见。该方法为评估人工智能和人类智能的道德信念提供了一种创新手段,并促进了不同文化间道德价值观的比较。
链接: https://arxiv.org/abs/2411.03665
作者: Xuelin Liu,Yanfei Zhu,Shucheng Zhu,Pengyuan Liu,Ying Liu,Dong Yu
关键词-EN: Proper moral beliefs, moral, moral beliefs, Proper moral, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Proper moral beliefs are fundamental for language models, yet assessing these beliefs poses a significant challenge. This study introduces a novel three-module framework to evaluate the moral beliefs of four prominent large language models. Initially, we constructed a dataset containing 472 moral choice scenarios in Chinese, derived from moral words. The decision-making process of the models in these scenarios reveals their moral principle preferences. By ranking these moral choices, we discern the varying moral beliefs held by different language models. Additionally, through moral debates, we investigate the firmness of these models to their moral choices. Our findings indicate that English language models, namely ChatGPT and Gemini, closely mirror moral decisions of the sample of Chinese university students, demonstrating strong adherence to their choices and a preference for individualistic moral beliefs. In contrast, Chinese models such as Ernie and ChatGLM lean towards collectivist moral beliefs, exhibiting ambiguity in their moral choices and debates. This study also uncovers gender bias embedded within the moral beliefs of all examined language models. Our methodology offers an innovative means to assess moral beliefs in both artificial and human intelligence, facilitating a comparison of moral values across different cultures.
摘要:正确的道德信念对于语言模型至关重要,然而评估这些信念却是一个重大挑战。本研究引入了一种新颖的三模块框架,用于评估四个著名大语言模型的道德信念。首先,我们构建了一个包含472个道德选择情境的数据集,这些情境源自道德词汇,并以中文呈现。模型在这些情境中的决策过程揭示了它们对道德原则的偏好。通过这些道德选择的排序,我们可以识别出不同语言模型所持有的不同道德信念。此外,通过道德辩论,我们考察了这些模型对其道德选择的坚定性。研究结果表明,英语语言模型,如ChatGPT和Gemini,其道德决策与样本中的中国大学生非常接近,显示出对自身选择的强烈坚持以及对个人主义道德信念的偏好。相比之下,中文模型如Ernie和ChatGLM则倾向于集体主义道德信念,在道德选择和辩论中表现出模糊性。本研究还揭示了所有被考察的语言模型中嵌入的性别偏见。我们的方法为评估人工智能和人类智能中的道德信念提供了一种创新手段,促进了不同文化间道德价值的比较。
[NLP-31] Deploying Multi-task Online Server with Large Language Model COLING2025
【速读】: 该论文试图解决在工业环境中部署大量任务时,传统单任务模型开发和扩展成本过高的问题。解决方案的关键在于提出了一种三阶段的多任务学习框架,包括任务筛选、在高资源任务上微调,以及在所有任务上微调。通过这种方法,论文展示了在不同基准测试中,多任务学习框架能够在性能上与单任务方法相媲美,同时减少高达90.9%的开发和扩展成本。
链接: https://arxiv.org/abs/2411.03644
作者: Yincen Qu,Chao Ma,Yiting Wu,Xiangying Dai,Hui Zhou,Hengyue Liu
关键词-EN: deployed online, large language models, language models, numerous tasks, models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: COLING2025 under submission
点击查看摘要
Abstract:In the industry, numerous tasks are deployed online. Traditional approaches often tackle each task separately by its own network, which leads to excessive costs for developing and scaling models, especially in the context of large language models. Although multi-task methods can save costs through parameter sharing, they often struggle to outperform single-task methods in real-world applications. To tackle these challenges, we present a three-stage multi-task learning framework for large language models. It involves task filtering, followed by fine-tuning on high-resource tasks, and finally fine-tuning on all tasks. We conducted comprehensive experiments in single-task and multi-task settings. Our approach, exemplified on different benchmarks, demonstrates that it is able to achieve performance comparable to the single-task method while reducing up to 90.9% of its overhead.
摘要:在工业领域,众多任务被部署于线上。传统方法通常通过各自的网络分别处理每个任务,这导致了在开发和扩展模型时,尤其是在大语言模型的背景下,成本过高。尽管多任务方法通过参数共享可以节省成本,但在实际应用中,它们往往难以超越单任务方法。为了应对这些挑战,我们提出了一种针对大语言模型的三阶段多任务学习框架。该框架包括任务筛选、对高资源任务进行微调,以及对所有任务进行最终微调。我们在单任务和多任务设置下进行了全面的实验。我们的方法在不同基准测试中的实例化表明,它能够在减少高达90.9%的开销的同时,实现与单任务方法相媲美的性能。
[NLP-32] From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
【速读】: 该论文试图解决在大语言模型(LLMs)中如何通过运行时策略提升其在专业领域(如医学)的性能问题。解决方案的关键在于理解和优化运行时推理模型(如OpenAI的o1-preview)在医学挑战问题基准测试中的表现。论文通过系统评估o1-preview模型在各种医学基准上的表现,发现即使不使用提示技术,o1-preview也能显著超越使用Medprompt的GPT-4系列。此外,论文探讨了经典提示工程策略(如Medprompt)在推理模型新范式中的有效性,发现少样本提示可能不再适用于推理原生模型,而集成方法虽有效但资源密集。最终,论文通过成本和准确性分析,揭示了不同运行时策略的帕累托前沿,指出GPT-4o在特定情况下仍具有价值,而o1-preview则在更高成本下实现了最先进的性能。
链接: https://arxiv.org/abs/2411.03590
作者: Harsha Nori,Naoto Usuyama,Nicholas King,Scott Mayer McKinney,Xavier Fernandes,Sheng Zhang,Eric Horvitz
关键词-EN: guiding large language, large language models, Medprompt, valuable for guiding, guiding large
类目: Computation and Language (cs.CL)
备注: 25 pages
点击查看摘要
Abstract:Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI’s o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1’s performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
摘要:运行时引导策略,如 Medprompt,对于指导大语言模型 (LLMs) 在复杂任务中达到顶尖性能具有重要价值。Medprompt 展示了通过使用提示来引发包含思维链推理和集成的运行时策略,一个通用的大语言模型可以在医学等专业领域提供最先进的性能。OpenAI 的 o1-preview 模型代表了一种新范式,其中模型在生成最终响应之前设计为进行运行时推理。我们旨在理解 o1-preview 在一系列多样化的医学挑战问题基准上的行为。继 Medprompt 研究与 GPT-4 之后,我们系统地评估了 o1-preview 模型在各种医学基准上的表现。值得注意的是,即使不使用提示技术,o1-preview 在很大程度上也优于使用 Medprompt 的 GPT-4 系列。我们进一步系统地研究了经典提示工程策略(如 Medprompt)在新推理模型范式中的有效性。我们发现少样本提示阻碍了 o1 的性能,这表明上下文学习可能不再是推理原生模型的有效引导方法。虽然集成仍然可行,但它资源密集且需要仔细的成本-性能优化。我们对运行时策略的成本和准确性分析揭示了一个帕累托前沿,其中 GPT-4o 代表了一个更经济的选择,而 o1-preview 在高成本下实现了最先进的性能。尽管 o1-preview 提供了顶尖性能,但在特定情境下,使用 Medprompt 等引导策略的 GPT-4o 仍然具有价值。此外,我们注意到 o1-preview 模型在许多现有医学基准上已接近饱和,这突显了需要新的、更具挑战性的基准。我们以对大语言模型推理时计算的总体方向的反思作为结束。
[NLP-33] he American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge
【速读】: 该论文试图解决美国手语(American Sign Language, ASL)在语言技术中的可访问性问题,特别是通过训练模型来实现孤立手语识别(Isolated Sign Recognition, ISR)和ASL到英语的翻译。解决方案的关键在于引入美国手语知识图谱(American Sign Language Knowledge Graph, ASLKG),该图谱整合了来自十二个专家语言学知识源的信息。ASLKG用于训练神经符号模型,以提高模型在ASL理解任务中的泛化能力和可解释性,最终在ISR任务中达到91%的准确率,对未见手语的语义特征预测达到14%,以及对Youtube-ASL视频的主题分类达到36%。
链接: https://arxiv.org/abs/2411.03568
作者: Lee Kezar,Nidhi Munikote,Zian Zeng,Zed Sehyr,Naomi Caselli,Jesse Thomason
关键词-EN: American Sign Language, make language technologies, language technologies substantially, Language Knowledge Graph, American Sign
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Language models for American Sign Language (ASL) could make language technologies substantially more accessible to those who sign. To train models on tasks such as isolated sign recognition (ISR) and ASL-to-English translation, datasets provide annotated video examples of ASL signs. To facilitate the generalizability and explainability of these models, we introduce the American Sign Language Knowledge Graph (ASLKG), compiled from twelve sources of expert linguistic knowledge. We use the ASLKG to train neuro-symbolic models for 3 ASL understanding tasks, achieving accuracies of 91% on ISR, 14% for predicting the semantic features of unseen signs, and 36% for classifying the topic of Youtube-ASL videos.
摘要:针对美国手语 (American Sign Language, ASL) 的语言模型可以使语言技术对那些使用手语的人群更加普及。为了在孤立手语识别 (Isolated Sign Recognition, ISR) 和 ASL 到英语翻译等任务上训练模型,数据集提供了带有注释的 ASL 手语视频示例。为了促进这些模型的泛化能力和可解释性,我们引入了美国手语知识图谱 (American Sign Language Knowledge Graph, ASLKG),该图谱从十二个专家语言学知识来源中编译而成。我们利用 ASLKG 训练了神经符号模型,用于三个 ASL 理解任务,分别在 ISR 任务上达到了 91% 的准确率,在预测未见手语的语义特征任务上达到了 14% 的准确率,以及在分类 Youtube-ASL 视频主题任务上达到了 36% 的准确率。
[NLP-34] Learning to Write Rationally: How Information Is Distributed in Non-Native Speakers Essays EMNLP2024
【速读】: 该论文试图解决的问题是:第二语言学习者在非母语(L2)写作中如何分配信息,以实现更清晰和有效的沟通。解决方案的关键在于通过分析意外性(surprisal)和熵率恒定性(constancy of entropy rate),发现高L2熟练度的写作者能够在减少语言生产预期不确定性的同时,仍然传达有信息量的内容。然而,信息分布的均匀性在不同L2说话者群体中表现出较小的变异性,这表明这一特征可能是L2写作中的普遍现象,较少受到L2写作者的L1背景和L2熟练度的影响。
链接: https://arxiv.org/abs/2411.03550
作者: Zixin Tang,Janet G. van Hell
关键词-EN: People tend, distribute information evenly, clearer communication, distribute information, People
类目: Computation and Language (cs.CL)
备注: To appear in main of Conference on Empirical Methods in Natural Language Processing; EMNLP 2024
点击查看摘要
Abstract:People tend to distribute information evenly in language production for better and clearer communication. In this study, we compared essays written by second language learners with various native language (L1) backgrounds to investigate how they distribute information in their non-native language (L2) production. Analyses of surprisal and constancy of entropy rate indicated that writers with higher L2 proficiency can reduce the expected uncertainty of language production while still conveying informative content. However, the uniformity of information distribution showed less variability among different groups of L2 speakers, suggesting that this feature may be universal in L2 essay writing and less affected by L2 writers’ variability in L1 background and L2 proficiency.
摘要:在语言生成过程中,人们倾向于均匀地分布信息,以实现更好、更清晰的沟通。在本研究中,我们比较了来自不同母语(L1)背景的第二语言学习者所撰写的文章,以探讨他们在非母语(L2)生成中如何分布信息。通过对意外性和熵率恒定性的分析表明,L2 熟练度较高的作者能够在传达信息内容的同时,减少语言生成中的预期不确定性。然而,信息分布的均匀性在不同 L2 说话者群体中表现出较小的差异,这表明这一特征在 L2 文章写作中可能是普遍的,并且较少受到 L2 作者在 L1 背景和 L2 熟练度方面的差异的影响。
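作为对意外性(surprisal)计算的示意,下面的 Python 草图用一个因果语言模型(此处以 gpt2 为例,仅作演示)计算每句的平均每 Token 意外性 -log2 p(token|前文);各句该值的均匀程度可粗略反映熵率恒定性的分析思路,并非论文的原始流程。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # 演示用模型,任意因果语言模型皆可
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def mean_surprisal(sentence: str) -> float:
    """句子的平均每 Token 意外性:-log2 p(token | 前文)。"""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # 预测下一 Token
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return float((-token_lp / torch.log(torch.tensor(2.0))).mean())

essay = ["The results were surprising.", "We therefore revised the model."]
print([round(mean_surprisal(s), 2) for s in essay])  # 各句取值越接近,信息分布越均匀
```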
[NLP-35] Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry
【速读】: 该论文试图解决在大规模语言模型(如GPT系列、BLOOM、LLaMA等)在应用于特定科学领域(如化学)时,由于模型在狭窄或专业领域中可能出现的幻觉(hallucinations)或流畅但错误的响应,导致性能下降的问题。解决方案的关键在于通过在特定领域(如化学)进行预训练,并结合指令微调(instruction fine-tuning),以提升模型在化学领域特定任务(如命名实体识别和分子式生成)上的表现。研究结果表明,领域内预训练的模型在零样本(zero-shot)设置下已经表现良好,而进一步的指令微调则显著提升了模型在化学任务上的性能。
链接: https://arxiv.org/abs/2411.03542
作者: Anurag Acharya,Shivam Sharma,Robin Cosbey,Megha Subramanian,Scott Howland,Maria Glenski
关键词-EN: Large Language Models, natural language processing, Large Language, GPT series, proliferation of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:A proliferation of Large Language Models (the GPT series, BLOOM, LLaMA, and more) are driving forward novel development of multipurpose AI for a variety of tasks, particularly natural language processing (NLP) tasks. These models demonstrate strong performance on a range of tasks; however, there has been evidence of brittleness when applied to more niche or narrow domains where hallucinations or fluent but incorrect responses reduce performance. Given the complex nature of scientific domains, it is prudent to investigate the trade-offs of leveraging off-the-shelf versus more targeted foundation models for scientific domains. In this work, we examine the benefits of in-domain pre-training for a given scientific domain, chemistry, and compare these to open-source, off-the-shelf models with zero-shot and few-shot prompting. Our results show that not only do in-domain base models perform reasonably well on in-domain tasks in a zero-shot setting but that further adaptation using instruction fine-tuning yields impressive performance on chemistry-specific tasks such as named entity recognition and molecular formula generation.
摘要:大语言模型(如 GPT 系列、BLOOM、LLaMA 等)的广泛应用正在推动多功能 AI 在多种任务中的创新发展,特别是在自然语言处理 (NLP) 任务中。这些模型在各类任务中表现出色;然而,当应用于更为专业或狭窄的领域时,模型表现出脆弱性,出现幻觉或流畅但错误的响应,从而降低了性能。鉴于科学领域的复杂性,有必要探讨利用现成模型与针对科学领域定制的基础模型之间的权衡。在本研究中,我们考察了针对特定科学领域(化学)进行领域内预训练的益处,并将其与开源的现成模型在零样本和少样本提示下的表现进行比较。结果表明,领域内基础模型不仅在零样本设置下在领域内任务中表现良好,而且通过指令微调进一步适应后,在化学特定任务(如命名实体识别和分子式生成)中取得了显著的性能提升。
[NLP-36] Long Context RAG Performance of Large Language Models NEURIPS
【速读】: 该论文试图解决的问题是:在检索增强生成 (Retrieval Augmented Generation, RAG) 场景中,随着大型语言模型 (Large Language Models, LLMs) 支持的上下文长度增加,这些模型是否能够提升RAG性能。解决方案的关键在于系统地研究不同上下文长度(从2,000到128,000甚至2百万个token)对RAG性能的影响,并通过实验评估20个流行的开源和商业LLMs在三个特定领域数据集上的表现。研究发现,虽然检索更多文档可以提升性能,但只有少数最新的最先进LLMs能够在超过64k token的长上下文情况下保持一致的准确性,并揭示了长上下文场景中的特定失败模式,为未来的研究指明了方向。
链接: https://arxiv.org/abs/2411.03538
作者: Quinn Leng,Jacob Portes,Sam Havens,Matei Zaharia,Michael Carbin
关键词-EN: Retrieval Augmented Generation, Large Language Models, Retrieval Augmented, Augmented Generation, Large Language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 2024 NeurIPS workshop on Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) has emerged as a crucial technique for enhancing the accuracy of Large Language Models (LLMs) by incorporating external information. With the advent of LLMs that support increasingly longer context lengths, there is a growing interest in understanding how these models perform in RAG scenarios. Can these new long context models improve RAG performance? This paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open source and commercial LLMs. We ran RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications. Our findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state of the art LLMs can maintain consistent accuracy at long context above 64k tokens. We also identify distinct failure modes in long context scenarios, suggesting areas for future research.
摘要:检索增强生成 (Retrieval Augmented Generation, RAG) 已成为提升大语言模型 (Large Language Models, LLMs) 准确性的关键技术,通过整合外部信息实现这一目标。随着支持更长上下文长度的 LLMs 的出现,人们越来越关注这些模型在 RAG 场景中的表现。这些新的长上下文模型能否提升 RAG 性能?本文对 20 个流行的开源和商业 LLMs 进行了全面研究,探讨了增加上下文长度对 RAG 性能的影响。我们在三个特定领域的数据集上运行 RAG 工作流程,上下文长度从 2,000 到 128,000 Token(在可能的情况下达到 200 万 Token),并报告了长上下文在 RAG 应用中的优势和局限性的关键见解。我们的研究发现,尽管检索更多文档可以提升性能,但仅有少数最新的最先进 LLMs 能够在上下文长度超过 64k Token 时保持一致的准确性。我们还识别了长上下文场景中的独特失效模式,指出了未来研究的方向。
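为说明实验中"在固定 Token 预算内改变上下文长度/检索文档数量"的做法,下面给出一个示意性的提示构建循环;ranked_docs、count_tokens 等名称均为假设的接口,基于空格分词的粗略计数仅用于演示。

```python
def build_rag_prompt(question, ranked_docs, count_tokens, context_budget):
    """按 Token 预算(如 2k/8k/.../128k)依次填入检索排名靠前的文档。
    ranked_docs 假设为检索器输出(最优在前),count_tokens 为分词计数接口。"""
    header = f"Answer using the documents below.\n\nQuestion: {question}\n\n"
    used, chosen = count_tokens(header), []
    for doc in ranked_docs:
        cost = count_tokens(doc)
        if used + cost > context_budget:
            break
        chosen.append(doc)
        used += cost
    return header + "\n---\n".join(chosen)

approx_tokens = lambda text: len(text.split())  # 仅演示用的粗略计数
docs = [f"Document {i}: " + "content " * 300 for i in range(100)]
for budget in (2_000, 8_000, 32_000, 128_000):
    prompt = build_rag_prompt("What changed in v2?", docs, approx_tokens, budget)
    print(budget, "->", approx_tokens(prompt), "tokens")
```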
[NLP-37] Mitigating Metric Bias in Minimum Bayes Risk Decoding
【速读】: 该论文试图解决的问题是最小贝叶斯风险解码 (Minimum Bayes Risk, MBR) 在使用神经网络评价指标(如COMET或MetricX)时引入的评价指标偏差 (metric bias)。具体来说,MBR解码旨在生成根据特定效用指标评分高的翻译,但这使得同一指标无法同时用于解码和评估,因为改进可能仅仅是由于奖励机制的操纵而非实际质量的提升。论文的关键解决方案是使用多个效用指标的集成 (ensemble of utility metrics) 进行MBR解码,以减轻评价指标偏差问题。实验结果表明,使用集成效用指标的MBR解码在人类评估中优于单一效用指标的解码。
链接: https://arxiv.org/abs/2411.03524
作者: Geza Kovacs,Daniel Deutsch,Markus Freitag
关键词-EN: Minimum Bayes Risk, Bayes Risk, Minimum Bayes, MBR decoding, COMET or MetricX
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear at WMT2024
点击查看摘要
Abstract:While Minimum Bayes Risk (MBR) decoding using metrics such as COMET or MetricX has outperformed traditional decoding methods such as greedy or beam search, it introduces a challenge we refer to as metric bias. As MBR decoding aims to produce translations that score highly according to a specific utility metric, this very process makes it impossible to use the same metric for both decoding and evaluation, as improvements might simply be due to reward hacking rather than reflecting real quality improvements. In this work we find that compared to human ratings, neural metrics not only overestimate the quality of MBR decoding when the same metric is used as the utility metric, but they also overestimate the quality of MBR/QE decoding with other neural utility metrics as well. We also show that the metric bias issue can be mitigated by using an ensemble of utility metrics during MBR decoding: human evaluations show that MBR decoding using an ensemble of utility metrics outperforms a single utility metric.
摘要:尽管使用如 COMET 或 MetricX 等度量的最小贝叶斯风险 (Minimum Bayes Risk, MBR) 解码在性能上超越了传统的解码方法,如贪心搜索或束搜索,但它引入了一个我们称之为度量偏差 (metric bias) 的挑战。由于 MBR 解码旨在生成根据特定效用度量得分高的翻译,这一过程使得无法同时使用相同的度量进行解码和评估,因为改进可能仅仅是由于奖励操纵 (reward hacking) 而非真正质量的提升。在本研究中,我们发现与人类评分相比,神经网络度量不仅在使用相同度量作为效用度量时高估了 MBR 解码的质量,而且在使用其他神经网络效用度量进行 MBR/QE 解码时也高估了其质量。我们还表明,通过在 MBR 解码过程中使用效用度量的集成,可以缓解度量偏差问题:人类评估显示,使用效用度量集成的 MBR 解码优于单一效用度量。
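下面用 Python 勾勒"MBR 解码 + 效用指标集成"的核心流程:对每个候选,以其余候选为伪参考,在每个指标下计算期望效用,再对各指标取平均,选取得分最高者。示例中的两个玩具效用函数只是 COMET/MetricX 等神经指标的占位,并非论文实际使用的指标。

```python
def mbr_decode(candidates, utilities):
    """MBR 解码配合效用指标集成:返回平均期望效用最高的候选。"""
    def expected_utility(cand):
        refs = [r for r in candidates if r is not cand]  # 其余候选作伪参考
        per_metric = [sum(u(cand, r) for r in refs) / len(refs) for u in utilities]
        return sum(per_metric) / len(per_metric)         # 跨指标取平均
    return max(candidates, key=expected_utility)

# 两个玩具效用函数,仅作神经指标的占位
def unigram_f1(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    return 2 * len(h & r) / (len(h) + len(r)) if h and r else 0.0

def length_agreement(hyp, ref):
    lh, lr = len(hyp.split()), len(ref.split())
    return min(lh, lr) / max(lh, lr)

samples = ["the cat sat on the mat", "a cat sits on a mat", "the cat is on the mat"]
print(mbr_decode(samples, [unigram_f1, length_agreement]))
```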
[NLP-38] Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy EMNLP
【速读】: 该论文试图解决大型语言模型 (Large Language Models, LLMs) 中的模型压缩问题,特别是通过改进传统的常数切片方法 (constant slicing) 来提高模型的效率和性能。解决方案的关键在于引入了一种动态层级特定剪枝 (dynamic layer-specific pruning) 方法,并提出了一个新的指标——层冗余度 (Layer Redundancy, LR) 分数。该分数通过测量每一层的输入与输出之间的余弦相似度来评估每一层对其输入的变化程度,从而确定哪些部分可以被剪枝。通过动态调整每一层的剪枝比例,使得所有层的平均剪枝比例保持固定值,该方法在多个模型(如Llama3-8B和Mistral-7B)和数据集上的实验结果表明,不仅能够保持甚至提升模型性能,还在某些情况下比传统的常数切片方法提高了多达5%的性能,并且在多个基准测试中观察到高达7%的困惑度 (perplexity) 下降。
链接: https://arxiv.org/abs/2411.03513
作者: Razvan-Gabriel Dumitru,Paul-Ioan Clotan,Vikas Yadav,Darius Peteleaza,Mihai Surdeanu
关键词-EN: Large Language Models, Large Language, dynamic layer-specific pruning, pruning in Large, traditional methodology established
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at EMNLP Findings 2024
点击查看摘要
Abstract:This paper introduces a novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), enhancing the traditional methodology established by SliceGPT. By transitioning from constant to dynamic slicing, our method leverages the newly proposed Layer Redundancy (LR) score, which assesses how much change each layer changes its input by measuring the cosine similarity of the input to the output of the layer. We use this score to prune parts of individual layers based on redundancy in such a way that the average pruned percentage for all layers is a fixed value. We conducted extensive experiments using models like Llama3-8B and Mistral-7B on multiple datasets, evaluating different slicing bases and percentages to determine optimal configurations that balance efficiency and performance. Our findings show that our dynamic slicing approach not only maintains but, in many cases, enhances model performance compared to the baseline established by constant slicing methods. For instance, in several settings, we see performance improvements of up to 5% over the SliceGPT baseline. Additionally, a perplexity decrease by as much as 7% was observed across multiple benchmarks, validating the effectiveness of our method. The code, model weights, and datasets are open-sourced at this https URL.
摘要:本文介绍了一种通过在大语言模型 (LLM) 中进行动态层级特定剪枝的新型模型压缩方法,该方法对 SliceGPT 建立的传统方法进行了改进。通过从固定切片转向动态切片,我们的方法利用了新提出的层冗余 (Layer Redundancy, LR) 评分,该评分通过测量输入与输出之间的余弦相似度来评估每一层对其输入的变化程度。我们使用这一评分来根据冗余度对各层的某些部分进行剪枝,使得所有层的平均剪枝百分比保持固定值。我们使用 Llama3-8B 和 Mistral-7B 等模型在多个数据集上进行了广泛的实验,评估了不同的切片基准和百分比,以确定在效率和性能之间达到平衡的最佳配置。我们的研究结果表明,与固定切片方法建立的基线相比,我们的动态切片方法不仅保持了模型性能,而且在许多情况下还提升了性能。例如,在某些设置中,我们观察到性能比 SliceGPT 基线提高了多达 5%。此外,在多个基准测试中,困惑度降低了多达 7%,验证了我们方法的有效性。代码、模型权重和数据集已在以下链接开源:https URL。
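下面的草图演示层冗余(LR)分数的计算:对每层的输入与输出激活逐 Token 求余弦相似度并取平均,相似度越接近 1 表示该层越冗余;随后按冗余度比例分配各层剪枝率并保持平均剪枝率固定。其中按比例分配只是我们的示意性选择,不一定是论文的确切方案;激活数据为随机生成。

```python
import numpy as np

def lr_scores(layer_inputs, layer_outputs):
    """层冗余 (LR) 分数:逐层计算输入与输出激活的余弦相似度,
    对 Token 取平均;越接近 1 表示该层对输入的改变越小,即越冗余。"""
    scores = []
    for x, y in zip(layer_inputs, layer_outputs):
        cos = (x * y).sum(-1) / (np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1))
        scores.append(float(cos.mean()))
    return np.array(scores)

def dynamic_slicing(scores, avg_prune=0.25):
    """按冗余度比例分配各层剪枝率,保持平均剪枝率为 avg_prune;
    该分配规则为示意性选择,不一定与论文完全一致。"""
    ratios = scores / scores.mean() * avg_prune
    return np.clip(ratios, 0.0, 0.9)

rng = np.random.default_rng(1)
xs = [rng.normal(size=(16, 64)) for _ in range(4)]   # 各层输入(随机示例)
ys = [x + rng.normal(scale=s, size=x.shape)          # 扰动越大越"不冗余"
      for x, s in zip(xs, (0.1, 0.5, 1.0, 2.0))]
print(dynamic_slicing(lr_scores(xs, ys)).round(3))
```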
[NLP-39] Uncertainty Quantification for Clinical Outcome Predictions with (Large) Language Models
【速读】: 该论文试图解决在临床预测任务中,基于电子健康记录(EHRs)的语言模型(LMs)在不确定性量化方面的挑战。解决方案的关键在于通过多任务学习和集成方法(ensemble methods)来减少模型的不确定性。具体来说,论文首先在白盒模型中量化不确定性,并通过多任务和集成方法有效降低不确定性。随后,将这一方法扩展到黑盒模型,包括流行的商业化语言模型如GPT-4,并通过纵向临床数据验证了其有效性。研究结果表明,集成方法和多任务预测提示在不同场景下均能有效降低不确定性,从而提高了模型在白盒和黑盒设置中的透明度和可靠性。
链接: https://arxiv.org/abs/2411.03497
作者: Zizhang Chen,Peizhao Li,Xiaomeng Dong,Pengyu Hong
关键词-EN: electronic health records, facilitate healthcare delivery, health records, significant potential, electronic health
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:To facilitate healthcare delivery, language models (LMs) have significant potential for clinical prediction tasks using electronic health records (EHRs). However, in these high-stakes applications, unreliable decisions can result in high costs due to compromised patient safety and ethical concerns, thus increasing the need for good uncertainty modeling of automated clinical predictions. To address this, we consider the uncertainty quantification of LMs for EHR tasks in white- and black-box settings. We first quantify uncertainty in white-box models, where we can access model parameters and output logits. We show that an effective reduction of model uncertainty can be achieved by using the proposed multi-tasking and ensemble methods in EHRs. Continuing with this idea, we extend our approach to black-box settings, including popular proprietary LMs such as GPT-4. We validate our framework using longitudinal clinical data from more than 6,000 patients in ten clinical prediction tasks. Results show that ensembling methods and multi-task prediction prompts reduce uncertainty across different scenarios. These findings increase the transparency of the model in white-box and black-box settings, thus advancing reliable AI healthcare.
摘要:为了促进医疗服务的提供,语言模型 (LMs) 在利用电子健康记录 (EHRs) 进行临床预测任务方面具有显著潜力。然而,在这些高风险应用中,不可靠的决策可能导致由于患者安全和伦理问题而产生的高成本,因此增加了对自动化临床预测的良好不确定性建模的需求。为了解决这一问题,我们在白盒和黑盒设置中考虑了 EHR 任务的语言模型不确定性量化。我们首先在白盒模型中量化不确定性,其中我们可以访问模型参数和输出 logits。我们展示了通过在 EHRs 中使用提出的多任务和集成方法,可以有效降低模型不确定性。在此基础上,我们将方法扩展到黑盒设置,包括流行的专有语言模型如 GPT-4。我们使用来自 6,000 多名患者的纵向临床数据,在十个临床预测任务中验证了我们的框架。结果显示,集成方法和多任务预测提示在不同场景下降低了不确定性。这些发现增加了白盒和黑盒设置中模型的透明度,从而推动了可靠的 AI 医疗发展。
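作为对"集成方法刻画并降低不确定性"的示意,下面的草图把集成模型的总预测熵分解为成员平均熵(数据不确定性)与互信息(成员间分歧,近似模型不确定性);示例概率为虚构的二分类临床结局,分解方式是常用做法,不必与论文的具体量化指标一致。

```python
import numpy as np

def entropy(q):
    return float(-(q * np.log(q + 1e-12)).sum())

def ensemble_uncertainty(member_probs):
    """member_probs 形状为 (成员数, 类别数)。总预测熵分解为:
    成员平均熵(数据不确定性)+ 互信息(成员间分歧)。"""
    p = np.asarray(member_probs, float)
    total = entropy(p.mean(axis=0))                         # 平均分布的熵
    expected = float(np.mean([entropy(row) for row in p]))  # 各成员熵的平均
    return total, total - expected

probs = [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]]  # 虚构的二分类临床结局概率
total, mi = ensemble_uncertainty(probs)
print(f"total = {total:.3f} nats, disagreement = {mi:.3f} nats")
```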
[NLP-40] Automatic Generation of Question Hints for Mathematics Problems using Large Language Models in Educational Technology NEURIPS2024
【速读】: 该论文试图解决在智能辅导系统 (Intelligent Tutoring Systems, ITSs) 中,如何利用大型语言模型 (Large Language Models, LLMs) 自动生成符合教育目标且能有效纠正学生错误的教育提示 (hints) 的问题。解决方案的关键在于:1) 识别模拟学生在中学数学练习中常见的错误模式;2) 设计针对 GPT-4o 作为教师的多种提示,并评估其生成有效提示的能力;3) 通过 Llama-3-8B-Instruct 作为教师进行测试,比较其与 GPT-4o 的表现。研究发现,GPT-4o 生成的提示在针对特定错误和基于常见数学错误的提示中表现最佳,而 Llama-3-8B-Instruct 在整体表现上优于 GPT-4o。此外,提示的温度设置对模型错误率和学生模型的自我修正能力有显著影响。
链接: https://arxiv.org/abs/2411.03495
作者: Junior Cedric Tonga,Benjamin Clement,Pierre-Yves Oudeyer
关键词-EN: Intelligent Tutoring Systems, Large Language Models, Tutoring Systems, Large Language, Intelligent Tutoring
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2024 Workshop on Large Foundation Models for Educational Assessment (FM-Assess)
点击查看摘要
Abstract:The automatic generation of hints by Large Language Models (LLMs) within Intelligent Tutoring Systems (ITSs) has shown potential to enhance student learning. However, generating pedagogically sound hints that address student misconceptions and adhere to specific educational objectives remains challenging. This work explores using LLMs (GPT-4o and Llama-3-8B-instruct) as teachers to generate effective hints for students simulated through LLMs (GPT-3.5-turbo, Llama-3-8B-Instruct, or Mistral-7B-instruct-v0.3) tackling math exercises designed for human high-school students, and designed using cognitive science principles. We present here the study of several dimensions: 1) identifying error patterns made by simulated students on secondary-level math exercises; 2) developing various prompts for GPT-4o as a teacher and evaluating their effectiveness in generating hints that enable simulated students to self-correct; and 3) testing the best-performing prompts, based on their ability to produce relevant hints and facilitate error correction, with Llama-3-8B-Instruct as the teacher, allowing for a performance comparison with GPT-4o. The results show that model errors increase with higher temperature settings. Notably, when hints are generated by GPT-4o, the most effective prompts include prompts tailored to specific errors as well as prompts providing general hints based on common mathematical errors. Interestingly, Llama-3-8B-Instruct as a teacher showed better overall performance than GPT-4o. Also the problem-solving and response revision capabilities of the LLMs as students, particularly GPT-3.5-turbo, improved significantly after receiving hints, especially at lower temperature settings. However, models like Mistral-7B-Instruct demonstrated a decline in performance as the temperature increased.
摘要:在智能辅导系统 (Intelligent Tutoring Systems, ITS) 中,利用大语言模型 (Large Language Models, LLMs) 自动生成提示已显示出增强学生学习的潜力。然而,生成符合教育目标且能有效解决学生错误观念的教学提示仍然是一个挑战。本研究探讨了使用 LLMs(GPT-4o 和 Llama-3-8B-instruct)作为教师,为通过 LLMs(GPT-3.5-turbo、Llama-3-8B-Instruct 或 Mistral-7B-instruct-v0.3)模拟的学生生成有效提示,这些学生正在解决为高中生设计的数学练习题,这些练习题的设计基于认知科学原理。我们在此研究了以下几个方面:1) 识别模拟学生在中学数学练习中出现的错误模式;2) 为 GPT-4o 开发多种提示,并评估其在生成使模拟学生能够自我纠正的提示方面的有效性;3) 基于生成相关提示和促进错误纠正的能力,测试表现最佳的提示,并使用 Llama-3-8B-Instruct 作为教师进行测试,以便与 GPT-4o 进行性能比较。结果显示,模型错误率随着温度设置的升高而增加。值得注意的是,当提示由 GPT-4o 生成时,最有效的提示包括针对特定错误的提示以及基于常见数学错误的通用提示。有趣的是,作为教师的 Llama-3-8B-Instruct 总体表现优于 GPT-4o。此外,作为学生的 LLMs 的问题解决和响应修订能力,特别是 GPT-3.5-turbo,在接受提示后显著提高,尤其是在较低温度设置下。然而,像 Mistral-7B-Instruct 这样的模型在温度升高时表现出性能下降。
[NLP-41] LASER: Attention with Exponential Transformation ICLR2025
【速读】: 该论文试图解决Transformer模型中基于softmax的点积注意力机制在反向传播过程中梯度信号较弱,导致参数学习效率低下的问题。解决方案的关键是引入了一种名为LASER的新注意力机制,通过理论分析证明其能够传递更大的梯度信号。LASER Attention通过微小的修改即可实现,并在多个任务(包括视觉、文本和语音)中显著提升了模型的泛化性能,例如在Vision Transformer (ViT) 上提高了4.67%的准确率,在Conformer上降低了2.25%的错误率,以及在BERT模型上减少了0.93%的错误预测。
链接: https://arxiv.org/abs/2411.03493
作者: Sai Surya Duvvuri,Inderjit S. Dhillon
关键词-EN: softmax based dot-product, sequence related tasks, based dot-product attention, largely due, sequence related
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 15 pages, under review in ICLR 2025
点击查看摘要
Abstract:Transformers have had tremendous impact for several sequence related tasks, largely due to their ability to retrieve from any part of the sequence via softmax based dot-product attention. This mechanism plays a crucial role in Transformer’s performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER Attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 2.2 billion parameters where we show up to 3.38% and an average of ~1% improvement over standard attention on downstream evaluations. Using LASER gives the following relative improvements in generalization performance across a variety of tasks (vision, text and speech): 4.67% accuracy in Vision Transformer (ViT) on Imagenet, 2.25% error rate in Conformer on the Librispeech speech-to-text and 0.93% fraction of incorrect predictions in BERT with 2.2 billion parameters.
摘要:Transformer 在多个与序列相关的任务中产生了巨大的影响,这主要归功于其通过基于 softmax 的点积注意力机制从序列的任何部分进行检索的能力。这一机制在 Transformer 的性能中起着至关重要的作用。我们分析了通过注意力机制中的 softmax 操作反向传播的梯度,并观察到这些梯度往往较小。这种梯度信号的弱反向传播可能导致注意力操作之前的参数学习效率低下。为此,我们引入了一种新的注意力机制,称为 LASER,我们通过分析表明,该机制允许更大的梯度信号。我们展示了 LASER 注意力可以通过对现有注意力实现进行小修改来实现。我们在具有高达 22 亿参数的自回归大语言模型(LLMs)上进行了实验,结果显示在下游评估中,与标准注意力相比,最高可提升 3.38%,平均提升约 1%。使用 LASER 在各种任务(视觉、文本和语音)上的泛化性能相对提升如下:在 ImageNet 上的 Vision Transformer (ViT) 中准确率提升 4.67%,在 Librispeech 语音转文本任务中的 Conformer 错误率降低 2.25%,以及在具有 22 亿参数的 BERT 中错误预测的比例降低 0.93%。
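摘要并未给出 LASER 的具体公式;下面是按"指数变换"一词的一种可能理解写出的草图(这是一个假设性解读,具体形式请以论文正文为准):对 exp 变换后的值向量做标准 softmax 注意力,再取对数,并用 log-sum-exp 保证数值稳定。

```python
import torch

def laser_attention(q, k, v):
    """假设性解读:out = log( softmax(q k^T / sqrt(d)) @ exp(v) ),
    以 log-sum-exp 形式实现以保证数值稳定。形状均为 (batch, seq, dim)。"""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5     # (B, T, T)
    log_attn = torch.log_softmax(logits, dim=-1)  # 对数注意力权重
    m = v.amax(dim=-2, keepdim=True)              # 沿键维度的稳定项 (B, 1, D)
    # log( sum_j A_ij * exp(v_j) ) = m + log( sum_j exp(log A_ij + v_j - m) )
    out = m + torch.logsumexp(
        log_attn.unsqueeze(-1) + (v - m).unsqueeze(-3), dim=-2
    )
    return out

q = k = v = torch.randn(2, 5, 8)
print(laser_attention(q, k, v).shape)  # torch.Size([2, 5, 8])
```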
[NLP-42] LLM Generated Distribution-Based Prediction of US Electoral Results Part I
【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)进行预测的问题,并提出了一种基于分布的预测方法。解决方案的关键在于将LLMs的输出标记概率解释为表示模型对世界学习到的分布,从而提供了一种分析算法保真度(algorithmic fidelity)的新视角。这种方法不仅补充了硅采样(silicon sampling)的分析方式,还展示了其在特定任务偏差、提示噪声和算法保真度方面的应用潜力,特别是在最近的美国总统选举预测中。通过这种方法,论文旨在提高LLM预测的可靠性和透明度,适用于多个领域。
链接: https://arxiv.org/abs/2411.03486
作者: Caleb Bradshaw,Caelen Miller,Sean Warnick
关键词-EN: Large Language Models, Language Models, interpreting output token, output token probabilities, models’ learned representation
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 10 Figures, Pre-print
点击查看摘要
Abstract:This paper introduces distribution-based prediction, a novel approach to using Large Language Models (LLMs) as predictive tools by interpreting output token probabilities as distributions representing the models’ learned representation of the world. This distribution-based nature offers an alternative perspective for analyzing algorithmic fidelity, complementing the approach used in silicon sampling. We demonstrate the use of distribution-based prediction in the context of recent United States presidential election, showing that this method can be used to determine task specific bias, prompt noise, and algorithmic fidelity. This approach has significant implications for assessing the reliability and increasing transparency of LLM-based predictions across various domains.
摘要:本文介绍了一种基于分布的预测方法,这是一种新颖的使用大语言模型 (LLM) 作为预测工具的途径,通过将输出 Token 的概率解释为表示模型所学世界表示的分布。这种基于分布的特性为分析算法保真度提供了另一种视角,补充了硅采样方法。我们在最近美国总统选举的背景下展示了基于分布的预测方法的应用,表明这种方法可以用于确定任务特定的偏差、提示噪声和算法保真度。这种方法对于评估跨多个领域基于 LLM 的预测的可靠性和提高透明度具有重要意义。
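下面的草图演示"将输出 Token 概率解释为候选结果上的分布"的基本做法:取下一 Token 在各候选词上的概率并归一化。以 gpt2 为演示模型,且假设每个候选词可由其首个 Token 近似(多 Token 候选需累乘序列概率,此处从简);提示语与候选词均为虚构示例。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # 演示用模型
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def outcome_distribution(prompt, outcomes):
    """读取下一 Token 在各候选结果首 Token 上的概率并归一化,
    把模型的 Token 概率当作其对候选结果的学习分布来解读。"""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    raw = {}
    for o in outcomes:
        first_id = tok(" " + o).input_ids[0]  # GPT-2 类 BPE 需前导空格
        raw[o] = float(probs[first_id])
    z = sum(raw.values())
    return {o: p / z for o, p in raw.items()}

prompt = "Q: Which party won the state? A: The"
print(outcome_distribution(prompt, ["Democratic", "Republican"]))
```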
[NLP-43] MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLM s
【速读】: 该论文试图解决大型语言模型(LLMs)在硬件设计领域中尚未被应用于后综合(post-synthesis)指标推理和估计的问题。解决方案的关键在于引入了一个名为MetRex的大规模数据集,该数据集包含25,868个Verilog硬件描述语言(HDL)设计和相应的后综合指标(面积、延迟和静态功耗),并采用思维链(Chain of Thought, CoT)模板来增强LLMs对这些指标的推理能力。通过监督微调(Supervised Fine-Tuning, SFT),LLMs在面积、延迟和静态功耗的推理能力分别提升了37.0%、25.3%和25.7%。尽管SFT显著提升了性能,但在复杂问题上仍未达到最优。与最先进的回归模型相比,该方法在5%误差范围内能更准确地预测17.4%更多的设计,并且通过消除预处理步骤实现了1.7倍的加速。
链接: https://arxiv.org/abs/2411.03471
作者: Manar Abdelatty,Jingxiao Ma,Sherief Reda
关键词-EN: Large Language Models, EDA tool scripting, RTL bug fixing, Large Language, EDA tool
类目: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have been applied to various hardware design tasks, including Verilog code generation, EDA tool scripting, and RTL bug fixing. Despite this extensive exploration, LLMs are yet to be used for the task of post-synthesis metric reasoning and estimation of HDL designs. In this paper, we assess the ability of LLMs to reason about post-synthesis metrics of Verilog designs. We introduce MetRex, a large-scale dataset comprising 25,868 Verilog HDL designs and their corresponding post-synthesis metrics, namely area, delay, and static power. MetRex incorporates a Chain of Thought (CoT) template to enhance LLMs’ reasoning about these metrics. Extensive experiments show that Supervised Fine-Tuning (SFT) boosts the LLM’s reasoning capabilities on average by 37.0%, 25.3%, and 25.7% on the area, delay, and static power, respectively. While SFT improves performance on our benchmark, it remains far from achieving optimal results, especially on complex problems. Comparing to state-of-the-art regression models, our approach delivers accurate post-synthesis predictions for 17.4% more designs (within a 5% error margin), in addition to offering a 1.7x speedup by eliminating the need for pre-processing. This work lays the groundwork for advancing LLM-based Verilog code metric reasoning.
摘要:大语言模型 (LLM) 已被应用于多种硬件设计任务,包括 Verilog 代码生成、EDA 工具脚本编写以及 RTL 错误修复。尽管在这些领域进行了广泛探索,但 LLM 尚未被用于后综合指标推理和 HDL 设计估算。本文评估了 LLM 对 Verilog 设计后综合指标的推理能力。我们引入了 MetRex,这是一个大规模数据集,包含 25,868 个 Verilog HDL 设计及其对应的后综合指标,即面积、延迟和静态功耗。MetRex 结合了思维链 (Chain of Thought, CoT) 模板,以增强 LLM 对这些指标的推理能力。大量实验表明,监督微调 (Supervised Fine-Tuning, SFT) 分别将 LLM 在面积、延迟和静态功耗上的推理能力平均提升了 37.0%、25.3% 和 25.7%。尽管 SFT 提升了在我们的基准测试中的表现,但距离达到最优结果仍有很大差距,尤其是在复杂问题上。与最先进的回归模型相比,我们的方法在 5% 误差范围内,对 17.4% 更多的设计提供了准确的后综合预测,并且通过消除预处理需求,实现了 1.7 倍的加速。这项工作为推进基于 LLM 的 Verilog 代码指标推理奠定了基础。
[NLP-44] Solving Trojan Detection Competitions with Linear Weight Classification
【速读】: 该论文试图解决神经网络中隐藏的恶意后门(Trojan backdoors)的检测问题,特别是在无法获取触发数据的情况下。解决方案的关键在于训练一个二分类器,通过对大量模型权重进行预处理,包括特征选择、标准化、参考模型权重减法和模型对齐,从而有效地识别出被污染的模型。该方法在多个现有数据集和领域中表现出色,并通过广泛的实验验证了其在不同Trojan检测基准和领域中的有效性。
链接: https://arxiv.org/abs/2411.03445
作者: Todd Huster,Peter Lin,Razvan Stefanescu,Emmanuel Ekwedike,Ritu Chadha
关键词-EN: conceal malicious Trojan, Neural networks, malicious Trojan backdoors, networks can conceal, conceal malicious
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 Figures
点击查看摘要
Abstract:Neural networks can conceal malicious Trojan backdoors that allow a trigger to covertly change the model behavior. Detecting signs of these backdoors, particularly without access to any triggered data, is the subject of ongoing research and open challenges. In one common formulation of the problem, we are given a set of clean and poisoned models and need to predict whether a given test model is clean or poisoned. In this paper, we introduce a detector that works remarkably well across many of the existing datasets and domains. It is obtained by training a binary classifier on a large number of models’ weights after performing a few different pre-processing steps including feature selection and standardization, reference model weights subtraction, and model alignment prior to detection. We evaluate this algorithm on a diverse set of Trojan detection benchmarks and domains and examine the cases where the approach is most and least effective.
摘要:神经网络可以隐藏恶意特洛伊木马后门,这些后门允许触发器秘密改变模型行为。检测这些后门的迹象,特别是在无法访问任何触发数据的情况下,是当前研究的主题和开放挑战。在问题的一种常见表述中,我们获得了一组干净和被污染的模型,并需要预测给定的测试模型是干净的还是被污染的。本文中,我们引入了一种检测器,它在许多现有数据集和领域中表现出色。该检测器通过在大量模型权重上训练一个二分类器获得,这些模型权重在检测前经过了包括特征选择和标准化、参考模型权重减法以及模型对齐在内的几个预处理步骤。我们在一系列多样化的特洛伊检测基准和领域上评估了该算法,并考察了该方法在效果最佳和最差的情况。
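下面的 Python 草图复现该思路的骨架:将模型权重展平、减去已对齐的参考模型权重、做简单的特征选择与标准化,再训练一个逻辑回归二分类器。其中的跨步取特征只是对论文特征选择/对齐步骤的粗略占位,模型权重为随机生成的假设样本。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def featurize(weights, reference):
    """展平权重、减去已对齐的干净参考模型权重,再做简单特征选择;
    跨步取样仅为论文特征选择/对齐步骤的粗略占位。"""
    delta = weights.ravel() - reference.ravel()
    return delta[::10]

# 虚构数据:200 个模型,标签 1 表示"被植入木马"(权重带系统性偏移)
rng = np.random.default_rng(0)
ref = rng.normal(size=1000)
y = np.arange(200) % 2
X = np.stack([
    featurize(ref + rng.normal(scale=0.1, size=1000) + 0.2 * label, ref)
    for label in y
])

scaler = StandardScaler().fit(X)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
print("train accuracy:", clf.score(scaler.transform(X), y))
```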
[NLP-45] Usefulness of LLM s as an Author Checklist Assistant for Scientific Papers: NeurIPS24 Experiment
【速读】: 该论文试图解决在科学同行评审中使用大型语言模型(LLMs)作为辅助工具的可行性和有效性问题。解决方案的关键在于开发一个基于LLM的“检查表助手”,用于验证会议论文提交是否符合提交标准。具体来说,该助手通过检查作者提交的论文是否符合NeurIPS会议使用的检查表要求,来帮助作者确保其研究及稿件准备标准合规。实验结果表明,该助手在验证检查表完成情况方面总体上是有帮助的,超过70%的作者认为其有用,并表示会根据其反馈修改论文或检查表响应。然而,该研究也揭示了LLMs的一些常见问题,如不准确性和过度严格,以及系统可能被操纵的潜在漏洞。
链接: https://arxiv.org/abs/2411.03417
作者: Alexander Goldberg,Ihsan Ullah,Thanh Gia Hieu Khuong,Benedictus Kent Rachmat,Zhen Xu,Isabelle Guyon,Nihar B. Shah
关键词-EN: Large language models, aiding scientific peer, Large language, Neural Information Processing, scientific peer review
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Large language models (LLMs) represent a promising, but controversial, tool in aiding scientific peer review. This study evaluates the usefulness of LLMs in a conference setting as a tool for vetting paper submissions against submission standards. We conduct an experiment at the 2024 Neural Information Processing Systems (NeurIPS) conference, where 234 papers were voluntarily submitted to an “LLM-based Checklist Assistant.” This assistant validates whether papers adhere to the author checklist used by NeurIPS, which includes questions to ensure compliance with research and manuscript preparation standards. Evaluation of the assistant by NeurIPS paper authors suggests that the LLM-based assistant was generally helpful in verifying checklist completion. In post-usage surveys, over 70% of authors found the assistant useful, and 70% indicate that they would revise their papers or checklist responses based on its feedback. While causal attribution to the assistant is not definitive, qualitative evidence suggests that the LLM contributed to improving some submissions. Survey responses and analysis of re-submissions indicate that authors made substantive revisions to their submissions in response to specific feedback from the LLM. The experiment also highlights common issues with LLMs: inaccuracy (20/52) and excessive strictness (14/52) were the most frequent issues flagged by authors. We also conduct experiments to understand potential gaming of the system, which reveal that the assistant could be manipulated to enhance scores through fabricated justifications, highlighting potential vulnerabilities of automated review tools.
摘要:大语言模型(LLMs)作为一种辅助科学同行评审的工具,具有潜力,但也颇具争议。本研究评估了 LLMs 在会议环境中作为审核论文提交是否符合提交标准的工具的有效性。我们在 2024 年神经信息处理系统(NeurIPS)会议上进行了一项实验,共有 234 篇论文自愿提交给一个“基于 LLM 的检查表助手”。该助手验证了论文是否遵循 NeurIPS 使用的作者检查表,该检查表包含确保研究及稿件准备标准合规的问题。通过对 NeurIPS 论文作者的评估,结果表明基于 LLM 的助手在验证检查表完成情况方面总体上是有效的。在使用后的调查中,超过 70% 的作者认为该助手有用,并且 70% 的作者表示会根据其反馈修改论文或检查表回答。尽管对助手的因果归因尚不明确,但定性证据表明 LLM 有助于改进部分提交内容。调查回复和重新提交的分析显示,作者根据 LLM 的具体反馈对提交内容进行了实质性修改。实验还突显了 LLMs 的常见问题:不准确性(20/52)和过度严格(14/52)是作者最常报告的问题。我们还进行了实验以了解系统可能被操纵的情况,结果显示助手可能通过编造的理由来提高评分,这突显了自动化评审工具的潜在漏洞。
[NLP-46] SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction
【速读】: 该论文试图解决在复杂社会环境中,如政治辩论等群体互动场景中,如何模拟和研究多个参与者之间的动态交互问题。解决方案的关键在于提出了SAUCE平台,这是一个可定制的Python平台,允许研究人员通过配置文件轻松集成和操作多个大型语言模型(LLMs),以模拟任意主题的讨论。SAUCE平台的核心功能包括模型的实例化、响应调度、讨论历史的管理以及生成综合输出日志,其创新之处在于引入了异步通信功能,使模型不仅能决定说什么,还能决定何时发言,从而更真实地模拟人类沟通的重要方面。
链接: https://arxiv.org/abs/2411.03397
作者: Shlomo Neuberger,Niv Eckhaus,Uri Berger,Amir Taubenfeld,Gabriel Stanovsky,Ariel Goldstein
关键词-EN: political debates, arbitrarily many participants, views and agendas, complex social settings, customizable Python platform
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: this https URL
点击查看摘要
Abstract:Many human interactions, such as political debates, are carried out in group settings, where there are arbitrarily many participants, each with different views and agendas. To explore such complex social settings, we present SAUCE: a customizable Python platform, allowing researchers to plug-and-play various LLMs participating in discussions on any topic chosen by the user. Our platform takes care of instantiating the models, scheduling their responses, managing the discussion history, and producing a comprehensive output log, all customizable through configuration files, requiring little to no coding skills. A novel feature of SAUCE is our asynchronous communication feature, where models decide when to speak in addition to what to say, thus modeling an important facet of human communication. We show SAUCE’s attractiveness in two initial experiments, and invite the community to use it in simulating various group simulations.
摘要:许多人类互动,如政治辩论,是在群体环境中进行的,其中参与者数量不限,每个参与者都有不同的观点和议程。为了探索这种复杂的社会环境,我们提出了 SAUCE:一个可定制的 Python 平台,允许研究人员即插即用地接入各种大语言模型 (LLM),参与用户选择的任何主题的讨论。我们的平台负责实例化模型、调度它们的响应、管理讨论历史,并生成全面的输出日志,所有这些都可以通过配置文件进行定制,几乎不需要编程技能。SAUCE 的一个新颖功能是我们的异步通信功能,模型不仅决定说什么,还决定何时发言,从而模拟了人类沟通的一个重要方面。我们在两个初步实验中展示了 SAUCE 的吸引力,并邀请社区使用它来模拟各种群体互动。
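为说明"模型既决定说什么、也决定何时发言"的异步通信机制,下面给出一个玩具调度循环;Agent、maybe_speak 等名称均为假设,随机阈值只是对 LLM 自主发言决策的占位,SAUCE 的实际 API 与调度方式请以其代码库为准。

```python
import random

class Agent:
    """玩具版讨论参与者:每个时间片先决定是否发言,再决定说什么;
    随机阈值只是对 LLM 自主发言决策的占位。"""
    def __init__(self, name, talkativeness):
        self.name = name
        self.talkativeness = talkativeness

    def maybe_speak(self, history):
        if random.random() < self.talkativeness:
            last = history[-1] if history else "the topic"
            return f"{self.name}: my view on '{last}'"
        return None  # 本轮保持沉默

agents = [Agent("A", 0.8), Agent("B", 0.3)]
history = ["Moderator: opening statement"]
for tick in range(5):          # 同步轮询各智能体,发言与否由其自行决定
    for agent in agents:
        utterance = agent.maybe_speak(history)
        if utterance is not None:
            history.append(utterance)
print("\n".join(history))
```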
[NLP-47] Exploring Large Language Models for Specialist-level Oncology Care
【速读】: 该论文试图解决大型语言模型(LLMs)在乳腺肿瘤学这一专业领域中的应用问题,特别是在未经过专门微调的情况下,评估其在复杂医疗场景中的表现。解决方案的关键在于开发了一个名为AMIE的对话式诊断AI系统,并通过以下几个方面提升其性能:1) 使用50个合成乳腺癌病例进行评估,这些病例涵盖了治疗初诊和治疗难治性病例,模拟了多学科肿瘤委员会决策所需的关键信息;2) 制定详细的临床评分标准,评估管理计划的各个方面,包括病例总结质量、治疗方案的安全性以及化疗、放疗、手术和激素治疗的建议;3) 在推理时引入网络搜索功能,以获取最新的临床知识并优化响应;4) 采用多阶段自我批评流程来进一步完善响应。通过这些改进,AMIE在与内科住院医师、肿瘤学研究员和普通肿瘤科主治医师的对比中表现优异,但在与资深肿瘤科主治医师的比较中仍显不足,表明在该领域仍需进一步研究。
链接: https://arxiv.org/abs/2411.03395
作者: Anil Palepu,Vikram Dhillon,Polly Niravath,Wei-Hung Weng,Preethi Prasad,Khaled Saab,Ryutaro Tanno,Yong Cheng,Hanh Mai,Ethan Burns,Zainub Ajmal,Kavita Kulkarni,Philip Mansfield,Dale Webster,Joelle Barral,Juraj Gottweis,Mike Schaekermann,S. Sara Mahdavi,Vivek Natarajan,Alan Karthikesalingam,Tao Tu
关键词-EN: Large language models, shown remarkable progress, Large language, complex medical queries, complex medical
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have shown remarkable progress in encoding clinical knowledge and responding to complex medical queries with appropriate clinical reasoning. However, their applicability in subspecialist or complex medical settings remains underexplored. In this work, we probe the performance of AMIE, a research conversational diagnostic AI system, in the subspecialist domain of breast oncology care without specific fine-tuning to this challenging domain. To perform this evaluation, we curated a set of 50 synthetic breast cancer vignettes representing a range of treatment-naive and treatment-refractory cases and mirroring the key information available to a multidisciplinary tumor board for decision-making (openly released with this work). We developed a detailed clinical rubric for evaluating management plans, including axes such as the quality of case summarization, safety of the proposed care plan, and recommendations for chemotherapy, radiotherapy, surgery and hormonal therapy. To improve performance, we enhanced AMIE with the inference-time ability to perform web search retrieval to gather relevant and up-to-date clinical knowledge and refine its responses with a multi-stage self-critique pipeline. We compare response quality of AMIE with internal medicine trainees, oncology fellows, and general oncology attendings under both automated and specialist clinician evaluations. In our evaluations, AMIE outperformed trainees and fellows demonstrating the potential of the system in this challenging and important domain. We further demonstrate through qualitative examples, how systems such as AMIE might facilitate conversational interactions to assist clinicians in their decision making. However, AMIE’s performance was overall inferior to attending oncologists suggesting that further research is needed prior to consideration of prospective uses.
摘要:大语言模型(LLMs)在编码临床知识和通过适当的临床推理响应复杂医疗查询方面展示了显著的进步。然而,其在专科或复杂医疗环境中的适用性仍未得到充分探索。在本研究中,我们探究了AMIE(一种研究性对话诊断AI系统)在未针对这一具有挑战性的领域进行特定微调的情况下,在乳腺肿瘤学专科领域的性能。为了进行这项评估,我们精心挑选了50个合成乳腺癌病例,涵盖了初治(treatment-naive)和难治性(treatment-refractory)的多种病例,并反映了多学科肿瘤委员会决策时可获得的关键信息(随本研究公开发布)。我们制定了一个详细的临床评分标准来评估管理计划,包括病例总结质量、所提议护理计划的安全性以及化疗、放疗、手术和激素治疗的建议等维度。为了提升性能,我们增强了AMIE在推理时进行网络搜索以获取相关且最新的临床知识的能力,并通过多阶段自我批评管道来优化其响应。我们将AMIE的响应质量与内科住院医师、肿瘤学研究员和普通肿瘤科主治医师的响应进行了比较,评估方式包括自动化评估和专科临床医师评估。在我们的评估中,AMIE的表现优于住院医师和研究员,展示了该系统在这一具有挑战性和重要领域的潜力。我们还通过定性示例进一步展示了AMIE等系统如何促进对话交互以辅助临床医师的决策过程。然而,AMIE的整体表现仍不及肿瘤科主治医师,这表明在考虑实际应用之前,仍需进一步的研究。
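文中“推理时网络检索 + 多阶段自我批评”的流程,可以抽象为“检索 -> 生成 -> 批评 -> 修订”的循环。以下为一个示意性骨架(Python),其中 `llm` 与 `web_search` 均为占位函数,并非 AMIE 的真实接口;轮数等参数亦为演示取值。

```python
def llm(prompt: str) -> str:
    return "…模型输出…"                      # 占位:任意对话模型

def web_search(query: str, k: int = 3) -> list:
    return ["…检索到的临床指南摘要…"] * k     # 占位:任意搜索接口

def answer_with_self_critique(case: str, rounds: int = 2) -> str:
    # 第一步:检索最新临床知识,拼入提示词
    evidence = "\n".join(web_search(f"breast cancer management: {case[:100]}"))
    draft = llm(f"病例:{case}\n参考资料:{evidence}\n请给出管理计划。")
    # 第二步:多阶段自我批评,循环“批评 -> 修订”
    for _ in range(rounds):
        critique = llm(f"请从安全性与完整性两方面批评以下计划:\n{draft}")
        draft = llm(f"根据批评意见修订计划。\n计划:{draft}\n批评:{critique}")
    return draft
```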
[NLP-48] A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在参数规模大、计算需求高、隐私问题突出、特定领域表现不佳等问题。解决方案的关键在于推广和标准化小型语言模型(Small Language Models, SLMs),这些模型具有低推理延迟、成本效益高、易于开发和定制等优势,特别适用于资源受限的环境和领域知识获取。论文提出了通过定义SLMs的能力范围和适用性来标准化其定义,并提供了一个分类框架和通用开发框架,以有效增强和利用SLMs。
链接: https://arxiv.org/abs/2411.03350
作者: Fali Wang,Zhiwei Zhang,Xianren Zhang,Zongyu Wu,Tzuhao Mo,Qiuhao Lu,Wanjing Wang,Rui Li,Junjie Xu,Xianfeng Tang,Qi He,Yao Ma,Ming Huang,Suhang Wang
关键词-EN: Large language models, Small Language Models, question answering, domain knowledge acquisition, text generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 76 pages, 26 figures, 14 tables
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated emergent abilities in text generation, question answering, and reasoning, facilitating various tasks and domains. Despite their proficiency in various tasks, LLMs like PaLM 540B and Llama-3.1 405B face limitations due to large parameter sizes and computational demands, often requiring cloud API use which raises privacy concerns, limits real-time applications on edge devices, and increases fine-tuning costs. Additionally, LLMs often underperform in specialized domains such as healthcare and law due to insufficient domain-specific knowledge, necessitating specialized models. Therefore, Small Language Models (SLMs) are increasingly favored for their low inference latency, cost-effectiveness, efficient development, and easy customization and adaptability. These models are particularly well-suited for resource-limited environments and domain knowledge acquisition, addressing LLMs’ challenges and proving ideal for applications that require localized data handling for privacy, minimal inference latency for efficiency, and domain knowledge acquisition through lightweight fine-tuning. The rising demand for SLMs has spurred extensive research and development. However, a comprehensive survey investigating issues related to the definition, acquisition, application, enhancement, and reliability of SLM remains lacking, prompting us to conduct a detailed survey on these topics. The definition of SLMs varies widely, thus to standardize, we propose defining SLMs by their capability to perform specialized tasks and suitability for resource-constrained settings, setting boundaries based on the minimal size for emergent abilities and the maximum size sustainable under resource constraints. For other aspects, we provide a taxonomy of relevant models/methods and develop general frameworks for each category to enhance and utilize SLMs effectively.
摘要:大语言模型(LLM)在文本生成、问答和推理等任务中展现了涌现能力,促进了多种任务和领域的发展。尽管这些模型如PaLM 540B和Llama-3.1 405B在多项任务中表现出色,但由于参数规模庞大和计算需求高,它们面临着诸多限制,通常需要使用云API,这引发了隐私问题,限制了边缘设备上的实时应用,并增加了微调成本。此外,大语言模型在医疗和法律等专业领域的表现往往不尽如人意,因为它们缺乏足够的领域特定知识,这需要专门的模型来弥补。因此,小型语言模型(SLM)因其低推理延迟、成本效益、高效的开发过程以及易于定制和适应性而越来越受到青睐。这些模型特别适合资源受限的环境和领域知识获取,能够解决大语言模型面临的挑战,并证明在需要本地化数据处理以保护隐私、最小化推理延迟以提高效率以及通过轻量级微调获取领域知识的应用中是理想的选择。随着对小型语言模型的需求不断增加,相关研究和开发也得到了广泛推动。然而,关于小型语言模型的定义、获取、应用、增强和可靠性等方面的全面调查仍然缺乏,这促使我们对此进行详细调查。小型语言模型的定义各异,因此为了标准化,我们建议根据其执行专门任务的能力和适合资源受限环境的特点来定义小型语言模型,并基于涌现能力的最小规模和资源约束下的最大可持续规模来设定界限。对于其他方面,我们提供了一个相关模型/方法的分类法,并为每个类别开发了通用框架,以有效增强和利用小型语言模型。
[NLP-49] RuAG: Learned-rule-augmented Generation for Large Language Models
【速读】: 该论文试图解决大语言模型(LLMs)在推理过程中由于上下文窗口大小限制而导致的外部知识注入不足的问题。解决方案的关键在于提出了一种名为RuAG的新框架,该框架能够自动将大量离线数据提炼成可解释的一阶逻辑规则(first-order logic rules),并将这些规则注入到LLMs中以增强其推理能力。具体步骤包括:首先,利用LLMs的常识知识来定义搜索过程中的头部和体部谓词;接着,采用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来处理组合搜索空间,从而高效地从数据中发现逻辑规则;最后,将这些逻辑规则翻译成自然语言,以便进行有针对性的知识注入,并将其无缝集成到LLM的提示中,用于下游任务的推理。
链接: https://arxiv.org/abs/2411.03349
作者: Yudi Zhang,Pei Xiao,Lu Wang,Chaoyun Zhang,Meng Fang,Yali Du,Yevgeniy Puzyrev,Randolph Yao,Si Qin,Qingwei Lin,Mykola Pechenizkiy,Dongmei Zhang,Saravan Rajmohan,Qi Zhang
关键词-EN: contextual window size, limited contextual window, In-context learning, Retrieval-Augmented Generation, incorporating external knowledge
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In-context learning (ICL) and Retrieval-Augmented Generation (RAG) have gained attention for their ability to enhance LLMs’ reasoning by incorporating external knowledge but suffer from limited contextual window size, leading to insufficient information injection. To this end, we propose a novel framework, RuAG, to automatically distill large volumes of offline data into interpretable first-order logic rules, which are injected into LLMs to boost their reasoning capabilities. Our method begins by formulating the search process relying on LLMs’ commonsense, where LLMs automatically define head and body predicates. Then, RuAG applies Monte Carlo Tree Search (MCTS) to address the combinational searching space and efficiently discover logic rules from data. The resulting logic rules are translated into natural language, allowing targeted knowledge injection and seamless integration into LLM prompts for LLM’s downstream task reasoning. We evaluate our framework on public and private industrial tasks, including natural language processing, time-series, decision-making, and industrial tasks, demonstrating its effectiveness in enhancing LLM’s capability over diverse tasks.
摘要:上下文学习 (In-context learning) 和检索增强生成 (Retrieval-Augmented Generation, RAG) 因其能够通过整合外部知识来增强大语言模型 (LLM) 的推理能力而受到关注,但它们受限于有限的上下文窗口大小,导致信息注入不足。为此,我们提出了一种新颖的框架——RuAG,该框架能够自动将大量离线数据提炼成可解释的一阶逻辑规则,并将这些规则注入到 LLM 中以提升其推理能力。我们的方法首先依赖于 LLM 的常识来制定搜索过程,其中 LLM 自动定义头部和主体谓词。接着,RuAG 应用蒙特卡洛树搜索 (Monte Carlo Tree Search, MCTS) 来处理组合搜索空间,并从数据中高效地发现逻辑规则。生成的逻辑规则被转化为自然语言,从而实现有针对性的知识注入,并能无缝集成到 LLM 的提示中,以支持 LLM 在下游任务中的推理。我们在公开和私有的工业任务中评估了我们的框架,包括自然语言处理、时间序列、决策制定和工业任务,结果表明该框架在提升 LLM 在多样化任务中的能力方面具有显著效果。
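为直观展示“从数据中搜索逻辑规则”这一步,下面给出一个简化示意:以规则精度为打分,用贪心方式向规则体中逐个加入谓词。需要强调:原文用 MCTS 处理组合搜索空间,此处为便于阅读改用贪心近似;数据与谓词均为虚构示例。

```python
# 简化示意:在表格数据上搜索形如 body -> head 的规则。
# 原文用 MCTS 探索谓词组合,此处改用贪心近似;谓词与数据均为虚构示例。

ROWS = [
    {"fever": 1, "cough": 1, "flu": 1},
    {"fever": 1, "cough": 0, "flu": 0},
    {"fever": 0, "cough": 1, "flu": 0},
    {"fever": 1, "cough": 1, "flu": 1},
]

def precision(body: list, head: str) -> float:
    """规则精度:满足 body 的样本中,head 也成立的比例。"""
    covered = [r for r in ROWS if all(r[p] for p in body)]
    if not covered:
        return 0.0
    return sum(r[head] for r in covered) / len(covered)

def greedy_rule(head: str, candidates: list, max_len: int = 2) -> list:
    body: list = []
    while len(body) < max_len:
        best = max((p for p in candidates if p not in body),
                   key=lambda p: precision(body + [p], head), default=None)
        if best is None or precision(body + [best], head) <= precision(body, head):
            break          # 没有谓词能进一步提高精度则停止
        body.append(best)
    return body

if __name__ == "__main__":
    rule = greedy_rule("flu", ["fever", "cough"])
    print(f"IF {' AND '.join(rule)} THEN flu (precision={precision(rule, 'flu'):.2f})")
```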
[NLP-50] What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
【速读】: 该论文试图解决大语言模型(LLMs)的安全性和可靠性研究中,关于“越狱”攻击机制的理解问题。解决方案的关键在于比较线性和非线性方法,以研究提示中导致成功越狱的特征。通过仅基于提示标记对应的潜在表示部分来探测越狱成功与否,研究发现不同的越狱方法依赖于提示中的不同非线性特征。尽管探测方法能够高度准确地区分成功和不成功的越狱提示,但它们在未见过的攻击方法上的迁移性较差。此外,非线性探测方法可以用于机制性地越狱LLM,通过指导对抗性潜在扰动的设计,实现比训练时使用的35种技术中的34种更可靠的越狱。最终,研究结果表明,越狱不能仅通过普遍或线性的提示特征来完全理解。
链接: https://arxiv.org/abs/2411.03343
作者: Nathalie Maria Kirch,Severin Field,Stephen Casper
关键词-EN: large language models, large language, central to research, safety and reliability, underlying mechanisms
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While `jailbreaks’ have been central to research on the safety and reliability of LLMs (large language models), the underlying mechanisms behind these attacks are not well understood. Some prior works have used linear methods to analyze jailbreak prompts or model refusal. Here, however, we compare linear and nonlinear methods to study the features in prompts that contribute to successful jailbreaks. We do this by probing for jailbreak success based only on the portions of the latent representations corresponding to prompt tokens. First, we introduce a dataset of 10,800 jailbreak attempts from 35 attack methods. We then show that different jailbreaking methods work via different nonlinear features in prompts. Specifically, we find that while probes can distinguish between successful and unsuccessful jailbreaking prompts with a high degree of accuracy, they often transfer poorly to held-out attack methods. We also show that nonlinear probes can be used to mechanistically jailbreak the LLM by guiding the design of adversarial latent perturbations. These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on. Ultimately, our results suggest that jailbreaks cannot be thoroughly understood in terms of universal or linear prompt features alone.
摘要:尽管“越狱”(jailbreaks)一直是研究大语言模型(LLM, Large Language Model)安全性和可靠性的核心问题,但这些攻击背后的机制尚未被充分理解。一些先前的研究使用线性方法来分析越狱提示或模型拒绝的情况。然而,本文通过比较线性与非线性方法,探讨了提示中导致成功越狱的特征。我们通过仅基于与提示Token对应的潜在表示部分来探测越狱成功与否。首先,我们引入了一个包含10,800次越狱尝试的数据集,这些尝试来自35种攻击方法。随后,我们展示了不同的越狱方法通过提示中的不同非线性特征起作用。具体而言,我们发现尽管探测器能够以高准确度区分成功与不成功的越狱提示,但它们在应用于未见过的攻击方法时迁移性较差。我们还展示了非线性探测器可以通过指导对抗性潜在扰动的设计来实现对大语言模型的机制性越狱。这些机制性越狱对 Gemma-7B-IT 的越狱可靠性,超过了其训练时所用的 35 种技术中的 34 种。最终,我们的研究结果表明,仅凭普遍或线性提示特征无法全面理解越狱现象。
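文中“仅凭提示词对应的潜在表示探测越狱成败”的做法,可以用下面的小实验说明:在(假设已抽取好的)表示上分别训练线性与非线性探测器做二分类。这里的特征与标签均为随机构造的占位数据,仅演示流程。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# 占位数据:实际中 X 应为提示词 token 的隐层表示(例如按 token 取平均后的向量)
X = rng.normal(size=(1000, 64))
y = (np.sin(3 * X[:, 0]) + X[:, 1] ** 2 > 1).astype(int)  # 人为构造的非线性标签

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)        # 线性探测器
nonlinear_probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                random_state=0).fit(X_tr, y_tr)         # 非线性探测器

print("linear probe acc:   ", linear_probe.score(X_te, y_te))
print("nonlinear probe acc:", nonlinear_probe.score(X_te, y_te))
```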
[NLP-51] Unlocking the Archives: Using Large Language Models to Transcribe Handwritten Historical Documents
【速读】: 该论文试图解决历史手写文档的高效、准确转录问题。解决方案的关键在于利用大型语言模型 (Large Language Models, LLMs) 进行自动转录和校正,相较于传统的专用手写文本识别 (Handwritten Text Recognition, HTR) 软件,LLMs 将字符错误率 (Character Error Rate, CER) 和词错误率 (Word Error Rate, WER) 分别降低了14%和32%,并且在速度和成本上具有显著优势。论文中提出的开源软件工具 Transcription Pearl 整合了来自 OpenAI、Anthropic 和 Google 的商业多模态 LLMs,能够在批量处理手写文档时实现接近人类水平的准确性,同时大幅提高转录效率和降低成本。
链接: https://arxiv.org/abs/2411.03340
作者: Mark Humphries,Lianne C. Leddy,Quinn Downton,Meredith Legace,John McConnell,Isabella Murray,Elizabeth Spence
关键词-EN: Large Language Models, Handwritten Text Recognition, Language Models, Character Error Rates, Word Error Rates
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: 29 Pages, 11 Tables, 2 Figures
点击查看摘要
Abstract:This study demonstrates that Large Language Models (LLMs) can transcribe historical handwritten documents with significantly higher accuracy than specialized Handwritten Text Recognition (HTR) software, while being faster and more cost-effective. We introduce an open-source software tool called Transcription Pearl that leverages these capabilities to automatically transcribe and correct batches of handwritten documents using commercially available multimodal LLMs from OpenAI, Anthropic, and Google. In tests on a diverse corpus of 18th/19th century English language handwritten documents, LLMs achieved Character Error Rates (CER) of 5.7 to 7% and Word Error Rates (WER) of 8.9 to 15.9%, improvements of 14% and 32% respectively over specialized state-of-the-art HTR software like Transkribus. Most significantly, when LLMs were then used to correct those transcriptions as well as texts generated by conventional HTR software, they achieved near-human levels of accuracy, that is CERs as low as 1.8% and WERs of 3.5%. The LLMs also completed these tasks 50 times faster and at approximately 1/50th the cost of proprietary HTR programs. These results demonstrate that when LLMs are incorporated into software tools like Transcription Pearl, they provide an accessible, fast, and highly accurate method for mass transcription of historical handwritten documents, significantly streamlining the digitization process.
摘要:本研究展示了大型语言模型(Large Language Models, LLMs)在转录历史手写文档时,其准确性显著高于专用手写文本识别(Handwritten Text Recognition, HTR)软件,同时速度更快且更具成本效益。我们引入了一款名为Transcription Pearl的开源软件工具,该工具利用这些能力,通过使用OpenAI、Anthropic和Google等公司提供的商用多模态LLMs,自动转录和校正批量手写文档。在对18/19世纪英语手写文档的多样化语料库进行测试时,LLMs实现了5.7%至7%的字符错误率(Character Error Rates, CER)和8.9%至15.9%的单词错误率(Word Error Rates, WER),相较于Transkribus等专用最先进的HTR软件,错误率分别降低了14%和32%。最重要的是,当LLMs用于校正这些转录文本以及传统HTR软件生成的文本时,它们达到了接近人类水平的准确性,即CER低至1.8%,WER为3.5%。此外,LLMs完成这些任务的速度是专有HTR程序的50倍,成本约为后者的1/50。这些结果表明,当LLMs被整合到Transcription Pearl等软件工具中时,它们提供了一种便捷、快速且高度准确的大规模转录历史手写文档的方法,显著简化了数字化过程。
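文中的 CER 与 WER 分别是字符级和词级的编辑距离除以参考文本长度。下面给出一个自包含的计算示例(标准 Levenshtein 动态规划,非论文配套代码)。

```python
def edit_distance(ref: list, hyp: list) -> int:
    """标准 Levenshtein 距离(插入/删除/替换代价均为 1),滚动数组实现。"""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # 删除
                                     dp[j - 1] + 1,    # 插入
                                     prev + (r != h))  # 替换或匹配
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

if __name__ == "__main__":
    ref = "the quick brown fox"
    hyp = "the quik brown fox jumps"
    print(f"CER={cer(ref, hyp):.3f}  WER={wer(ref, hyp):.3f}")
```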
[NLP-52] Will Trump Win in 2024? Predicting the US Presidential Election via Multi-step Reasoning with Large Language Models
【速读】: 该论文试图解决大型语言模型(LLMs)在预测选举结果方面的能力问题。解决方案的关键在于引入了一个多步骤推理框架,专门用于政治分析。该框架通过整合多源数据,包括美国国家选举研究(ANES)的实际数据和合成人物数据,来建模选民行为。为了捕捉时间动态,模型还纳入了候选人的政策立场和传记细节,确保模型能够适应不断变化的政治环境。通过使用思维链提示(Chain of Thought prompting),该框架系统地整合了人口统计学、意识形态和时间依赖性因素,从而增强了模型的预测能力。此外,该框架还被应用于预测2024年美国总统选举的结果,展示了LLMs对未见政治数据的适应性。
链接: https://arxiv.org/abs/2411.03321
作者: Chenxiao Yu,Zhaotian Weng,Zheng Li,Xiyang Hu,Yue Zhao
关键词-EN: Large Language Models, Large Language, Language Models, accurately predict election, National Election Studies
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: This research is ongoing work. Xiyang Hu and Yue Zhao are the corresponding authors
点击查看摘要
Abstract:Can Large Language Models (LLMs) accurately predict election outcomes? While LLMs have demonstrated impressive performance in various domains, including healthcare, legal analysis, and creative tasks, their ability to forecast elections remains unknown. Election prediction poses unique challenges, such as limited voter-level data, rapidly changing political landscapes, and the need to model complex human behavior. To address these challenges, we introduce a multi-step reasoning framework designed for political analysis. Our approach is validated on real-world data from the American National Election Studies (ANES) 2016 and 2020, as well as synthetic personas generated by the leading machine learning framework, offering scalable datasets for voter behavior modeling. To capture temporal dynamics, we incorporate candidates’ policy positions and biographical details, ensuring that the model adapts to evolving political contexts. Drawing on Chain of Thought prompting, our multi-step reasoning pipeline systematically integrates demographic, ideological, and time-dependent factors, enhancing the model’s predictive power. Additionally, we apply our framework to predict the outcome of the 2024 U.S. presidential election in advance, demonstrating the adaptability of LLMs to unseen political data.
摘要:大语言模型 (LLM) 能否准确预测选举结果?尽管 LLM 在医疗、法律分析和创意任务等多个领域展示了令人印象深刻的表现,但其预测选举的能力仍未可知。选举预测面临独特的挑战,如有限的选民级别数据、快速变化的政治环境和复杂人类行为的建模需求。为应对这些挑战,我们提出了一种专为政治分析设计的多步骤推理框架。我们的方法通过美国国家选举研究 (ANES) 2016 和 2020 年的真实数据以及由领先机器学习框架生成的合成人物数据进行验证,提供了可扩展的选民行为建模数据集。为捕捉时间动态,我们整合了候选人的政策立场和传记细节,确保模型适应不断变化的政治环境。借鉴思维链提示 (Chain of Thought prompting),我们的多步骤推理流程系统地整合了人口统计、意识形态和时间依赖因素,增强了模型的预测能力。此外,我们应用该框架提前预测 2024 年美国大选结果,展示了 LLM 对未见政治数据的适应性。
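“合成人物 + 多步骤推理”的预测流程大致是:为每个(合成)选民构造包含人口统计、意识形态与候选人信息的提示,让模型逐步推理后输出投票,再对所有投票做聚合。以下为高度简化的示意,`llm` 为占位函数,人物字段与判定规则均为虚构。

```python
from collections import Counter

def llm(prompt: str) -> str:
    # 占位:真实流程中由大模型完成逐步推理,最后仅输出候选人名字
    return "Candidate A" if "urban" in prompt else "Candidate B"

PERSONAS = [  # 虚构的合成人物;实际可来自 ANES 数据或人物生成框架
    {"age": 34, "residence": "urban", "ideology": "liberal"},
    {"age": 61, "residence": "rural", "ideology": "conservative"},
    {"age": 45, "residence": "urban", "ideology": "moderate"},
]

def predict_election(personas: list, context: str) -> Counter:
    votes = Counter()
    for p in personas:
        prompt = (
            f"背景信息(候选人政策立场与近况):{context}\n"
            f"选民画像:{p}\n"
            "请分步推理该选民的投票倾向,最后仅输出候选人名字。"
        )
        votes[llm(prompt)] += 1          # 聚合所有模拟选民的投票
    return votes

if __name__ == "__main__":
    print(predict_election(PERSONAS, "候选人 A 主张……;候选人 B 主张……"))
```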
[NLP-53] Multilingual Large Language Models and Curse of Multilinguality
【速读】: 该论文旨在为多语言大型语言模型(Multilingual Large Language Models, LLMs)提供技术概览,并探讨其局限性及应对策略。其关键在于全面介绍多语言LLMs的底层架构、目标函数、预训练数据源和分词方法,以及不同模型类型的独特特征,如仅编码器模型(mBERT, XLM-R)、仅解码器模型(XGLM, PALM, BLOOM, GPT-3)和编码器-解码器模型(mT5, mBART)。此外,论文还重点讨论了多语言LLMs面临的主要限制——多语言诅咒(curse of multilinguality),并探讨了当前克服这一问题的尝试。
链接: https://arxiv.org/abs/2406.10602
作者: Daniil Gurgurov,Tanja Bäumel,Tatiana Anikina
关键词-EN: Natural Language Processing, gained large popularity, Multilingual Large Language, Large Language Models, Language Processing
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Multilingual Large Language Models (LLMs) have gained large popularity among Natural Language Processing (NLP) researchers and practitioners. These models, trained on huge datasets, show proficiency across various languages and demonstrate effectiveness in numerous downstream tasks. This paper navigates the landscape of multilingual LLMs, providing an introductory overview of their technical aspects. It explains underlying architectures, objective functions, pre-training data sources, and tokenization methods. This work explores the unique features of different model types: encoder-only (mBERT, XLM-R), decoder-only (XGLM, PALM, BLOOM, GPT-3), and encoder-decoder models (mT5, mBART). Additionally, it addresses one of the significant limitations of multilingual LLMs - the curse of multilinguality - and discusses current attempts to overcome it.
摘要:多语言大语言模型(Multilingual Large Language Models, LLMs)在自然语言处理(Natural Language Processing, NLP)研究者和从业者中获得了极大的关注。这些模型通过在庞大的数据集上进行训练,展示了在多种语言中的熟练度,并在众多下游任务中表现出色。本文探讨了多语言大语言模型的技术概况,提供了对其技术层面的入门级概述。文章解释了其底层架构、目标函数、预训练数据来源以及Token化方法。本文还探讨了不同模型类型的独特特征:仅编码器模型(如 mBERT, XLM-R)、仅解码器模型(如 XGLM, PALM, BLOOM, GPT-3)以及编码器-解码器模型(如 mT5, mBART)。此外,文章还讨论了多语言大语言模型的一个重要局限性——多语言诅咒(curse of multilinguality),并探讨了当前克服这一问题的尝试。
人工智能
[AI-0] Fed-EC: Bandwidth-Efficient Clustering-Based Federated Learning For Autonomous Visual Robot Navigation
链接: https://arxiv.org/abs/2411.04112
作者: Shreya Gummadi,Mateus V. Gasparino,Deepak Vasisht,Girish Chowdhary
关键词-EN: poses significant challenges, central server, bandwidth consumption, poses significant, privacy and bandwidth
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Centralized learning requires data to be aggregated at a central server, which poses significant challenges in terms of data privacy and bandwidth consumption. Federated learning presents a compelling alternative, however, vanilla federated learning methods deployed in robotics aim to learn a single global model across robots that works ideally for all. But in practice one model may not be well suited for robots deployed in various environments. This paper proposes Federated-EmbedCluster (Fed-EC), a clustering-based federated learning framework that is deployed with vision based autonomous robot navigation in diverse outdoor environments. The framework addresses the key federated learning challenge of deteriorating model performance of a single global model due to the presence of non-IID data across real-world robots. Extensive real-world experiments validate that Fed-EC reduces the communication size by 23x for each robot while matching the performance of centralized learning for goal-oriented navigation and outperforms local learning. Fed-EC can transfer previously learnt models to new robots that join the cluster.
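Fed-EC 这类“基于聚类的联邦学习”的核心思想是:服务器端先按相似性把客户端分组,再在组内做模型平均,使每个簇维护一个适配其部署环境的共享模型。下面用 NumPy + scikit-learn 给出一个与论文实现无关的最小示意(对扁平化参数做 KMeans 聚类 + 组内 FedAvg,细节均为假设)。

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_federated_averaging(client_params: np.ndarray, n_clusters: int = 2):
    """client_params: (num_clients, num_weights) 的扁平化参数矩阵。
    返回 ({簇编号: 簇内平均模型}, 每个客户端的簇标签)。"""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(client_params)
    models = {c: client_params[labels == c].mean(axis=0)   # 组内 FedAvg
              for c in range(n_clusters)}
    return models, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 模拟两类部署环境(例如不同地形中的机器人),本地模型参数分布不同
    params = np.vstack([rng.normal(0.0, 0.1, size=(5, 8)),
                        rng.normal(1.0, 0.1, size=(5, 8))])
    models, labels = cluster_federated_averaging(params)
    print("cluster labels:", labels)
```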
[AI-1] RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models NEURIPS2024
链接: https://arxiv.org/abs/2411.04097
作者: Maya Varma,Jean-Benoit Delbrouck,Zhihong Chen,Akshay Chaudhari,Curtis Langlotz
关键词-EN: degraded zero-shot performance, spurious correlations, image features, textual attributes, resulting in degraded
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.
[AI-2] Non-Stationary Learning of Neural Networks with Automatic Soft Parameter Reset
链接: https://arxiv.org/abs/2411.04034
作者: Alexandre Galashov,Michalis K. Titsias,András György,Clare Lyle,Razvan Pascanu,Yee Whye Teh,Maneesh Sahani
关键词-EN: Neural networks, networks are traditionally, traditionally trained, Neural, stationary distribution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Neural networks are traditionally trained under the assumption that data come from a stationary distribution. However, settings which violate this assumption are becoming more popular; examples include supervised learning under distributional shifts, reinforcement learning, continual learning and non-stationary contextual bandits. In this work we introduce a novel learning approach that automatically models and adapts to non-stationarity, via an Ornstein-Uhlenbeck process with an adaptive drift parameter. The adaptive drift tends to draw the parameters towards the initialisation distribution, so the approach can be understood as a form of soft parameter reset. We show empirically that our approach performs well in non-stationary supervised and off-policy reinforcement learning settings.
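文中的“软参数重置”可以写成带漂移项的 Ornstein-Uhlenbeck 更新:漂移项不断把参数拉回初始化分布。下面是单步更新的最小示意(Python);论文中的漂移系数是自适应学习的,此处固定为常数,仅作演示。

```python
import numpy as np

def ou_soft_reset_step(theta, theta_init, grad, lr=1e-2,
                       drift=0.05, sigma=0.0, dt=1.0, rng=None):
    """一步梯度更新叠加 OU 漂移:
    theta <- theta - lr*grad - drift*(theta - theta_init)*dt + sigma*sqrt(dt)*eps
    drift 越大,参数被拉回初始化的力度越强(即“软重置”越强)。"""
    rng = rng or np.random.default_rng(0)
    noise = sigma * np.sqrt(dt) * rng.standard_normal(np.shape(theta))
    return theta - lr * grad - drift * (theta - theta_init) * dt + noise

if __name__ == "__main__":
    theta_init = np.zeros(4)
    theta = np.array([2.0, -1.0, 0.5, 3.0])  # 假设训练后参数已漂离初始化
    for _ in range(100):                      # 无梯度时参数按指数衰减回 theta_init
        theta = ou_soft_reset_step(theta, theta_init, grad=np.zeros(4))
    print(theta)  # 各分量接近 0
```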
[AI-3] Predicting and Publishing Accurate Imbalance Prices Using Monte Carlo Tree Search
链接: https://arxiv.org/abs/2411.04011
作者: Fabio Pavirani,Jonas Van Gompel,Seyed Soroush Karimi Madahi,Bert Claessens,Chris Develder
关键词-EN: renewable energy sources, energy sources, solar and wind, uncontrollable production, growing reliance
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The growing reliance on renewable energy sources, particularly solar and wind, has introduced challenges due to their uncontrollable production. This complicates maintaining the electrical grid balance, prompting some transmission system operators in Western Europe to implement imbalance tariffs that penalize unsustainable power deviations. These tariffs create an implicit demand response framework to mitigate grid instability. Yet, several challenges limit active participation. In Belgium, for example, imbalance prices are only calculated at the end of each 15-minute settlement period, creating high risk due to price uncertainty. This risk is further amplified by the inherent volatility of imbalance prices, discouraging participation. Although transmission system operators provide minute-based price predictions, the system imbalance volatility makes accurate price predictions challenging to obtain and requires sophisticated techniques. Moreover, publishing price estimates can prompt participants to adjust their schedules, potentially affecting the system balance and the final price, adding further complexity. To address these challenges, we propose a Monte Carlo Tree Search method that publishes accurate imbalance prices while accounting for potential response actions. Our approach models the system dynamics using a neural network forecaster and a cluster of virtual batteries controlled by reinforcement learning agents. Compared to Belgium’s current publication method, our technique improves price accuracy by 20.4% under ideal conditions and by 12.8% in more realistic scenarios. This research addresses an unexplored, yet crucial problem, positioning this paper as a pioneering work in analyzing the potential of more advanced imbalance price publishing techniques.
[AI-4] Aligning Characteristic Descriptors with Images for Human-Expert-like Explainability
链接: https://arxiv.org/abs/2411.04008
作者: Bharat Chandra Yalavarthi,Nalini Ratha
关键词-EN: supporting informed decision-making, ensuring user trust, deep learning models, informed decision-making, interpret the outputs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In mission-critical domains such as law enforcement and medical diagnosis, the ability to explain and interpret the outputs of deep learning models is crucial for ensuring user trust and supporting informed decision-making. Despite advancements in explainability, existing methods often fall short in providing explanations that mirror the depth and clarity of those given by human experts. Such expert-level explanations are essential for the dependable application of deep learning models in law enforcement and medical contexts. Additionally, we recognize that most explanations in real-world scenarios are communicated primarily through natural language. Addressing these needs, we propose a novel approach that utilizes characteristic descriptors to explain model decisions by identifying their presence in images, thereby generating expert-like explanations. Our method incorporates a concept bottleneck layer within the model architecture, which calculates the similarity between image and descriptor encodings to deliver inherent and faithful explanations. Through experiments in face recognition and chest X-ray diagnosis, we demonstrate that our approach offers a significant contrast over existing techniques, which are often limited to the use of saliency maps. We believe our approach represents a significant step toward making deep learning systems more accountable, transparent, and trustworthy in the critical domains of face recognition and medical diagnosis.
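“概念瓶颈层”的基本做法是:只用图像编码与一组特征描述符编码之间的相似度来做最终决策,这样“命中了哪些描述符”本身就构成解释。以下为余弦相似度版本的最小示意,图像/文本编码器均用随机向量占位,描述符为虚构示例。

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

DESCRIPTORS = ["边界清晰的结节", "双侧不对称", "弥漫性浸润影"]  # 虚构的特征描述符

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(1, 128))           # 占位:图像编码器输出
    descriptor_embs = rng.normal(size=(3, 128))     # 占位:描述符文本编码器输出
    scores = cosine(image_emb, descriptor_embs)[0]  # 概念瓶颈:决策只依赖这组相似度
    for d, s in sorted(zip(DESCRIPTORS, scores), key=lambda t: -t[1]):
        print(f"{d}: {s:+.3f}")  # 解释即“模型认为图像中存在哪些描述符”
```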
[AI-5] Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval
链接: https://arxiv.org/abs/2411.04006
作者: Davide Buoso,Luke Robinson,Giuseppe Averta,Philip Torr,Tim Franzmeyer,Daniele De Martini
关键词-EN: high-level robot planning, study explores, explores the potential, high-level robot, robot planning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This study explores the potential of off-the-shelf Vision-Language Models (VLMs) for high-level robot planning in the context of autonomous navigation. Indeed, while most existing learning-based approaches for path planning require extensive task-specific training/fine-tuning, we demonstrate how such training can be avoided for most practical cases. To do this, we introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning which completely eliminates the need for fine-tuning or specialised training. By leveraging structured Visual Question-Answering (VQA) and In-Context Learning (ICL), our approach drastically reduces the need for data collection, requiring a fraction of the task-specific data typically used by trained models, or even relying only on online data. Our method facilitates the effective use of a generally trained VLM in a flexible and cost-efficient way, and does not require additional sensing except for a simple monocular camera. We demonstrate its adaptability across various scene types, context sources, and sensing setups. We evaluate our approach in two distinct scenarios: traditional First-Person View (FPV) and infrastructure-driven Third-Person View (TPV) navigation, demonstrating the flexibility and simplicity of our method. Our technique significantly enhances the navigational capabilities of a baseline VLM by approximately 50% in the TPV scenario, and is comparable to trained models in the FPV one, with as few as 20 demonstrations.
[AI-6] ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks SOCC
链接: https://arxiv.org/abs/2411.03999
作者: Ziji Shi,Jialin Li,Yang You
关键词-EN: Generative Adversarial Networks, involving Generative Adversarial, Adversarial Networks, Generative Adversarial, fueled numerous applications
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: Accepted at ACM Symposium on Cloud Computing (SoCC) 2024
点击查看摘要
Abstract:Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks for training, posing significant resource and time constraints. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, resulting in over 30% improvements in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.
[AI-7] owards Resource-Efficient Federated Learning in Industrial IoT for Multivariate Time Series Analysis
链接: https://arxiv.org/abs/2411.03996
作者: Alexandros Gkillas,Aris Lalos
关键词-EN: missing data constitute, industrial applications, constitute a thorny, anomaly detection, learning enabled anomaly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Anomaly and missing data constitute a thorny problem in industrial applications. In recent years, deep learning enabled anomaly detection has emerged as a critical direction, however the improved detection accuracy is achieved with the utilization of large neural networks, increasing their storage and computational cost. Moreover, the data collected in edge devices contain user privacy, introducing challenges that can be successfully addressed by the privacy-preserving distributed paradigm, known as federated learning (FL). This framework allows edge devices to train and exchange models, which also increases the communication cost. Thus, to deal with the increased communication, processing and storage challenges of FL-based deep anomaly detection, NN pruning is expected to have significant benefits towards reducing the processing, storage and communication complexity. With this focus, a novel compression-based optimization problem is proposed at the server-side of the FL paradigm, which fuses the received local models and performs pruning, generating a more compressed model. Experiments in the context of anomaly detection and missing value imputation demonstrate that the proposed FL scenario along with the proposed compression-based method are able to achieve high compression rates (more than 99.7%) with negligible performance losses (less than 1.18%) as compared to the centralized solutions.
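文中服务器端“融合 + 剪枝”的思路可拆成两步:先平均收到的本地模型,再按权重幅值做全局剪枝得到更紧凑的模型。以下是幅值剪枝这一步的最小示意(阈值选取与稀疏率均为演示取值,并非论文中的压缩优化问题本身)。

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.95) -> np.ndarray:
    """将幅值最小的 sparsity 比例权重置零(全局阈值,演示用)。"""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    local_models = [rng.normal(size=256) for _ in range(4)]  # 模拟 4 个客户端上传的模型
    fused = np.mean(local_models, axis=0)                    # 服务器端融合(简单平均)
    compressed = magnitude_prune(fused)
    print("nonzero ratio:", np.count_nonzero(compressed) / compressed.size)
```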
[AI-8] Energy Score-based Pseudo-Label Filtering and Adaptive Loss for Imbalanced Semi-supervised SAR target recognition
链接: https://arxiv.org/abs/2411.03959
作者: Xinzheng Zhang,Yuqing Luo,Guopeng Li
关键词-EN: synthetic aperture radar, Automatic target recognition, semi-supervised SAR ATR, SAR ATR, SAR ATR technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Automatic target recognition (ATR) is an important use case for synthetic aperture radar (SAR) image interpretation. Recent years have seen significant advancements in SAR ATR technology based on semi-supervised learning. However, existing semi-supervised SAR ATR algorithms show low recognition accuracy in the case of class imbalance. This work offers a non-balanced semi-supervised SAR target recognition approach using dynamic energy scores and adaptive loss. First, an energy score-based method is developed to dynamically select unlabeled samples near to the training distribution as pseudo-labels during training, assuring pseudo-label reliability in long-tailed distribution circumstances. Secondly, loss functions suitable for class imbalances are proposed, including adaptive margin perception loss and adaptive hard triplet loss, the former offsets inter-class confusion of classifiers, alleviating the imbalance issue inherent in pseudo-label generation. The latter effectively tackles the model’s preference for the majority class by focusing on complex difficult samples during training. Experimental results on extremely imbalanced SAR datasets demonstrate that the proposed method performs well under the dual constraints of scarce labels and data imbalance, effectively overcoming the model bias caused by data imbalance and achieving high-precision target recognition.
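能量分数的常见定义是 E(x) = -T·logsumexp(logits/T):能量越低,样本越接近训练分布。基于能量的伪标签筛选即只保留能量低于阈值的未标注样本。以下示意假设已得到分类器 logits,阈值与数据均为演示取值;论文中的阈值是随训练动态选取的。

```python
import numpy as np

def energy_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """E(x) = -T * logsumexp(logits / T);能量越低越像分布内(可信)样本。"""
    z = logits / T
    m = z.max(axis=1, keepdims=True)  # 数值稳定的 logsumexp
    return -T * (m.squeeze(1) + np.log(np.exp(z - m).sum(axis=1)))

def select_pseudo_labels(logits: np.ndarray, threshold: float):
    keep = energy_score(logits) < threshold        # 只保留低能量样本
    return np.argmax(logits[keep], axis=1), keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    confident = rng.normal(size=(5, 10)); confident[:, 0] += 6.0  # 高置信样本
    uncertain = rng.normal(size=(5, 10))                          # 低置信样本
    logits = np.vstack([confident, uncertain])
    labels, keep = select_pseudo_labels(logits, threshold=-5.0)
    print("kept mask:", keep.astype(int))
    print("pseudo labels:", labels)
```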
[AI-9] Fine-Grained Guidance for Retrievers: Leveraging LLM s Feedback in Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2411.03957
作者: Yuhang Liu,Xueyu Hu,Shengyu Zhang,Jingyuan Chen,Fan Wu,Fei Wu
关键词-EN: mitigating hallucination issues, hallucination issues inherent, Retrieval-Augmented Generation, large language models, mitigating hallucination
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 13 pages, 4 figures
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has proven to be an effective method for mitigating hallucination issues inherent in large language models (LLMs). Previous approaches typically train retrievers based on semantic similarity, lacking optimization for RAG. More recent works have proposed aligning retrievers with the preference signals of LLMs. However, these preference signals are often difficult for dense retrievers, which typically have weaker language capabilities, to understand and learn effectively. Drawing inspiration from pedagogical theories like Guided Discovery Learning, we propose a novel framework, FiGRet (Fine-grained Guidance for Retrievers), which leverages the language capabilities of LLMs to construct examples from a more granular, information-centric perspective to guide the learning of retrievers. Specifically, our method utilizes LLMs to construct easy-to-understand examples from samples where the retriever performs poorly, focusing on three learning objectives highly relevant to the RAG scenario: relevance, comprehensiveness, and purity. These examples serve as scaffolding to ultimately align the retriever with the LLM’s preferences. Furthermore, we employ a dual curriculum learning strategy and leverage the reciprocal feedback between LLM and retriever to further enhance the performance of the RAG system. A series of experiments demonstrate that our proposed framework enhances the performance of RAG systems equipped with different retrievers and is applicable to various LLMs.
[AI-10] Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks
链接: https://arxiv.org/abs/2411.03948
作者: Felipe Marra,Lucas N. Ferreira
关键词-EN: Tabletop Role-Playing Games, Role-Playing Games, Tabletop Role-Playing, Large Language Models, producing long-form music
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
*备注: Paper accepted at the LAMIR 2024 workshop
点击查看摘要
Abstract:This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
[AI-11] Can Custom Models Learn In-Context? An Exploration of Hybrid Architecture Performance on In-Context Learning Tasks
链接: https://arxiv.org/abs/2411.03945
作者: Ryan Campbell,Nelson Lojo,Kesava Viswanadha,Christoffer Grondal Tryggestad,Derrick Han Sun,Sriteja Vijapurapu,August Rolfsen,Anant Sahai
关键词-EN: task learning occurs, learning occurs, parameter updates, necessity of parameter, Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 16 figures
点击查看摘要
Abstract:In-Context Learning (ICL) is a phenomenon where task learning occurs through a prompt sequence without the necessity of parameter updates. ICL in Multi-Headed Attention (MHA) with absolute positional embedding has been the focus of more study than other sequence model varieties. We examine implications of architectural differences between GPT-2 and LLaMa as well as LlaMa and Mamba. We extend work done by Garg et al. (2022) and Park et al. (2024) to GPT-2/LLaMa hybrid and LLaMa/Mamba hybrid models - examining the interplay between sequence transformation blocks and regressive performance in-context. We note that certain architectural changes cause degraded training efficiency/ICL accuracy by converging to suboptimal predictors or converging slower. We also find certain hybrids showing optimistic performance improvements, informing potential future ICL-focused architecture modifications. Additionally, we propose the “ICL regression score”, a scalar metric describing a model’s whole performance on a specific task. Compute limitations impose restrictions on our architecture-space, training duration, number of training runs, function class complexity, and benchmark complexity. To foster reproducible and extensible research, we provide a typed, modular, and extensible Python package on which we run all experiments.
[AI-12] Fine-tuning – a Transfer Learning approach
链接: https://arxiv.org/abs/2411.03941
作者: Joseph Arul Raj,Linglong Qian,Zina Ibrahim
关键词-EN: Electronic Health Records, Health Records, Electronic Health, Secondary research, valuable resource
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Secondary research use of Electronic Health Records (EHRs) is often hampered by the abundance of missing data in this valuable resource. Missingness in EHRs occurs naturally as a result of the data recording practices during routine clinical care, but handling it is crucial to the precision of medical analysis and the decision-making that follows. The literature contains a variety of imputation methodologies based on deep neural networks. Those aim to overcome the dynamic, heterogeneous and multivariate missingness patterns of EHRs, which cannot be handled by classical and statistical imputation methods. However, all existing deep imputation methods rely on end-to-end pipelines that incorporate both imputation and downstream analyses, e.g. classification. This coupling makes it difficult to assess the quality of imputation and takes away the flexibility of re-using the imputer for a different task. Furthermore, most end-to-end deep architectures tend to use complex networks to perform the downstream task, in addition to the already sophisticated deep imputation network. We therefore ask whether the high performance reported in the literature is due to the imputer or the classifier, and whether, if an optimised state-of-the-art imputer is used, a simpler classifier can achieve comparable performance. This paper explores the development of a modular, deep learning-based imputation and classification pipeline, specifically built to leverage the capabilities of state-of-the-art imputation models for downstream classification tasks. Such a modular approach enables a) objective assessment of the quality of the imputer and classifier independently, and b) the exploration of the performance of simpler classification architectures using an optimised imputer.
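论文主张把插补器与分类器解耦为两个可独立评估的模块。下面用 scikit-learn 给出这一模块化思路的最小示意:以 IterativeImputer 代替文中的深度插补器,仅演示“先独立训练插补器、再接简单分类器”的接口形态。

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # 先生成标签
X[rng.random(X.shape) < 0.2] = np.nan     # 再模拟 EHR 中约 20% 的缺失

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

imputer = IterativeImputer(random_state=0).fit(X_tr)           # 模块一:插补器
clf = LogisticRegression().fit(imputer.transform(X_tr), y_tr)  # 模块二:简单分类器

# 解耦之后,可以固定其中一个模块、单独替换或评估另一个
print("accuracy:", clf.score(imputer.transform(X_te), y_te))
```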
[AI-13] OML: Open Monetizable and Loyal AI
链接: https://arxiv.org/abs/2411.03887
作者: Zerui Cheng,Edoardo Contente,Ben Finch,Oleg Golev,Jonathan Hayase,Andrew Miller,Niusha Moshrefi,Anshul Nasery,Sandeep Nailwal,Sewoong Oh,Himanshu Tyagi,Pramod Viswanath
关键词-EN: Artificial General Intelligence, create Artificial General, Artificial Intelligence, General Intelligence, Artificial General
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 60 pages, 22 figures
点击查看摘要
Abstract:Artificial Intelligence (AI) has steadily improved across a wide range of tasks. However, the development and deployment of AI are almost entirely controlled by a few powerful organizations that are racing to create Artificial General Intelligence (AGI). The centralized entities make decisions with little public oversight, shaping the future of humanity, often with unforeseen consequences. In this paper, we propose OML, which stands for Open, Monetizable, and Loyal AI, an approach designed to democratize AI development. OML is realized through an interdisciplinary framework spanning AI, blockchain, and cryptography. We present several ideas for constructing OML using technologies such as Trusted Execution Environments (TEE), traditional cryptographic primitives like fully homomorphic encryption and functional encryption, obfuscation, and AI-native solutions rooted in the sample complexity and intrinsic hardness of AI tasks. A key innovation of our work is introducing a new scientific field: AI-native cryptography. Unlike conventional cryptography, which focuses on discrete data and binary security guarantees, AI-native cryptography exploits the continuous nature of AI data representations and their low-dimensional manifolds, focusing on improving approximate performance. One core idea is to transform AI attack methods, such as data poisoning, into security tools. This novel approach serves as a foundation for OML 1.0 which uses model fingerprinting to protect the integrity and ownership of AI models. The spirit of OML is to establish a decentralized, open, and transparent platform for AI development, enabling the community to contribute, monetize, and take ownership of AI models. By decentralizing control and ensuring transparency through blockchain technology, OML prevents the concentration of power and provides accountability in AI development that has not been possible before.
[AI-14] Disability data futures: Achievable imaginaries for AI and disability data justice
链接: https://arxiv.org/abs/2411.03885
作者: Denis Newman-Griffis,Bonnielin Swenor,Rupa Valdez,Gillian Mason
关键词-EN: mediating between people, individuals’ identities, filtered in contemporary, contemporary states, increasingly the layer
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Data are the medium through which individuals’ identities and experiences are filtered in contemporary states and systems, and AI is increasingly the layer mediating between people, data, and decisions. The history of data and AI is often one of disability exclusion, oppression, and the reduction of disabled experience; left unchallenged, the current proliferation of AI and data systems thus risks further automating ableism behind the veneer of algorithmic neutrality. However, exclusionary histories do not preclude inclusive futures, and disability-led visions can chart new paths for collective action to achieve futures founded in disability justice. This chapter brings together four academics and disability advocates working at the nexus of disability, data, and AI, to describe achievable imaginaries for artificial intelligence and disability data justice. Reflecting diverse contexts, disciplinary perspectives, and personal experiences, we draw out the shape, actors, and goals of imagined future systems where data and AI support movement towards disability justice.
[AI-15] AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making NEURIPS
链接: https://arxiv.org/abs/2411.03865
作者: Yizhe Huang,Xingbo Wang,Hao Liu,Fanqi Kong,Aoyang Qin,Min Tang,Xiaoxi Wang,Song-Chun Zhu,Mingjie Bi,Siyuan Qi,Xue Feng
关键词-EN: Traditional interactive environments, Traditional interactive, interactive environments limit, environments limit agents’, limit agents’ intelligence
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Accepted at NeurIPS DB 2024
点击查看摘要
Abstract:Traditional interactive environments limit agents’ intelligence growth with fixed tasks. Recently, single-agent environments address this by generating new tasks based on agent actions, enhancing task diversity. We consider the decision-making problem in multi-agent settings, where tasks are further influenced by social connections, affecting rewards and information access. However, existing multi-agent environments lack a combination of adaptive physical surroundings and social connections, hindering the learning of intelligent behaviors. To address this, we introduce AdaSociety, a customizable multi-agent environment featuring expanding state and action spaces, alongside explicit and alterable social structures. As agents progress, the environment adaptively generates new tasks with social structures for agents to undertake. In AdaSociety, we develop three mini-games showcasing distinct social structures and tasks. Initial results demonstrate that specific social structures can promote both individual and collective benefits, though current reinforcement learning and LLM-based algorithms show limited effectiveness in leveraging social structures to enhance performance. Overall, AdaSociety serves as a valuable research platform for exploring intelligence in diverse physical and social settings. The code is available at this https URL.
[AI-16] ROBIN: Robust and Invisible Watermarks for Diffusion Models with Adversarial Optimization NEURIPS2024
链接: https://arxiv.org/abs/2411.03862
作者: Huayang Huang,Yu Wu,Qian Wang
关键词-EN: generative content serves, Watermarking generative content, ownership protection, tool for authentication, potential misuse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Accept to NeurIPS 2024
点击查看摘要
Abstract:Watermarking generative content serves as a vital tool for authentication, ownership protection, and mitigation of potential misuse. Existing watermarking methods face the challenge of balancing robustness and concealment. They empirically inject a watermark that is both invisible and robust and passively achieve concealment by limiting the strength of the watermark, thus reducing the robustness. In this paper, we propose to explicitly introduce a watermark hiding process to actively achieve concealment, thus allowing the embedding of stronger watermarks. To be specific, we implant a robust watermark in an intermediate diffusion state and then guide the model to hide the watermark in the final generated image. We employ an adversarial optimization algorithm to produce the optimal hiding prompt guiding signal for each watermark. The prompt embedding is optimized to minimize artifacts in the generated image, while the watermark is optimized to achieve maximum strength. The watermark can be verified by reversing the generation process. Experiments on various diffusion models demonstrate the watermark remains verifiable even under significant image tampering and shows superior invisibility compared to other state-of-the-art robust watermarking methods.
[AI-17] UniTraj: Universal Human Trajectory Modeling from Billion-Scale Worldwide Traces
链接: https://arxiv.org/abs/2411.03859
作者: Yuanshao Zhu,James Jianqiao Yu,Xiangyu Zhao,Xuetao Wei,Yuxuan Liang
关键词-EN: deciphering movement patterns, universal human trajectory, Human trajectory modeling, human trajectory foundation, supporting advanced applications
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
*备注:
点击查看摘要
Abstract:Human trajectory modeling is essential for deciphering movement patterns and supporting advanced applications across various domains. However, existing methods are often tailored to specific tasks and regions, resulting in limitations related to task specificity, regional dependency, and data quality sensitivity. Addressing these challenges requires a universal human trajectory foundation model capable of generalizing and scaling across diverse tasks and geographic contexts. To this end, we propose UniTraj, a Universal human Trajectory foundation model that is task-adaptive, region-independent, and highly generalizable. To further enhance performance, we construct WorldTrace, the first large-scale, high-quality, globally distributed dataset sourced from open web platforms, encompassing 2.45 million trajectories with billions of points across 70 countries. Through multiple resampling and masking strategies designed for pre-training, UniTraj effectively overcomes geographic and task constraints, adapting to heterogeneous data quality. Extensive experiments across multiple trajectory analysis tasks and real-world datasets demonstrate that UniTraj consistently outperforms existing approaches in terms of scalability and adaptability. These results underscore the potential of UniTraj as a versatile, robust solution for a wide range of trajectory analysis applications, with WorldTrace serving as an ideal but non-exclusive foundation for training.
[AI-18] A Novel Access Control and Privacy-Enhancing Approach for Models in Edge Computing
链接: https://arxiv.org/abs/2411.03847
作者: Peihao Li
关键词-EN: deep learning models, grown more acute, widespread adoption, increasing prevalence, prevalence of deep
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the widespread adoption of edge computing technologies and the increasing prevalence of deep learning models in these environments, the security risks and privacy threats to models and data have grown more acute. Attackers can exploit various techniques to illegally obtain models or misuse data, leading to serious issues such as intellectual property infringement and privacy breaches. Existing model access control technologies primarily rely on traditional encryption and authentication methods; however, these approaches exhibit significant limitations in terms of flexibility and adaptability in dynamic environments. Although there have been advancements in model watermarking techniques for marking model ownership, they remain limited in their ability to proactively protect intellectual property and prevent unauthorized access. To address these challenges, we propose a novel model access control method tailored for edge computing environments. This method leverages image style as a licensing mechanism, embedding style recognition into the model’s operational framework to enable intrinsic access control. Consequently, models deployed on edge platforms are designed to correctly infer only on license data with specific style, rendering them ineffective on any other data. By restricting the input data to the edge model, this approach not only prevents attackers from gaining unauthorized access to the model but also enhances the privacy of data on terminal devices. We conducted extensive experiments on benchmark datasets, including MNIST, CIFAR-10, and FACESCRUB, and the results demonstrate that our method effectively prevents unauthorized access to the model while maintaining accuracy. Additionally, the model shows strong resistance against attacks such as forged licenses and fine-tuning. These results underscore the method’s usability, security, and robustness.
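“以图像风格作为许可证”的思路可以概括为:在推理入口加一个风格识别门控,只有输入呈现授权风格时才执行正常推理,否则拒绝。以下为示意性骨架,风格判别与许可区间均用简单统计量占位,并非论文的实际实现。

```python
import numpy as np

def style_score(image: np.ndarray) -> float:
    """占位:真实系统中是嵌入模型内部的风格识别分支,
    这里用像素标准差模拟一个“风格签名”。"""
    return float(image.std())

LICENSE_RANGE = (0.9, 1.1)   # 假设:授权风格对应的签名区间

def gated_inference(image: np.ndarray) -> str:
    lo, hi = LICENSE_RANGE
    if not (lo <= style_score(image) <= hi):
        return "拒绝:输入不具有授权风格"          # 模型对非许可数据“失效”
    return f"预测类别 {int(image.sum()) % 10}"    # 占位:正常推理

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    licensed = rng.normal(0.0, 1.0, size=(28, 28))    # 风格签名约为 1,落入许可区间
    unlicensed = rng.normal(0.0, 3.0, size=(28, 28))  # 风格签名约为 3,不符
    print(gated_inference(licensed))
    print(gated_inference(unlicensed))
```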
[AI-19] Reconsidering the Performance of GAE in Link Prediction
链接: https://arxiv.org/abs/2411.03845
作者: Weishuo Ma,Yanbo Wang,Xiyuan Wang,Muhan Zhang
关键词-EN: link prediction tasks, advanced training techniques, graph neural networks, neural networks, prediction tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Various graph neural networks (GNNs) with advanced training techniques and model designs have been proposed for link prediction tasks. However, outdated baseline models may lead to an overestimation of the benefits provided by these novel approaches. To address this, we systematically investigate the potential of Graph Autoencoders (GAE) by meticulously tuning hyperparameters and utilizing the trick of orthogonal embedding and linear propagation. Our findings reveal that a well-optimized GAE can match the performance of more complex models while offering greater computational efficiency.
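GAE 做链路预测的核心打分就是节点嵌入的内积:score(u, v) = σ(z_u · z_v)。下面用 NumPy 写出“正交嵌入初始化 + 线性传播 + 内积解码”的最小示意,呼应文中提到的两个技巧;图结构、传播步数等均为演示取值。

```python
import numpy as np

def linear_propagate(A: np.ndarray, Z: np.ndarray, steps: int = 2) -> np.ndarray:
    """用对称归一化邻接矩阵做无参数的线性传播(相当于去掉非线性的 GCN 层)。"""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt
    for _ in range(steps):
        Z = A_hat @ Z
    return Z

def link_score(Z: np.ndarray, u: int, v: int) -> float:
    return float(1.0 / (1.0 + np.exp(-Z[u] @ Z[v])))   # sigmoid(z_u · z_v)

if __name__ == "__main__":
    # 玩具图:两个三角形,由一条边 (2, 3) 相连
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
        A[i, j] = A[j, i] = 1.0
    Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(6, 4)))  # 正交嵌入初始化
    Z = linear_propagate(A, Q)
    print("同簇边 (0,1) 得分:", round(link_score(Z, 0, 1), 3))
    print("跨簇对 (0,5) 得分:", round(link_score(Z, 0, 5), 3))
```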
[AI-20] Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
链接: https://arxiv.org/abs/2411.03823
作者: Dingjie Song,Sicheng Lai,Shunian Chen,Lichao Sun,Benyou Wang
关键词-EN: demonstrated superior performance, large language models, rapid progression, demonstrated superior, language models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs due to their various modalities and multiple training phases. In this study, we introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs. Our experimental results indicate that MM-Detect is sensitive to varying degrees of contamination and can highlight significant performance improvements due to leakage of the training set of multimodal benchmarks. Furthermore, we also explore the possibility of contamination originating from the pre-training phase of LLMs used by MLLMs and the fine-tuning phase of MLLMs, offering new insights into the stages at which contamination may be introduced.
[AI-21] Beyond The Rainbow: High Performance Deep Reinforcement Learning On A Desktop PC ICLR
链接: https://arxiv.org/abs/2411.03820
作者: Tyler Clark,Mark Towers,Christine Evers,Jonathon Hare
关键词-EN: Rainbow Deep Q-Network, demonstrated combining multiple, combining multiple independent, multiple independent enhancements, Deep Q-Network
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 main pages, 26 total. Currently under review at ICLR
点击查看摘要
Abstract:Rainbow Deep Q-Network (DQN) demonstrated that combining multiple independent enhancements could significantly boost a reinforcement learning (RL) agent’s performance. In this paper, we present “Beyond The Rainbow” (BTR), a novel algorithm that integrates six improvements from across the RL literature to Rainbow DQN, establishing a new state-of-the-art for RL using a desktop PC, with a human-normalized interquartile mean (IQM) of 7.4 on atari-60. Beyond Atari, we demonstrate BTR’s capability to handle complex 3D games, successfully training agents to play Super Mario Galaxy, Mario Kart, and Mortal Kombat with minimal algorithmic changes. Designing BTR with computational efficiency in mind, agents can be trained using a desktop PC on 200 million Atari frames within 12 hours. Additionally, we conduct detailed ablation studies of each component, analyzing the performance and impact using numerous measures.
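For reference, the human-normalized interquartile mean (IQM) reported above can be computed as the mean of the middle 50% of per-game scores; the scores in this sketch are synthetic.

```python
# Interquartile mean over per-game human-normalized scores.
import numpy as np

def iqm(scores):
    """Mean of the middle 50% (drop the bottom and top quartiles)."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    lo, hi = n // 4, n - n // 4
    return scores[lo:hi].mean()

human_normalized = np.random.default_rng(0).lognormal(1.0, 1.0, size=60)
print(iqm(human_normalized))
```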
[AI-22] GS2Pose: Two-stage 6D Object Pose Estimation Guided by Gaussian Splatting
链接: https://arxiv.org/abs/2411.03807
作者: Jilan Mei,Junbo Li,Cai Meng
关键词-EN: accurate and robust, paper proposes, method for accurate, requires segmented RGBD, segmented RGBD images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper proposes a new method for accurate and robust 6D pose estimation of novel objects, named GS2Pose. By introducing 3D Gaussian splatting, GS2Pose can utilize the reconstruction results without requiring a high-quality CAD model, which means it only requires segmented RGBD images as input. Specifically, GS2Pose employs a two-stage structure consisting of coarse estimation followed by refined estimation. In the coarse stage, a lightweight U-Net network with a polarization attention mechanism, called Pose-Net, is designed. By using the 3DGS model for supervised training, Pose-Net can generate NOCS images to compute a coarse pose. In the refinement stage, GS2Pose formulates a pose regression algorithm following the idea of reprojection or Bundle Adjustment (BA), referred to as GS-Refiner. By leveraging Lie algebra to extend 3DGS, GS-Refiner obtains a pose-differentiable rendering pipeline that refines the coarse pose by comparing the input images with the rendered images. GS-Refiner also selectively updates parameters in the 3DGS model to achieve environmental adaptation, thereby enhancing the algorithm’s robustness and flexibility under illumination variation, occlusion, and other challenging disruptive factors. GS2Pose was evaluated through experiments conducted on the LineMod dataset, where it was compared with similar algorithms, yielding highly competitive results. The code for GS2Pose will soon be released on GitHub.
[AI-23] Overcoming label shift in targeted federated learning
链接: https://arxiv.org/abs/2411.03799
作者: Edvin Listo Zec,Adam Breitholtz,Fredrik D. Johansson
关键词-EN: enables multiple actors, learning enables multiple, sharing private data, collaboratively train models, Federated learning enables
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Federated learning enables multiple actors to collaboratively train models without sharing private data. This unlocks the potential for scaling machine learning to diverse applications. Existing algorithms for this task are well-justified when clients and the intended target domain share the same distribution of features and labels, but this assumption is often violated in real-world scenarios. One common violation is label shift, where the label distributions differ across clients or between clients and the target domain, which can significantly degrade model performance. To address this problem, we propose FedPALS, a novel model aggregation scheme that adapts to label shifts by leveraging knowledge of the target label distribution at the central server. Our approach ensures unbiased updates under stochastic gradient descent, yielding robust generalization across clients with diverse, label-shifted data. Extensive experiments on image classification demonstrate that FedPALS consistently outperforms standard baselines by aligning model aggregation with the target domain. Our findings reveal that conventional federated learning methods suffer severely in cases of extreme client sparsity, highlighting the critical need for target-aware aggregation. FedPALS offers a principled and practical solution to mitigate label distribution mismatch, ensuring models trained in federated settings can generalize effectively to label-shifted target domains.
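A simplified sketch of the aggregation idea: choose client weights on the simplex so the weighted mixture of client label distributions matches the target label distribution known at the server. The actual FedPALS objective is more involved; this is only the core matching step, with illustrative names.

```python
# Label-shift-aware aggregation weights (simplified illustration).
import numpy as np
from scipy.optimize import minimize

def aggregation_weights(client_label_dists, target_dist):
    k = len(client_label_dists)
    P = np.stack(client_label_dists)          # shape (k, num_classes)

    def objective(w):
        # Distance between the weighted mixture and the target distribution.
        return np.sum((w @ P - target_dist) ** 2)

    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * k
    res = minimize(objective, np.full(k, 1.0 / k),
                   bounds=bounds, constraints=cons)
    return res.x

# Server-side aggregation would then be: theta = sum_i w_i * theta_i
```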
[AI-24] VQA2: Visual Question Answering for Video Quality Assessment
链接: https://arxiv.org/abs/2411.03795
作者: Ziheng Jia,Zicheng Zhang,Jiaying Qian,Haoning Wu,Wei Sun,Chunyi Li,Xiaohong Liu,Weisi Lin,Guangtao Zhai,Xiongkuo Min
关键词-EN: video-related computer vision, computer vision fields, Quality, large multi-modal models, visual question answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages 3 figures
点击查看摘要
Abstract:The advent and proliferation of large multi-modal models (LMMs) have introduced a new paradigm to video-related computer vision fields, including training and inference methods based on visual question answering (VQA). These methods enable models to handle multiple downstream tasks robustly. Video Quality Assessment (VQA), a classic field in low-level visual quality evaluation, originally focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now evolving towards more comprehensive visual quality understanding tasks. Visual question answering has significantly improved low-level visual evaluation within the image domain recently. However, related work is almost nonexistent in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset, the first visual question answering instruction dataset that focuses entirely on video quality assessment, and based on it, we propose the VQA2 series models. The VQA2 Instruction Dataset consists of three stages and covers various video types, containing 157,735 instruction question-answer pairs, including both manually annotated and synthetic data. We conduct extensive experiments on both video quality scoring and video quality understanding tasks. Results demonstrate that the VQA2 series models achieve state-of-the-art (SOTA) performance in quality scoring tasks, and their performance in visual quality question answering surpasses the renowned GPT-4o. Additionally, our final model, the VQA2-Assistant, performs well across both scoring and question-answering tasks, validating its versatility.
[AI-25] Navigating the landscape of multimodal AI in medicine: a scoping review on technical challenges and clinical applications
链接: https://arxiv.org/abs/2411.03782
作者: Daan Schouten,Giulia Nicoletti,Bas Dille,Catherine Chia,Pierpaolo Vendittelli,Megan Schuurmans,Geert Litjens,Nadieh Khalili
关键词-EN: Recent technological advances, Recent technological, patient data quantity, quantity and diversity, technological advances
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 28 pages
点击查看摘要
Abstract:Recent technological advances in healthcare have led to unprecedented growth in patient data quantity and diversity. While artificial intelligence (AI) models have shown promising results in analyzing individual data modalities, there is increasing recognition that models integrating multiple complementary data sources, so-called multimodal AI, could enhance clinical decision-making. This scoping review examines the landscape of deep learning-based multimodal AI applications across the medical domain, analyzing 432 papers published between 2018 and 2024. We provide an extensive overview of multimodal AI development across different medical disciplines, examining various architectural approaches, fusion strategies, and common application areas. Our analysis reveals that multimodal AI models consistently outperform their unimodal counterparts, with an average improvement of 6.2 percentage points in AUC. However, several challenges persist, including cross-departmental coordination, heterogeneous data characteristics, and incomplete datasets. We critically assess the technical and practical challenges in developing multimodal AI systems and discuss potential strategies for their clinical implementation, including a brief overview of commercially available multimodal AI models for clinical decision-making. Additionally, we identify key factors driving multimodal AI development and propose recommendations to accelerate the field’s maturation. This review provides researchers and clinicians with a thorough understanding of the current state, challenges, and future directions of multimodal AI in medicine.
[AI-26] Content-Style Learning from Unaligned Domains: Identifiability under Unknown Latent Dimensions
链接: https://arxiv.org/abs/2411.03755
作者: Sagar Shrestha,Xiao Fu
关键词-EN: unaligned multi-domain data, content and style, data generation, Understanding identifiability, essential for tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Understanding identifiability of latent content and style variables from unaligned multi-domain data is essential for tasks such as domain translation and data generation. Existing works on content-style identification were often developed under somewhat stringent conditions, e.g., that all latent components are mutually independent and that the dimensions of the content and style variables are known. We introduce a new analytical framework via cross-domain latent distribution matching (LDM), which establishes content-style identifiability under substantially more relaxed conditions. Specifically, we show that restrictive assumptions such as component-wise independence of the latent variables can be removed. Most notably, we prove that prior knowledge of the content and style dimensions is not necessary for ensuring identifiability, if sparsity constraints are properly imposed onto the learned latent representations. Bypassing the knowledge of the exact latent dimension has been a longstanding aspiration in unsupervised representation learning – our analysis is the first to underpin its theoretical and practical viability. On the implementation side, we recast the LDM formulation into a regularized multi-domain GAN loss with coupled latent variables. We show that the reformulation is equivalent to LDM under mild conditions – yet requiring considerably less computational resource. Experiments corroborate our theoretical claims.
[AI-27] Optimal Defenses Against Gradient Reconstruction Attacks
链接: https://arxiv.org/abs/2411.03746
作者: Yuxiao Chen,Gamze Gürsoy,Qi Lei
关键词-EN: Federated Learning, centralized data storage, prevent data leakage, designed to prevent, collaborative model training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: The code for this project is available at this https URL
点击查看摘要
Abstract:Federated Learning (FL) is designed to prevent data leakage through collaborative model training without centralized data storage. However, it remains vulnerable to gradient reconstruction attacks that recover original training data from shared gradients. To optimize the trade-off between data leakage and utility loss, we first derive a theoretical lower bound of reconstruction error (among all attackers) for the two standard methods: adding noise, and gradient pruning. We then customize these two defenses to be parameter- and model-specific and achieve the optimal trade-off between our obtained reconstruction lower bound and model utility. Experimental results validate that our methods outperform Gradient Noise and Gradient Pruning by protecting the training data better while also achieving better utility.
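The two baseline defenses the paper optimizes look roughly like the sketch below: magnitude-based gradient pruning followed by Gaussian noise. The fixed `noise_std` and `prune_ratio` here are placeholders; the paper's contribution is choosing them per parameter to meet the derived reconstruction-error lower bound.

```python
# Basic forms of the two standard gradient defenses (placeholder schedule).
import torch

def defend_gradients(grads, noise_std=1e-2, prune_ratio=0.9):
    defended = []
    for g in grads:
        # Gradient pruning: zero out the smallest-magnitude entries.
        k = int(g.numel() * prune_ratio)
        if k > 0:
            thresh = g.abs().flatten().kthvalue(k).values
            g = torch.where(g.abs() > thresh, g, torch.zeros_like(g))
        # Gaussian noise addition before sharing the gradient.
        g = g + noise_std * torch.randn_like(g)
        defended.append(g)
    return defended
```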
[AI-28] Automating Exploratory Proteomics Research via Language Models
链接: https://arxiv.org/abs/2411.03743
作者: Ning Ding,Shang Qu,Linhai Xie,Yifei Li,Zaoqu Liu,Kaiyan Zhang,Yibai Xiong,Yuxin Zuo,Zhangren Chen,Ermo Hua,Xingtai Lv,Youbang Sun,Yang Li,Dong Li,Fuchu He,Bowen Zhou
关键词-EN: entire research processes, artificial intelligence, producing novel discoveries, development of artificial, contribution to science
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:With the development of artificial intelligence, its contribution to science is evolving from simulating a complex problem to automating entire research processes and producing novel discoveries. Achieving this advancement requires both specialized general models grounded in real-world scientific data and iterative, exploratory frameworks that mirror human scientific methodologies. In this paper, we present PROTEUS, a fully automated system for scientific discovery from raw proteomics data. PROTEUS uses large language models (LLMs) to perform hierarchical planning, execute specialized bioinformatics tools, and iteratively refine analysis workflows to generate high-quality scientific hypotheses. The system takes proteomics datasets as input and produces a comprehensive set of research objectives, analysis results, and novel biological hypotheses without human intervention. We evaluated PROTEUS on 12 proteomics datasets collected from various biological samples (e.g. immune cells, tumors) and different sample types (single-cell and bulk), generating 191 scientific hypotheses. These were assessed using both automatic LLM-based scoring on 5 metrics and detailed reviews from human experts. Results demonstrate that PROTEUS consistently produces reliable, logically coherent results that align well with existing literature while also proposing novel, evaluable hypotheses. The system’s flexible architecture facilitates seamless integration of diverse analysis tools and adaptation to different proteomics data types. By automating complex proteomics analysis workflows and hypothesis generation, PROTEUS has the potential to considerably accelerate the pace of scientific discovery in proteomics research, enabling researchers to efficiently explore large-scale datasets and uncover biological insights.
[AI-29] Adaptive Consensus Gradients Aggregation for Scaled Distributed Training
链接: https://arxiv.org/abs/2411.03742
作者: Yoni Choukroun,Shlomi Azoulay,Pavel Kisilev
关键词-EN: training large models, Distributed machine learning, vast datasets, critical paradigm, paradigm for training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints. While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question. In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. By formulating the aggregation problem as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients. We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation. Our method demonstrates improved performance over the ubiquitous gradient averaging on multiple MLPerf tasks while remaining extremely efficient in both communicational and computational complexity.
[AI-30] Relation Learning and Aggregate-attention for Multi-person Motion Prediction
链接: https://arxiv.org/abs/2411.03729
作者: Kehua Qu,Rui Ding,Jin Tang
关键词-EN: broad real-world applications, Multi-person motion prediction, real-world applications, motion prediction, emerging and intricate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Submitted to IEEE Transactions on Multimedia
点击查看摘要
Abstract:Multi-person motion prediction is an emerging and intricate task with broad real-world applications. Unlike single-person motion prediction, it considers not just the skeleton structures or human trajectories but also the interactions between people. Previous methods use various networks to achieve impressive predictions but often overlook that joint relations within an individual (intra-relations) and interactions among groups (inter-relations) are distinct types of representations. These methods often lack an explicit representation of inter- and intra-relations, and inevitably introduce undesired dependencies. To address this issue, we introduce a new collaborative framework for multi-person motion prediction that explicitly models these relations: a GCN-based network for intra-relations and a novel reasoning network for inter-relations. Then, we propose a novel plug-and-play aggregation module called the Interaction Aggregation Module (IAM), which employs an aggregate-attention mechanism to seamlessly integrate these relations. Experiments indicate that the module can also be applied to other dual-path models. Extensive experiments on 3DPW, 3DPW-RC, CMU-Mocap, and MuPoTS-3D, as well as the synthesized datasets Mix1 and Mix2 (9 to 15 persons), demonstrate that our method achieves state-of-the-art performance.
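A plausible shape for an aggregate-attention fusion block of this kind is sketched below; the dimensions, the use of multi-head attention, and the concatenate-then-project merge are assumptions for illustration, not the paper's exact IAM.

```python
# Fusion block in the spirit of the Interaction Aggregation Module.
import torch
import torch.nn as nn

class AggregateAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, intra_feat, inter_feat):
        # intra_feat, inter_feat: (batch, tokens, dim)
        # Intra-relation features attend to inter-relation features.
        attended, _ = self.attn(query=intra_feat, key=inter_feat,
                                value=inter_feat)
        return self.proj(torch.cat([intra_feat, attended], dim=-1))

fused = AggregateAttention(dim=128)(torch.randn(2, 45, 128),
                                    torch.randn(2, 45, 128))
```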
[AI-31] PropNEAT – Efficient GPU-Compatible Backpropagation over NeuroEvolutionary Augmenting Topology Networks
链接: https://arxiv.org/abs/2411.03726
作者: Michael Merry,Patricia Riddle,Jim Warren
关键词-EN: genomes whilst enabling, NEAT genomes whilst, Machine Learning Benchmarks, Penn Machine Learning, whilst enabling efficient
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:We introduce PropNEAT, a fast backpropagation implementation of NEAT that uses a bidirectional mapping of the genome graph to a layer-based architecture that preserves the NEAT genomes whilst enabling efficient GPU backpropagation. We test PropNEAT on 58 binary classification datasets from the Penn Machine Learning Benchmarks database, comparing the performance against logistic regression, dense neural networks and random forests, as well as a densely retrained variant of the final PropNEAT model. PropNEAT had the second best overall performance, behind Random Forest, though the difference between the models was not statistically significant apart from between Random Forest in comparison with logistic regression and the PropNEAT retrain models. PropNEAT was substantially faster than a naive backpropagation method, and both were substantially faster and had better performance than the original NEAT implementation. We demonstrate that the per-epoch training time for PropNEAT scales linearly with network depth, and is efficient on GPU implementations for backpropagation. This implementation could be extended to support reinforcement learning or convolutional networks, and is able to find sparser and smaller networks with potential for applications in low-power contexts.
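One way to read the genome-to-layers mapping: assign each node in the NEAT genome DAG a topological depth, so that all nodes at the same depth can be evaluated as one batched layer on the GPU. The sketch below shows only this layering step, under that assumption; PropNEAT's actual bidirectional mapping keeps more bookkeeping to preserve the genome.

```python
# Group a DAG genome into depth-ordered layers for batched evaluation.
from collections import defaultdict

def layerize(nodes, edges):
    """nodes: iterable of ids; edges: list of (src, dst) in a DAG."""
    preds = defaultdict(list)
    for s, d in edges:
        preds[d].append(s)

    depth = {}
    def node_depth(n):
        # Depth = longest path from any input node (depth 0).
        if n not in depth:
            depth[n] = 1 + max((node_depth(p) for p in preds[n]), default=-1)
        return depth[n]

    layers = defaultdict(list)
    for n in nodes:
        layers[node_depth(n)].append(n)
    return [layers[d] for d in sorted(layers)]

print(layerize([0, 1, 2, 3], [(0, 2), (1, 2), (2, 3), (0, 3)]))
# -> [[0, 1], [2], [3]]
```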
[AI-32] AutoGameUI: Constructing High-Fidelity Game UIs via Multimodal Learning and Interactive Web-Based Tool
链接: https://arxiv.org/abs/2411.03709
作者: Zhongliang Tang,Mengchen Tan,Fei Xia,Qingrong Cheng,Hao Jiang,Yongxiang Zhang
关键词-EN: efficiently constructing cohesive, constructing cohesive user, efficiently constructing, cohesive user, constructing cohesive
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 27 pages
点击查看摘要
Abstract:We introduce an innovative system, AutoGameUI, for efficiently constructing cohesive user interfaces in game development. Our system is the first to address the coherence issue arising from integrating inconsistent UI and UX designs, typically leading to mismatches and inefficiencies. We propose a two-stage multimodal learning pipeline to obtain comprehensive representations of both UI and UX designs, and to establish their correspondences. Through the correspondences, a cohesive user interface is automatically constructed from pairwise designs. To achieve high-fidelity effects, we introduce a universal data protocol for precise design descriptions and cross-platform applications. We also develop an interactive web-based tool for game developers to facilitate the use of our system. We create a game UI dataset from actual game projects and combine it with a public dataset for training and evaluation. Our experimental results demonstrate the effectiveness of our system in maintaining coherence between the constructed interfaces and the original designs.
[AI-33] Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction
链接: https://arxiv.org/abs/2411.03707
作者: Muhammad Tayyab Khan,Lequn Chen,Ye Han Ng,Wenhe Feng,Nicholas Yew Jin Tan,Seung Ki Moon
关键词-EN: Geometric Dimensioning, Dimensioning and Tolerancing, defining acceptable variations, ensure component quality, plays a critical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Paper has been submitted to the 9th International Conference on Innovation in Artificial Intelligence (ICIAI 2025)
点击查看摘要
Abstract:Geometric Dimensioning and Tolerancing (GDT) plays a critical role in manufacturing by defining acceptable variations in part features to ensure component quality and functionality. However, extracting GDT information from 2D engineering drawings is a time-consuming and labor-intensive task, often relying on manual efforts or semi-automated tools. To address these challenges, this study proposes an automated and computationally efficient GDT extraction method by fine-tuning Florence-2, an open-source vision-language model (VLM). The model is trained on a dataset of 400 drawings with ground truth annotations provided by domain experts. For comparison, two state-of-the-art closed-source VLMs, GPT-4o and Claude-3.5-Sonnet, are evaluated on the same dataset. All models are assessed using precision, recall, F1-score, and hallucination metrics. Due to the computational cost and impracticality of fine-tuning large closed-source VLMs for domain-specific tasks, GPT-4o and Claude-3.5-Sonnet are evaluated in a zero-shot setting. In contrast, Florence-2, a smaller model with 0.23 billion parameters, is optimized through full-parameter fine-tuning across three distinct experiments, each utilizing datasets augmented to different levels. The results show that Florence-2 achieves a 29.95% increase in precision, a 37.75% increase in recall, a 52.40% improvement in F1-score, and a 43.15% reduction in hallucination rate compared to the best-performing closed-source model. These findings highlight the effectiveness of fine-tuning smaller, open-source VLMs like Florence-2, offering a practical and efficient solution for automated GDT extraction to support downstream manufacturing tasks.
[AI-34] Beyond Model Adaptation at Test Time: A Survey
链接: https://arxiv.org/abs/2411.03687
作者: Zehao Xiao,Cees G. M. Snoek
关键词-EN: Machine learning algorithms, achieved remarkable success, Machine learning, Test-time adaptation, achieved remarkable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Machine learning algorithms have achieved remarkable success across various disciplines, use cases and applications, under the prevailing assumption that training and test samples are drawn from the same distribution. Consequently, these algorithms struggle and become brittle even when samples in the test distribution start to deviate from the ones observed during training. Domain adaptation and domain generalization have been studied extensively as approaches to address distribution shifts across test and train domains, but each has its limitations. Test-time adaptation, a recently emerging learning paradigm, combines the benefits of domain adaptation and domain generalization by training models only on source data and adapting them to target data during test-time inference. In this survey, we provide a comprehensive and systematic review on test-time adaptation, covering more than 400 recent papers. We structure our review by categorizing existing methods into five distinct categories based on what component of the method is adjusted for test-time adaptation: the model, the inference, the normalization, the sample, or the prompt, providing detailed analysis of each. We further discuss the various preparation and adaptation settings for methods within these categories, offering deeper insights into the effective deployment for the evaluation of distribution shifts and their real-world application in understanding images, video and 3D, as well as modalities beyond vision. We close the survey with an outlook on emerging research opportunities for test-time adaptation.
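As a concrete instance of the "normalization" category surveyed above, the sketch below re-estimates BatchNorm statistics from the test batch instead of using training-time running statistics, one of the simplest test-time adaptation recipes.

```python
# Test-time normalization adaptation: use current-batch BatchNorm stats.
import torch

def adapt_batchnorm(model):
    """Switch BatchNorm layers to use the statistics of the test batch."""
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.train()                    # compute stats from the batch
            m.track_running_stats = False
            m.running_mean = None
            m.running_var = None
    return model

# At inference: model = adapt_batchnorm(model); logits = model(test_batch)
```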
[AI-35] Towards 3D Semantic Scene Completion for Autonomous Driving: A Meta-Learning Framework Empowered by Deformable Large-Kernel Attention and Mamba Model
链接: https://arxiv.org/abs/2411.03672
作者: Yansong Qu,Zilin Huang,Zihao Sheng,Tiantian Chen,Sikai Chen
关键词-EN: Semantic scene completion, autonomous driving systems, scene completion, driving systems, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Semantic scene completion (SSC) is essential for achieving comprehensive perception in autonomous driving systems. However, existing SSC methods often overlook the high deployment costs in real-world applications. Traditional architectures, such as 3D Convolutional Neural Networks (3D CNNs) and self-attention mechanisms, face challenges in efficiently capturing long-range dependencies within 3D voxel grids, limiting their effectiveness. To address these issues, we introduce MetaSSC, a novel meta-learning-based framework for SSC that leverages deformable convolution, large-kernel attention, and the Mamba (D-LKA-M) model. Our approach begins with a voxel-based semantic segmentation (SS) pretraining task, aimed at exploring the semantics and geometry of incomplete regions while acquiring transferable meta-knowledge. Using simulated cooperative perception datasets, we supervise the perception training of a single vehicle using aggregated sensor data from multiple nearby connected autonomous vehicles (CAVs), generating richer and more comprehensive labels. This meta-knowledge is then adapted to the target domain through a dual-phase training strategy that does not add extra model parameters, enabling efficient deployment. To further enhance the model’s capability in capturing long-sequence relationships within 3D voxel grids, we integrate Mamba blocks with deformable convolution and large-kernel attention into the backbone network. Extensive experiments demonstrate that MetaSSC achieves state-of-the-art performance, significantly outperforming competing models while also reducing deployment costs.
[AI-36] Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? NEURIPS-2024
链接: https://arxiv.org/abs/2411.03670
作者: Pedro R. A. S. Bassi,Wenxuan Li,Yucheng Tang,Fabian Isensee,Zifu Wang,Jieneng Chen,Yu-Cheng Chou,Yannick Kirchhoff,Maximilian Rokuss,Ziyan Huang,Jin Ye,Junjun He,Tassilo Wald,Constantin Ulrich,Michael Baumgartner,Saikat Roy,Klaus H. Maier-Hein,Paul Jaeger,Yiwen Ye,Yutong Xie,Jianpeng Zhang,Ziyang Chen,Yong Xia,Zhaohu Xing,Lei Zhu,Yousef Sadegheih,Afshin Bozorgpour,Pratibha Kumari,Reza Azad,Dorit Merhof,Pengcheng Shi,Ting Ma,Yuxin Du,Fan Bai,Tiejun Huang,Bo Zhao,Haonan Wang,Xiaomeng Li,Hanxue Gu,Haoyu Dong,Jichen Yang,Maciej A. Mazurowski,Saumya Gupta,Linshan Wu,Jiaxin Zhuang,Hao Chen,Holger Roth,Daguang Xu,Matthew B. Blaschko,Sergio Decherchi,Andrea Cavalli,Alan L. Yuille,Zongwei Zhou
关键词-EN: algorithms, test, benchmark, test sets
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to NeurIPS-2024
点击查看摘要
Abstract:How can we test AI performance? This question seems trivial, but it isn’t. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks–which, differing from algorithms, are more flexible and can support different algorithms–including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
[AI-37] Requirements Engineering for Older Adult Digital Health Software: A Systematic Literature Review ALT
链接: https://arxiv.org/abs/2411.03656
作者: Yuqing Xiao,John Grundy,Anuradha Madugalla
关键词-EN: older adult population, population has led, increasing interest, interest in technology-supported, Digital health
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: arxiv version of SLR on RE for Older Adult Digital Health Software
点击查看摘要
Abstract:Growth of the older adult population has led to increasing interest in technology-supported aged care. However, the area faces challenges such as a lack of caregivers and limited understanding of the emotional, social, physical, and mental well-being needs of seniors. Furthermore, there is a gap between developers and ageing people in how requirements are understood. Digital health can be important in supporting older adults’ wellbeing, emotional requirements, and social needs. Requirements Engineering (RE) is a major software engineering field that can help identify, elicit, and prioritize the requirements of stakeholders and ensure that systems meet standards for performance, reliability, and usability. We carried out a systematic review of the literature on RE for older adult digital health software, in order to give a representative picture of the current state of understanding of older adults’ needs in aged-care digital health. Using established guidelines outlined by the Kitchenham method, PRISMA, and the PICO guideline, we developed a protocol, followed by a systematic exploration of eight databases. This resulted in 69 primary studies of high relevance, which were subsequently subjected to data extraction, synthesis, and reporting. We highlight key RE processes in digital health software for ageing people, explore how technology is used for older users’ well-being and care, and examine how such solutions are evaluated. The review also identified key limitations in existing primary studies that inspire future research opportunities. The results indicate that requirement gathering and understanding vary significantly between studies; the differences lie in the quality, depth, and techniques adopted for requirement gathering, and are largely due to uneven adoption of RE methods.
[AI-38] Policy Aggregation
链接: https://arxiv.org/abs/2411.03651
作者: Parand A. Alamdari,Soroush Ebadian,Ariel D. Procaccia
关键词-EN: Markov decision process, underlying Markov decision, underlying Markov, Markov decision, decision process
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the goal is to identify a desirable collective policy. We argue that an approach informed by social choice theory is especially suitable. Our key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this insight, we demonstrate that a variety of methods–including approval voting, Borda count, the proportional veto core, and quantile fairness–can be practically applied to policy aggregation.
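Of the listed rules, Borda count is the easiest to sketch: each agent ranks the candidate policies by its own expected return, and candidates collect points by rank. Note that the paper's actual insight replaces these ordinal ranks with volumes in the state-action occupancy polytope; the returns matrix below is made up for illustration.

```python
# Borda count over candidate policies.
import numpy as np

def borda_aggregate(returns):
    """returns[i, j] = expected return of candidate policy j for agent i."""
    n_agents, n_cands = returns.shape
    points = np.zeros(n_cands)
    for agent_returns in returns:
        ranking = np.argsort(-agent_returns)      # best candidate first
        for rank, cand in enumerate(ranking):
            points[cand] += n_cands - 1 - rank    # top rank earns most points
    return int(points.argmax()), points

returns = np.array([[3.0, 1.0, 2.0],
                    [0.5, 2.5, 2.0],
                    [1.0, 2.0, 3.0]])
print(borda_aggregate(returns))   # candidate 2 wins with 4 points
```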
[AI-39] Adaptive Stereo Depth Estimation with Multi-Spectral Images Across All Lighting Conditions
链接: https://arxiv.org/abs/2411.03638
作者: Zihan Qin,Jialei Xu,Wenbo Zhao,Junjun Jiang,Xianming Liu
关键词-EN: significant challenge, adverse conditions remains, remains a significant, Depth estimation, feature matching
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Depth estimation under adverse conditions remains a significant challenge. Recently, multi-spectral depth estimation, which integrates both visible light and thermal images, has shown promise in addressing this issue. However, existing algorithms struggle with precise pixel-level feature matching, limiting their ability to fully exploit geometric constraints across different spectra. To address this, we propose a novel framework incorporating stereo depth estimation to enforce accurate geometric constraints. In particular, we treat the visible light and thermal images as a stereo pair and utilize a Cross-modal Feature Matching (CFM) Module to construct a cost volume for pixel-level matching. To mitigate the effects of poor lighting on stereo matching, we introduce Degradation Masking, which leverages robust monocular thermal depth estimation in degraded regions. Our method achieves state-of-the-art (SOTA) performance on the Multi-Spectral Stereo (MS2) dataset, with qualitative evaluations demonstrating high-quality depth maps under varying lighting conditions.
[AI-40] RTify: Aligning Deep Neural Networks with Human Behavioral Decisions NEURIPS2024
链接: https://arxiv.org/abs/2411.03630
作者: Yu-Ang Cheng,Ivan Felipe Rodriguez,Sixuan Chen,Kohitij Kar,Takeo Watanabe,Thomas Serre
关键词-EN: perceptual decisions’ rich, neglecting perceptual decisions’, neural network, primate vision focus, decisions’ rich
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: Published at NeurIPS 2024
点击查看摘要
Abstract:Current neural network models of primate vision focus on replicating overall levels of behavioral accuracy, often neglecting perceptual decisions’ rich, dynamic nature. Here, we introduce a novel computational framework to model the dynamics of human behavioral choices by learning to align the temporal dynamics of a recurrent neural network (RNN) to human reaction times (RTs). We describe an approximation that allows us to constrain the number of time steps an RNN takes to solve a task with human RTs. The approach is extensively evaluated against various psychophysics experiments. We also show that the approximation can be used to optimize an “ideal-observer” RNN model to achieve an optimal tradeoff between speed and accuracy without human data. The resulting model is found to account well for human RT data. Finally, we use the approximation to train a deep learning implementation of the popular Wong-Wang decision-making model. The model is integrated with a convolutional neural network (CNN) model of visual processing and evaluated using both artificial and natural image stimuli. Overall, we present a novel framework that helps align current vision models with human behavior, bringing us closer to an integrated model of human vision.
[AI-41] StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
链接: https://arxiv.org/abs/2411.03628
作者: Junming Lin,Zheng Fang,Chi Chen,Zihao Wan,Fuwen Luo,Peng Li,Yang Liu,Maosong Sun
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks, featuring 900 videos and 4,500 human-curated QA pairs. Each video features five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs like Gemini 1.5 Pro and GPT-4o perform significantly below human-level streaming video understanding capabilities. We hope our work can facilitate further advancements for MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.
[AI-42] Fully Hyperbolic Rotation for Knowledge Graph Embedding
链接: https://arxiv.org/abs/2411.03622
作者: Qiuyu Liang,Weihua Wang,Feilong Bao,Guanglai Gao
关键词-EN: inherent hierarchies, Hyperbolic, hyperbolic space, model, Hyperbolic rotation
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Hyperbolic rotation is commonly used to effectively model knowledge graphs and their inherent hierarchies. However, existing hyperbolic rotation models rely on logarithmic and exponential mappings for feature transformation. These models only project data features into hyperbolic space for rotation, limiting their ability to fully exploit the hyperbolic space. To address this problem, we propose a novel fully hyperbolic model designed for knowledge graph embedding. Instead of feature mappings, we define the model directly in hyperbolic space with the Lorentz model. Our model considers each relation in knowledge graphs as a Lorentz rotation from the head entity to the tail entity. We adopt the Lorentzian distance as the scoring function for measuring the plausibility of triplets. Extensive results on standard knowledge graph completion benchmarks demonstrate that our model achieves competitive results with fewer parameters. In addition, our model achieves state-of-the-art performance on the CoDEx-s and CoDEx-m datasets, which are more diverse and challenging than earlier benchmarks. Our code is available at this https URL.
[AI-43] An Experimental Study on Decomposition-Based Deep Ensemble Learning for Traffic Flow Forecasting
链接: https://arxiv.org/abs/2411.03588
作者: Qiyuan Zhu,A. K. Qin,Hussein Dia,Adriana-Simona Mihaita,Hanna Grzybowska
关键词-EN: intelligent transport systems, transport systems, crucial task, task in intelligent, intelligent transport
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: This work has been accepted by the 2024 Australasian Joint Conference on Artificial Intelligence (AJCAI 2024)
点击查看摘要
Abstract:Traffic flow forecasting is a crucial task in intelligent transport systems. Deep learning offers an effective solution, capturing complex patterns in time-series traffic flow data to enable accurate prediction. However, deep learning models are prone to overfitting the intricate details of flow data, leading to poor generalisation. Recent studies suggest that decomposition-based deep ensemble learning methods may address this issue by breaking down a time series into multiple simpler signals, upon which deep learning models are built and ensembled to generate the final prediction. However, few studies have compared the performance of decomposition-based ensemble methods with that of non-decomposition-based ones, which directly utilise raw time-series data. This work compares several decomposition-based and non-decomposition-based deep ensemble learning methods. Experimental results on three traffic datasets demonstrate the superiority of decomposition-based ensemble methods, while also revealing their sensitivity to aggregation strategies and forecasting horizons.
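The decomposition-based ensemble template being compared looks roughly like this: decompose the series into simpler components, forecast each with its own model, and sum the forecasts. The moving-average split and naive per-component model below are stand-ins for the decompositions and deep models the study actually evaluates.

```python
# Skeleton of a decomposition-based ensemble forecast.
import numpy as np

def decompose(series, window=12):
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")
    return [trend, series - trend]             # [trend, residual]

def forecast(series, fit_predict, horizon=6):
    """fit_predict(component, horizon) -> np.ndarray of length horizon."""
    components = decompose(series)
    # Ensemble: sum the per-component forecasts.
    return sum(fit_predict(c, horizon) for c in components)

# Example with a naive per-component model (repeat the last value):
naive = lambda c, h: np.repeat(c[-1], h)
print(forecast(np.sin(np.arange(100) / 5.0), naive))
```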
[AI-44] Hybrid Attention for Robust RGB-T Pedestrian Detection in Real-World Conditions
链接: https://arxiv.org/abs/2411.03576
作者: Arunkumar Rathinam,Leo Pauly,Abd El Rahman Shabayek,Wassim Rharbaoui,Anis Kacem,Vincent Gaudillière,Djamila Aouada
关键词-EN: Multispectral pedestrian detection, Multispectral pedestrian, autonomous driving applications, gained significant attention, recent years
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted for publication in IEEE Robotics and Automation Letters, October 2024
点击查看摘要
Abstract:Multispectral pedestrian detection has gained significant attention in recent years, particularly in autonomous driving applications. To address the challenges posed by adversarial illumination conditions, the combination of thermal and visible images has demonstrated its advantages. However, existing fusion methods rely on the critical assumption that the RGB-Thermal (RGB-T) image pairs are fully overlapping. These assumptions often do not hold in real-world applications, where only partial overlap between images can occur due to sensors configuration. Moreover, sensor failure can cause loss of information in one modality. In this paper, we propose a novel module called the Hybrid Attention (HA) mechanism as our main contribution to mitigate performance degradation caused by partial overlap and sensor failure, i.e. when at least part of the scene is acquired by only one sensor. We propose an improved RGB-T fusion algorithm, robust against partial overlap and sensor failure encountered during inference in real-world applications. We also leverage a mobile-friendly backbone to cope with resource constraints in embedded systems. We conducted experiments by simulating various partial overlap and sensor failure scenarios to evaluate the performance of our proposed method. The results demonstrate that our approach outperforms state-of-the-art methods, showcasing its superiority in handling real-world challenges.
[AI-45] Towards Personalized Federated Learning via Comprehensive Knowledge Distillation
链接: https://arxiv.org/abs/2411.03569
作者: Pengju Wang,Bochao Liu,Weijia Guo,Yong Li,Shiming Ge
关键词-EN: protect data privacy, distributed machine learning, machine learning paradigm, learning paradigm designed, personalized federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE SMC 2024
点击查看摘要
Abstract:Federated learning is a distributed machine learning paradigm designed to protect data privacy. However, data heterogeneity across various clients results in catastrophic forgetting, where the model rapidly forgets previous knowledge while acquiring new knowledge. To address this challenge, personalized federated learning has emerged to customize a personalized model for each client. However, the inherent limitation of this mechanism is its excessive focus on personalization, potentially hindering the generalization of those models. In this paper, we present a novel personalized federated learning method that uses global and historical models as teachers and the local model as the student to facilitate comprehensive knowledge distillation. The historical model represents the local model from the last round of client training, containing historical personalized knowledge, while the global model represents the aggregated model from the last round of server aggregation, containing global generalized knowledge. By applying knowledge distillation, we effectively transfer global generalized knowledge and historical personalized knowledge to the local model, thus mitigating catastrophic forgetting and enhancing the general performance of personalized models. Extensive experimental results demonstrate the significant advantages of our method.
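The comprehensive distillation described above can be written as a single loss in which the local student matches the task labels plus soft targets from two teachers: the global model and the client's last-round model. The temperature and weights below are illustrative assumptions.

```python
# Dual-teacher knowledge distillation loss (illustrative weights).
import torch
import torch.nn.functional as F

def dual_teacher_kd_loss(student_logits, global_logits, hist_logits,
                         labels, T=2.0, alpha=0.5, beta=0.5):
    task = F.cross_entropy(student_logits, labels)
    # Soft-target KL term against a teacher's tempered distribution.
    soft = lambda t_logits: F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(t_logits / T, dim=1),
        reduction="batchmean") * T * T
    return task + alpha * soft(global_logits) + beta * soft(hist_logits)
```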
[AI-46] Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
链接: https://arxiv.org/abs/2411.03562
作者: Antoine Grosnit,Alexandre Maraval,James Doran,Giuseppe Paolo,Albert Thomas,Refinath Shahul Hameed Nabeezath Beevi,Jonas Gonzalez,Khyati Khandelwal,Ignacio Iacobacci,Abdelhakim Benechehab,Hamza Cherkaoui,Youssef Attia El-Hili,Kun Shao,Jianye Hao,Jun Yao,Balazs Kegl,Haitham Bou-Ammar,Jun Wang
关键词-EN: autonomous data science, data science tasks, diverse data science, science agent designed, data science
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learning from accumulated experience stored to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent’s capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0’s end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle’s progression system.
[AI-47] Two-Stage Pretraining for Molecular Property Prediction in the Wild
链接: https://arxiv.org/abs/2411.03537
作者: Kevin Tirta Wijaya,Minghao Guo,Michael Sun,Hans-Peter Seidel,Wojciech Matusik,Vahid Babaei
关键词-EN: Accurate property prediction, Accurate property, crucial for accelerating, accelerating the discovery, Accurate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:
点击查看摘要
Abstract:Accurate property prediction is crucial for accelerating the discovery of new molecules. Although deep learning models have achieved remarkable success, their performance often relies on large amounts of labeled data that are expensive and time-consuming to obtain. Thus, there is a growing need for models that can perform well with limited experimentally-validated data. In this work, we introduce MoleVers, a versatile pretrained model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated molecular property labels are scarce. MoleVers adopts a two-stage pretraining strategy. In the first stage, the model learns molecular representations from large unlabeled datasets via masked atom prediction and dynamic denoising, a novel task enabled by a new branching encoder architecture. In the second stage, MoleVers is further pretrained using auxiliary labels obtained with inexpensive computational methods, enabling supervised learning without the need for costly experimental data. This two-stage framework allows MoleVers to learn representations that generalize effectively across various downstream datasets. We evaluate MoleVers on a new benchmark comprising 22 molecular datasets with diverse types of properties, the majority of which contain 50 or fewer training labels reflecting real-world conditions. MoleVers achieves state-of-the-art results on 20 out of the 22 datasets, and ranks second among the remaining two, highlighting its ability to bridge the gap between data-hungry models and real-world conditions where practically-useful labels are scarce.
[AI-48] Personalized Video Summarization by Multimodal Video Understanding CIKM2024
链接: https://arxiv.org/abs/2411.03531
作者: Brian Chen,Xiangyuan Zhao,Yingnan Zhu
关键词-EN: Video summarization, Video, Video summarization techniques, summarization, video summarization models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: In Proceedings of CIKM 2024 Applied Research Track
点击查看摘要
Abstract:Video summarization techniques have been proven to improve the overall user experience when it comes to accessing and comprehending video content. If the user’s preference is known, video summarization can identify significant information or relevant content from an input video, aiding them in obtaining the necessary information or determining their interest in watching the original video. Adapting video summarization to various types of video and user preferences requires significant training data and expensive human labeling. To facilitate such research, we proposed a new benchmark for video summarization that captures various user preferences. Also, we present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization that is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large training dataset. The pipeline takes both video and closed captioning as input and performs semantic analysis at the scene level by converting video frames into text. Subsequently, the user’s genre preference is used as the basis for selecting the pertinent textual scenes. The experimental results demonstrate that our proposed pipeline outperforms current state-of-the-art unsupervised video summarization models. We show that our method is more adaptable across different datasets compared to supervised query-based video summarization models. In the end, the runtime analysis demonstrates that our pipeline is more suitable for practical use when scaling up the number of user preferences and videos.
[AI-49] AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
链接: https://arxiv.org/abs/2411.03519
作者: Zhiqiang Xie,Hao Kang,Ying Sheng,Tushar Krishna,Kayvon Fatahalian,Christos Kozyrakis
关键词-EN: large language model, perform complex tasks, advanced natural language, natural language understanding, exhibit emergent behaviors
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:With more advanced natural language understanding and reasoning capabilities, large language model (LLM)-powered agents are increasingly developed in simulated environments to perform complex tasks, interact with other agents, and exhibit emergent behaviors relevant to social science and gaming. However, current multi-agent simulations frequently suffer from inefficiencies due to the limited parallelism caused by false dependencies, resulting in performance bottlenecks. In this paper, we introduce AI Metropolis, a simulation engine that improves the efficiency of LLM agent simulations by incorporating out-of-order execution scheduling. By dynamically tracking real dependencies between agents, AI Metropolis minimizes false dependencies, enhancing parallelism and enabling efficient hardware utilization. Our evaluations demonstrate that AI Metropolis achieves speedups from 1.3x to 4.15x over standard parallel simulation with global synchronization, approaching optimal performance as the number of agents increases.
[AI-50] Watson: A Cognitive Observability Framework for the Reasoning of Foundation Model-Powered Agents
链接: https://arxiv.org/abs/2411.03455
作者: Benjamin Rombaut,Sogol Masoumzadeh,Kirill Vasilevski,Dayi Lin,Ahmed E. Hassan
关键词-EN: increasingly prominent role, introduce significant challenges, complex software systems, foundation models, play an increasingly
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:As foundation models (FMs) play an increasingly prominent role in complex software systems, such as FM-powered agentic software (i.e., Agentware), they introduce significant challenges for developers regarding observability. Unlike traditional software, agents operate autonomously, using extensive data and opaque implicit reasoning, making it difficult to observe and understand their behavior during runtime, especially when they take unexpected actions or encounter errors. In this paper, we highlight the limitations of traditional operational observability in the context of FM-powered software, and introduce cognitive observability as a new type of required observability that has emerged for such innovative systems. We then propose a novel framework that provides cognitive observability into the implicit reasoning processes of agents (a.k.a. reasoning observability), and demonstrate the effectiveness of our framework in boosting the debuggability of Agentware and, in turn, its capabilities, through a case study on AutoCodeRover, a cutting-edge Agentware for autonomous program improvement.
[AI-51] STEER: Flexible Robotic Manipulation via Dense Language Grounding
链接: https://arxiv.org/abs/2411.03409
作者: Laura Smith,Alex Irpan,Montserrat Gonzalez Arenas,Sean Kirmani,Dmitry Kalashnikov,Dhruv Shah,Ted Xiao
关键词-EN: real world demands, world demands robotic, demands robotic systems, real world, world demands
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project website: this https URL
点击查看摘要
Abstract:The complexity of the real world demands robotic systems that can intelligently adapt to unseen situations. We present STEER, a robot learning framework that bridges high-level, commonsense reasoning with precise, flexible low-level control. Our approach translates complex situational awareness into actionable low-level behavior through training language-grounded policies with dense annotation. By structuring policy training around fundamental, modular manipulation skills expressed in natural language, STEER exposes an expressive interface for humans or Vision-Language Models (VLMs) to intelligently orchestrate the robot’s behavior by reasoning about the task and context. Our experiments demonstrate the skills learned via STEER can be combined to synthesize novel behaviors to adapt to new situations or perform completely new tasks without additional data collection or training.
[AI-52] An Open API Architecture to Discover the Trustworthy Explanation of Cloud AI Services
链接: https://arxiv.org/abs/2411.03376
作者: Zerui Wang,Yan Liu,Jun Huang
关键词-EN: XAI consistency metrics, cloud AI services, XAI, cloud, XAI consistency
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: Published in: IEEE Transactions on Cloud Computing ( Volume: 12, Issue: 2, April-June 2024)
点击查看摘要
Abstract:This article presents the design of an open-API-based explainable AI (XAI) service to provide feature contribution explanations for cloud AI services. Cloud AI services are widely used to develop domain-specific applications with precise learning metrics. However, the underlying cloud AI services remain opaque on how the model produces the prediction. We argue that XAI operations are accessible as open APIs to enable the consolidation of the XAI operations into the cloud AI services assessment. We propose a design using a microservice architecture that offers feature contribution explanations for cloud AI services without unfolding the network structure of the cloud models. We can also utilize this architecture to evaluate the model performance and XAI consistency metrics showing cloud AI services trustworthiness. We collect provenance data from operational pipelines to enable reproducibility within the XAI service. Furthermore, we present the discovery scenarios for the experimental tests regarding model performance and XAI consistency metrics for the leading cloud vision AI services. The results confirm that the architecture, based on open APIs, is cloud-agnostic. Additionally, data augmentations result in measurable improvements in XAI consistency metrics for cloud AI services.
[AI-53] Enhanced Real-Time Threat Detection in 5G Networks: A Self-Attention RNN Autoencoder Approach for Spectral Intrusion Analysis
链接: https://arxiv.org/abs/2411.03365
作者: Mohammadreza Kouchaki,Minglong Zhang,Aly S. Abdalla,Guangchen Lan,Christopher G. Brinton,Vuk Marojevic
关键词-EN: safeguarding Radio Frequency, rapidly evolving landscape, Radio Frequency, Recurrent Neural Network, environments against sophisticated
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This article has been accepted for publication in WiOpt 2024
点击查看摘要
Abstract:In the rapidly evolving landscape of 5G technology, safeguarding Radio Frequency (RF) environments against sophisticated intrusions is paramount, especially in dynamic spectrum access and management. This paper presents an enhanced experimental model that integrates a self-attention mechanism with a Recurrent Neural Network (RNN)-based autoencoder for the detection of anomalous spectral activities in 5G networks at the waveform level. Our approach, grounded in time-series analysis, processes in-phase and quadrature (I/Q) samples to identify irregularities that could indicate potential jamming attacks. The model’s architecture, augmented with a self-attention layer, extends the capabilities of RNN autoencoders, enabling a more nuanced understanding of temporal dependencies and contextual relationships within the RF spectrum. Utilizing a simulated 5G Radio Access Network (RAN) test-bed constructed with srsRAN 5G and Software Defined Radios (SDRs), we generated a comprehensive stream of data that reflects real-world RF spectrum conditions and attack scenarios. The model is trained to reconstruct standard signal behavior, establishing a normative baseline against which deviations, indicative of security threats, are identified. The proposed architecture is designed to balance between detection precision and computational efficiency, so the LSTM network, enriched with self-attention, continues to optimize for minimal execution latency and power consumption. Conducted on a real-world SDR-based testbed, our results demonstrate the model’s improved performance and accuracy in threat detection. Keywords: self-attention, real-time intrusion detection, RNN autoencoder, Transformer architecture, LSTM, time series anomaly detection, 5G Security, spectrum access security.
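As a rough illustration of the described architecture, the PyTorch sketch below combines an LSTM encoder, a self-attention layer over the hidden states, and an LSTM decoder, scoring anomalies by per-sequence reconstruction error. All dimensions and the exact layering are assumptions, not the authors' published model.

```python
# Hedged sketch of an LSTM autoencoder with a self-attention layer for
# I/Q-sample anomaly detection; sizes and layout are illustrative.
import torch
import torch.nn as nn

class AttnLSTMAutoencoder(nn.Module):
    def __init__(self, in_dim=2, hidden=64, heads=4):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, in_dim)

    def forward(self, x):                  # x: (batch, time, 2) I/Q pairs
        h, _ = self.encoder(x)
        a, _ = self.attn(h, h, h)          # self-attention over time steps
        d, _ = self.decoder(a)
        return self.out(d)                 # reconstruction of the input

def anomaly_scores(model, x):
    # Per-sequence reconstruction error; large values flag possible jamming.
    with torch.no_grad():
        return (model(x) - x).pow(2).mean(dim=(1, 2))
```

In practice the detection threshold would be calibrated on reconstruction errors measured over known-clean traffic.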
[AI-54] Self-Calibrated Tuning of Vision-Language Models for Out-of-Distribution Detection NEURIPS2024
链接: https://arxiv.org/abs/2411.03359
作者: Geng Yu,Jianing Zhu,Jiangchao Yao,Bo Han
关键词-EN: deploying reliable machine, OOD detection, reliable machine learning, machine learning models, OOD
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: accepted by NeurIPS 2024
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is crucial for deploying reliable machine learning models in open-world applications. Recent advances in CLIP-based OOD detection have shown promising results via regularizing prompt tuning with OOD features extracted from ID data. However, the irrelevant context mined from ID data can be spurious due to the inaccurate foreground-background decomposition, thus limiting the OOD detection performance. In this work, we propose a novel framework, namely, Self-Calibrated Tuning (SCT), to mitigate this problem for effective OOD detection with only the given few-shot ID data. Specifically, SCT introduces modulating factors respectively on the two components of the original learning objective. It adaptively directs the optimization process between the two tasks during training on data with different prediction uncertainty to calibrate the influence of OOD regularization, which is compatible with many prompt tuning based OOD detection methods. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed SCT. The code is publicly available.
[AI-55] Enhancing Table Representations with LLM-powered Synthetic Data Generation
链接: https://arxiv.org/abs/2411.03356
作者: Dayu Yang,Natawut Monaikul,Amanda Ding,Bozhao Tan,Kishore Mosaliganti,Giri Iyengar
关键词-EN: improving table management, increasingly crucial, crucial for improving, accurate table-level representations, data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: the Thirty-Eighth Annual Conference on Neural Information Processing Systems Table Representation Workshop
点击查看摘要
Abstract:In the era of data-driven decision-making, accurate table-level representations and efficient table recommendation systems are becoming increasingly crucial for improving table management, discovery, and analysis. However, existing approaches to tabular data representation often face limitations, primarily due to their focus on cell-level tasks and the lack of high-quality training data. To address these challenges, we first formulate a clear definition of table similarity in the context of data transformation activities within data-driven enterprises. This definition serves as the foundation for synthetic data generation, which requires a well-defined data generation process. Building on this, we propose a novel synthetic data generation pipeline that harnesses the code generation and data manipulation capabilities of Large Language Models (LLMs) to create a large-scale synthetic dataset tailored for table-level representation learning. Through manual validation and performance comparisons on the table recommendation task, we demonstrate that the synthetic data generated by our pipeline aligns with our proposed definition of table similarity and significantly enhances table representations, leading to improved recommendation performance.
[AI-56] Exploring Feature Importance and Explainability Towards Enhanced ML-Based DoS Detection in AI Systems
链接: https://arxiv.org/abs/2411.03355
作者: Paul Badu Yakubu,Evans Owusu,Lesther Santana,Mohamed Rahouti,Abdellah Chehri,Kaiqi Xiong
关键词-EN: Denial of Service, causing substantial financial, substantial financial losses, systems security, causing substantial
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, IEEE VTC2024-Fall
点击查看摘要
Abstract:Denial of Service (DoS) attacks pose a significant threat in the realm of AI systems security, causing substantial financial losses and downtime. However, AI systems’ high computational demands, dynamic behavior, and data variability make monitoring and detecting DoS attacks challenging. Nowadays, statistical and machine learning (ML)-based DoS classification and detection approaches utilize a broad range of feature selection mechanisms to select a feature subset from networking traffic datasets. Feature selection is critical in enhancing the overall model performance and attack detection accuracy while reducing the training time. In this paper, we investigate the importance of feature selection in improving ML-based detection of DoS attacks. Specifically, we explore feature contribution to the overall components in DoS traffic datasets by utilizing statistical analysis and feature engineering approaches. Our experimental findings demonstrate the usefulness of the thorough statistical analysis of DoS traffic and feature engineering in understanding the behavior of the attack and identifying the best feature selection for ML-based DoS classification and detection.
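As a concrete illustration of this kind of feature-contribution analysis, the snippet below ranks the features of a hypothetical flow-level DoS dataset by mutual information and random-forest importance; the file name and column names are placeholders, not the paper's dataset.

```python
# Illustrative feature-importance analysis for DoS traffic classification.
# Assumes numeric features; "dos_traffic.csv" and "label" are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("dos_traffic.csv")        # hypothetical flow-level dataset
X, y = df.drop(columns=["label"]), df["label"]

# Two complementary views of feature contribution:
mi = mutual_info_classif(X, y)             # statistical dependence with the label
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

ranking = pd.DataFrame({
    "feature": X.columns,
    "mutual_info": mi,
    "rf_importance": rf.feature_importances_,
}).sort_values("rf_importance", ascending=False)
print(ranking.head(10))                    # candidate subset for the detector
```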
[AI-57] Tabular Data Synthesis with Differential Privacy: A Survey
链接: https://arxiv.org/abs/2411.03351
作者: Mengmeng Yang,Chi-Hung Chi,Kwok-Yan Lam,Jie Feng,Taolin Guo,Wei Ni
关键词-EN: leverage diverse datasets, collaborative innovation, enabling organizations, Data, prerequisite for collaborative
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights. In real-world applications like FinTech and Smart Manufacturing, transactional data, often in tabular form, are generated and analyzed for insight generation. However, such datasets typically contain sensitive personal/business information, raising privacy concerns and regulatory risks. Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data, removing direct links to individuals. However, attackers can still infer sensitive information using background knowledge. Differential privacy offers a solution by providing provable and quantifiable privacy protection. Consequently, differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing. This paper provides a comprehensive overview of existing differentially private tabular data synthesis methods, highlighting the unique challenges of each generation model for generating tabular data under differential privacy constraints. We classify the methods into statistical and deep learning-based approaches based on their generation models, discussing them in both centralized and distributed environments. We evaluate and compare those methods within each category, highlighting their strengths and weaknesses in terms of utility, privacy, and computational complexity. Additionally, we present and discuss various evaluation methods for assessing the quality of the synthesized data, identify research gaps in the field and directions for future research.
[AI-58] Undermining Image and Text Classification Algorithms Using Adversarial Attacks
链接: https://arxiv.org/abs/2411.03348
作者: Langalibalele Lunga,Suhas Sreehari
关键词-EN: Synthetic Minority Oversampling, Generative Adversarial Networks, Minority Oversampling Technique, adversarial attacks, Convolutional Neural Network
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for presentation at Electronic Imaging Conference 2025
点击查看摘要
Abstract:Machine learning models are prone to adversarial attacks, where inputs can be manipulated in order to cause misclassifications. While previous research has focused on techniques like Generative Adversarial Networks (GANs), there’s limited exploration of GANs and Synthetic Minority Oversampling Technique (SMOTE) in text and image classification models to perform adversarial attacks. Our study addresses this gap by training various machine learning models and using GANs and SMOTE to generate additional data points aimed at attacking text classification models. Furthermore, we extend our investigation to face recognition models, training a Convolutional Neural Network (CNN) and subjecting it to adversarial attacks with fast gradient sign perturbations on key features identified by GradCAM, a technique used to highlight key image characteristics CNNs use in classification. Our experiments reveal a significant vulnerability in classification models. Specifically, we observe a 20% decrease in accuracy for the top-performing text classification models post-attack, along with a 30% decrease in facial recognition accuracy. This highlights the susceptibility of these models to manipulation of input data. Adversarial attacks not only compromise security but also undermine the reliability of machine learning systems. By showcasing the impact of adversarial attacks on both text classification and face recognition models, our study underscores the urgent need to develop robust defenses against such vulnerabilities.
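The fast gradient sign perturbation used above is compact enough to show directly. Below is a standard FGSM sketch in PyTorch; the model, labels, and epsilon are placeholders, and the abstract's restriction of the perturbation to GradCAM-identified key features is omitted here.

```python
# Minimal FGSM sketch: perturb the input in the direction that most
# increases the classification loss. Model and epsilon are placeholders.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the sign of the gradient to maximally increase the loss.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()      # keep pixels in a valid range
```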
[AI-59] Towards evaluations-based safety cases for AI scheming
链接: https://arxiv.org/abs/2411.03336
作者: Mikita Balesni,Marius Hobbhahn,David Lindner,Alex Meinke,Tomek Korbak,Joshua Clymer,Buck Shlegeris,Jérémy Scheurer,Rusheb Shah,Nicholas Goldowsky-Dill,Dan Braun,Bilal Chughtai,Owain Evans,Daniel Kokotajlo,Lucius Bushnaq
关键词-EN: structured rationale, systems, construct a structured, scheming, frontier AI systems
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We sketch how developers of frontier AI systems could construct a structured rationale – a ‘safety case’ – that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems could pursue misaligned goals covertly, hiding their true capabilities and objectives. In this report, we propose three arguments that safety cases could use in relation to scheming. For each argument we sketch how evidence could be gathered from empirical evaluations, and what assumptions would need to be met to provide strong assurance. First, developers of frontier AI systems could argue that AI systems are not capable of scheming (Scheming Inability). Second, one could argue that AI systems are not capable of posing harm through scheming (Harm Inability). Third, one could argue that control measures around the AI systems would prevent unacceptable outcomes even if the AI systems intentionally attempted to subvert them (Harm Control). Additionally, we discuss how safety cases might be supported by evidence that an AI system is reasonably aligned with its developers (Alignment). Finally, we point out that many of the assumptions required to make these safety arguments have not been confidently satisfied to date and require making progress on multiple open research problems.
[AI-60] Designing Robust Cyber-Defense Agents with Evolving Behavior Trees
链接: https://arxiv.org/abs/2410.16383
作者: Nicholas Potteiger,Ankita Samaddar,Hunter Bergstrom,Xenofon Koutsoukos
关键词-EN: Modern network defense, Modern network, learning-enabled components, offloading tedious, tedious and time-consuming
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 10 pages, 8 figures
点击查看摘要
Abstract:Modern network defense can benefit from the use of autonomous systems, offloading tedious and time-consuming work to agents with standard and learning-enabled components. These agents, operating on critical network infrastructure, need to be robust and trustworthy to ensure defense against adaptive cyber-attackers and, simultaneously, provide explanations for their actions and network activity. However, learning-enabled components typically use models, such as deep neural networks, that are not transparent in their high-level decision-making leading to assurance challenges. Additionally, cyber-defense agents must execute complex long-term defense tasks in a reactive manner that involve coordination of multiple interdependent subtasks. Behavior trees are known to be successful in modelling interpretable, reactive, and modular agent policies with learning-enabled components. In this paper, we develop an approach to design autonomous cyber defense agents using behavior trees with learning-enabled components, which we refer to as Evolving Behavior Trees (EBTs). We learn the structure of an EBT with a novel abstract cyber environment and optimize learning-enabled components for deployment. The learning-enabled components are optimized for adapting to various cyber-attacks and deploying security mechanisms. The learned EBT structure is evaluated in a simulated cyber environment, where it effectively mitigates threats and enhances network visibility. For deployment, we develop a software architecture for evaluating EBT-based agents in computer network defense scenarios. Our results demonstrate that the EBT-based agent is robust to adaptive cyber-attacks and provides high-level explanations for interpreting its decisions and actions.
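To make the behavior-tree substrate concrete, here is a toy Python skeleton of the classic Sequence/Fallback/Action nodes that EBTs build on; the node and task names are invented for illustration, and in an EBT the leaves would wrap learned components.

```python
# Toy behavior-tree skeleton illustrating the reactive, modular policy
# structure that Evolving Behavior Trees extend with learned leaves.
SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

class Sequence:
    """Ticks children in order; fails fast, succeeds only if all succeed."""
    def __init__(self, *children): self.children = children
    def tick(self, blackboard):
        for c in self.children:
            status = c.tick(blackboard)
            if status != SUCCESS:
                return status
        return SUCCESS

class Fallback:
    """Tries children until one succeeds (a 'selector' node)."""
    def __init__(self, *children): self.children = children
    def tick(self, blackboard):
        for c in self.children:
            status = c.tick(blackboard)
            if status != FAILURE:
                return status
        return FAILURE

class Action:
    """Leaf node; in an EBT this could wrap a learned (RL) policy."""
    def __init__(self, fn): self.fn = fn
    def tick(self, blackboard): return self.fn(blackboard)

# Hypothetical cyber-defense tree, ticked every control cycle:
# Fallback(Sequence(Action(detect_intrusion), Action(isolate_host)),
#          Action(monitor_network))
```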
[AI-61] Masked Multi-Query Slot Attention for Unsupervised Object Discovery IJCNN2024
链接: https://arxiv.org/abs/2404.19654
作者: Rishav Pramanik,José-Fabian Villa-Vásquez,Marco Pedersoli
关键词-EN: tackling recognition problems, Unsupervised object discovery, image into entities, Unsupervised object, essential line
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted for presentation at IJCNN 2024
点击查看摘要
Abstract:Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
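The test-time merging step maps naturally onto SciPy's Hungarian solver. The sketch below matches two slot sets by cosine similarity and averages the matched pairs; it is a simplified two-set version of the procedure the abstract describes.

```python
# Sketch of merging slot sets at test time via Hungarian matching.
# Shapes are illustrative; real slot dimensions come from the model.
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_slot_sets(slots_a, slots_b):
    # slots_*: (num_slots, dim) arrays from two independently learned queries.
    a = slots_a / np.linalg.norm(slots_a, axis=1, keepdims=True)
    b = slots_b / np.linalg.norm(slots_b, axis=1, keepdims=True)
    cost = -a @ b.T                          # negative cosine similarity
    row, col = linear_sum_assignment(cost)   # optimal one-to-one matching
    return (slots_a[row] + slots_b[col]) / 2 # average the matched slots
```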
[AI-62] Sub-DM:Subspace Diffusion Model with Orthogonal Decomposition for MRI Reconstruction
链接: https://arxiv.org/abs/2411.03758
作者: Yu Guan,Qinrong Cai,Wei Li,Qiuyun Fan,Dong Liang,Qiegen Liu
关键词-EN: model-based approaches recently, approaches recently achieved, recently achieved re-markable, achieved re-markable success, clinical routine remains
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 11 figures
点击查看摘要
Abstract:Diffusion model-based approaches have recently achieved remarkable success in MRI reconstruction, but their integration into clinical routine remains challenging due to time-consuming convergence. This phenomenon is particularly notable when the conventional diffusion process is applied directly to k-space data without considering the inherent properties of k-space sampling, limiting k-space learning efficiency and image reconstruction quality. To tackle these challenges, we introduce a subspace diffusion model with orthogonal decomposition, a method (referred to as Sub-DM) that restricts the diffusion process via projections onto a subspace as the k-space data distribution evolves toward noise. In particular, the subspace diffusion model circumvents the inference challenges posed by the complex and high-dimensional characteristics of k-space data, and the highly compact subspace ensures that the diffusion process requires only a few simple iterations to produce accurate prior information. Furthermore, the orthogonal decomposition strategy based on the wavelet transform prevents information loss during the migration of the vanilla diffusion process to the subspace. Because the strategy is approximately reversible, the entire process can be reversed. As a result, it allows the diffusion processes in different spaces to refine models through a mutual feedback mechanism, enabling the learning of accurate priors even when dealing with complex k-space data. Comprehensive experiments on different datasets clearly demonstrate the superiority of Sub-DM over state-of-the-art methods in terms of reconstruction speed and quality.
[AI-63] Cross Feature Fusion of Fundus Image and Generated Lesion Map for Referable Diabetic Retinopathy Classification ACCV2024
链接: https://arxiv.org/abs/2411.03618
作者: Dahyun Mok,Junghyun Bum,Le Duc Tai,Hyunseung Choo
关键词-EN: Diabetic Retinopathy, necessitating early detection, necessitating early, detection and diagnosis, early detection
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV 2024 accepted
点击查看摘要
Abstract:Diabetic Retinopathy (DR) is a primary cause of blindness, necessitating early detection and diagnosis. This paper focuses on referable DR classification to enhance the applicability of the proposed method in clinical practice. We develop an advanced cross-learning DR classification method leveraging transfer learning and cross-attention mechanisms. The proposed method employs the Swin U-Net architecture to segment lesion maps from DR fundus images. The Swin U-Net segmentation model, enriched with DR lesion insights, is transferred to generate a lesion map. Both the fundus image and its segmented lesion map are used as complementary inputs for the classification model. A cross-attention mechanism is deployed to improve the model’s ability to capture fine-grained details from the input pairs. Our experiments, utilizing two public datasets, FGADR and EyePACS, demonstrate a superior accuracy of 94.6%, surpassing current state-of-the-art methods by 4.4%. We aim for the proposed method to be seamlessly integrated into clinical workflows, enhancing accuracy and efficiency in identifying referable DR.
[AI-64] Exploring the Potentials and Challenges of Using Large Language Models for the Analysis of Transcriptional Regulation of Long Non-coding RNAs
链接: https://arxiv.org/abs/2411.03522
作者: Wei Wang,Zhichao Hou,Xiaorui Liu,Xinxia Peng
关键词-EN: long non-coding RNAs, garnered significant attention, significant attention due, Research on long, non-coding RNAs
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Research on long non-coding RNAs (lncRNAs) has garnered significant attention due to their critical roles in gene regulation and disease mechanisms. However, the complexity and diversity of lncRNA sequences, along with the limited knowledge of their functional mechanisms and the regulation of their expressions, pose significant challenges to lncRNA studies. Given the tremendous success of large language models (LLMs) in capturing complex dependencies in sequential data, this study aims to systematically explore the potential and limitations of LLMs in the sequence analysis related to the transcriptional regulation of lncRNA genes. Our extensive experiments demonstrated promising performance of fine-tuned genome foundation models on progressively complex tasks. Furthermore, we conducted an insightful analysis of the critical impact of task complexity, model selection, data quality, and biological interpretability for the studies of the regulation of lncRNA gene expression.
[AI-65] Neurons for Neutrons: A Transformer Model for Computation Load Estimation on Domain-Decomposed Neutron Transport Problems
链接: https://arxiv.org/abs/2411.03389
作者: Alexander Mote,Todd Palmer,Lizhong Chen
关键词-EN: reduce memory overhead, large neutron transport, reduce memory, memory overhead, overhead on large
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
*备注: 28 pages, 14 figures
点击查看摘要
Abstract:Domain decomposition is a technique used to reduce memory overhead on large neutron transport problems. Currently, the optimal load-balanced processor allocation for these domains is typically determined through small-scale simulations of the problem, which can be time-consuming for researchers and must be repeated anytime a problem input is changed. We propose a Transformer model with a unique 3D input embedding, and input representations designed for domain-decomposed neutron transport problems, which can predict the subdomain computation loads generated by small-scale simulations. We demonstrate that such a model trained on domain-decomposed Small Modular Reactor (SMR) simulations achieves 98.2% accuracy while being able to skip the small-scale simulation step entirely. Tests of the model’s robustness on variant fuel assemblies, other problem geometries, and changes in simulation parameters are also discussed.
[AI-66] Neural Network Prediction of Strong Lensing Systems with Domain Adaptation and Uncertainty Quantification NEURIPS2024
链接: https://arxiv.org/abs/2411.03334
作者: Shrihan Agarwal,Aleksandra Ćiprijanović,Brian D. Nord
关键词-EN: Modeling strong gravitational, next-generation cosmic surveys, Modeling strong, strong gravitational lenses, data
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to the Machine Learning for Physical Sciences workshop at NeurIPS 2024; 24 pages, 2 figures, 4 tables
点击查看摘要
Abstract:Modeling strong gravitational lenses is computationally expensive for the complex data from modern and next-generation cosmic surveys. Deep learning has emerged as a promising approach for finding lenses and predicting lensing parameters, such as the Einstein radius. Mean-variance Estimators (MVEs) are a common approach for obtaining aleatoric (data) uncertainties from a neural network prediction. However, neural networks have not been demonstrated to perform well on out-of-domain target data successfully - e.g., when trained on simulated data and applied to real, observational data. In this work, we perform the first study of the efficacy of MVEs in combination with unsupervised domain adaptation (UDA) on strong lensing data. The source domain data is noiseless, and the target domain data has noise mimicking modern cosmology surveys. We find that adding UDA to MVE increases the accuracy on the target data by a factor of about two over an MVE model without UDA. Including UDA also permits much more well-calibrated aleatoric uncertainty predictions. Advancements in this approach may enable future applications of MVE models to real observational data.
[AI-67] log-RRIM: Yield Prediction via Local-to-global Reaction Representation Learning and Interaction Modeling
链接: https://arxiv.org/abs/2411.03320
作者: Xiao Hu,Ziqi Chen,Daniel Adu-Ampratwum,Bo Peng,Xia Ning
关键词-EN: potentially reducing time, Accurate prediction, optimizing organic synthesis, potentially reducing, spent on experimentation
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 18 pages, 8 figures
点击查看摘要
Abstract:Accurate prediction of chemical reaction yields is crucial for optimizing organic synthesis, potentially reducing time and resources spent on experimentation. With the rise of artificial intelligence (AI), there is growing interest in leveraging AI-based methods to accelerate yield predictions without conducting in vitro experiments. We present log-RRIM, an innovative graph transformer-based framework designed for predicting chemical reaction yields. Our approach implements a unique local-to-global reaction representation learning strategy. This approach initially captures detailed molecule-level information and then models and aggregates intermolecular interactions, ensuring that the impact of varying-size molecular fragments on yield is accurately accounted for. Another key feature of log-RRIM is its integration of a cross-attention mechanism that focuses on the interplay between reagents and reaction centers. This design reflects a fundamental principle in chemical reactions: the crucial role of reagents in influencing bond-breaking and formation processes, which ultimately affect reaction yields. log-RRIM outperforms existing methods in our experiments, especially for medium to high-yielding reactions, proving its reliability as a predictor. Its advanced modeling of reactant-reagent interactions and sensitivity to small molecular fragments make it a valuable tool for reaction planning and optimization in chemical synthesis. The data and codes of log-RRIM are accessible through this https URL.
计算机视觉
[CV-0] Community Forensics: Using Thousands of Generators to Train Fake Image Detectors
链接: https://arxiv.org/abs/2411.04125
作者: Jeongsoo Park,Andrew Owens
关键词-EN: previously unseen generative, detecting AI-generated images, unseen generative models, key challenges, challenges of detecting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages
点击查看摘要
Abstract:One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior work. As part of creating this dataset, we systematically download thousands of text-to-image latent diffusion models and sample images from them. We also collect images from dozens of popular open source and commercial models. The resulting dataset contains 2.7M images that have been sampled from 4803 different models. These images collectively capture a wide range of scene content, generator architectures, and image processing settings. Using this dataset, we study the generalization abilities of fake image detectors. Our experiments suggest that detection performance improves as the number of models in the training set increases, even when these models have similar architectures. We also find that detection performance improves as the diversity of the models increases, and that our trained detectors generalize better than those trained on other datasets.
[CV-1] Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation
链接: https://arxiv.org/abs/2411.04079
作者: Ke Fan,Jiangning Zhang,Ran Yi,Jingyu Gong,Yabiao Wang,Yating Wang,Xin Tan,Chengjie Wang,Lizhuang Ma
关键词-EN: open-vocabulary motion generation, motion generation, motion, open-vocabulary motion, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL
点击查看摘要
Abstract:Text-to-motion generation is a crucial task in computer vision, which generates the target 3D motion from the given text. The existing annotated datasets are limited in scale, resulting in most existing methods overfitting to the small datasets and being unable to generalize to motions of the open domain. Some methods attempt to solve the open-vocabulary motion generation problem by aligning to the CLIP space or using the Pretrain-then-Finetuning paradigm. However, the current annotated dataset’s limited scale only allows them to achieve mapping from sub-text-space to sub-motion-space, instead of mapping between full-text-space and full-motion-space (full mapping), which is the key to attaining open-vocabulary motion generation. To this end, this paper proposes to leverage the atomic motion (simple body part motions over a short time period) as an intermediate representation, and leverage two orderly coupled steps, i.e., Textual Decomposition and Sub-motion-space Scattering, to address the full mapping problem. For Textual Decomposition, we design a fine-grained description conversion algorithm, and combine it with the generalization ability of a large language model to convert any given motion text into atomic texts. Sub-motion-space Scattering learns the compositional process from atomic motions to the target motions, to make the learned sub-motion-space scattered to form the full-motion-space. For a given motion of the open domain, it transforms the extrapolation into interpolation and thereby significantly improves generalization. Our network, DSO-Net, combines textual decomposition and sub-motion-space scattering to solve open-vocabulary motion generation. Extensive experiments demonstrate that our DSO-Net achieves significant improvements over the state-of-the-art methods on open-vocabulary motion generation. Code is available at this https URL.
[CV-2] H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models
链接: https://arxiv.org/abs/2411.04077
作者: Nhi Pham,Michael Schott
关键词-EN: large vision language, shown significant progress, vision language models, large vision, multi-modal tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Poster at this https URL
点击查看摘要
Abstract:By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.
[CV-3] Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning
链接: https://arxiv.org/abs/2411.04059
作者: Ping Li,Tao Wang,Xinkui Zhao,Xianghua Xu,Mingli Song
关键词-EN: Video captioning generate, few-supervised video captioning, Video captioning, Video, video content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 figures, Accepted in Pattern Recognition
点击查看摘要
Abstract:Video captioning generates a sentence that describes the video content. Existing methods always require a number of captions (e.g., 10 or 20) per video to train the model, which is quite costly. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module. Unlike random sampling in natural language processing, which may cause invalid word edits, the former module guides the model to edit words using a set of actions (e.g., copy, replace, insert, and delete) via a pretrained token-level classifier, and then fine-tunes candidate sentences with a pretrained language model. Meanwhile, it employs repetition-penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences using a pretrained video-text model. Moreover, to keep semantic consistency between pseudo-labeled sentences and video content, we develop a transformer-based keyword refiner with a video-keyword gated fusion strategy to place greater emphasis on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios. The code implementation is available at this https URL
[CV-4] Multi-branch Spatio-Temporal Graph Neural Network For Efficient Ice Layer Thickness Prediction
链接: https://arxiv.org/abs/2411.04055
作者: Zesheng Liu,Maryam Rahnemoonfar
关键词-EN: Understanding spatio-temporal patterns, assessing ice dynamics, ice sheet balance, graph neural network, spatio-temporal graph neural
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Understanding spatio-temporal patterns in polar ice layers is essential for tracking changes in ice sheet balance and assessing ice dynamics. While convolutional neural networks are widely used in learning ice layer patterns from raw echogram images captured by airborne snow radar sensors, noise in the echogram images prevents researchers from getting high-quality results. Instead, we focus on geometric deep learning using graph neural networks, aiming to build a spatio-temporal graph neural network that learns from thickness information of the top ice layers and predicts deeper layers. In this paper, we developed a novel multi-branch spatio-temporal graph neural network that uses the GraphSAGE framework for spatial feature learning and a temporal convolution operation to capture temporal changes, enabling different branches of the network to be more specialized, each focusing on a single learning task. We found that our proposed multi-branch network can consistently outperform the current fused spatio-temporal graph neural network in both accuracy and efficiency.
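A minimal sketch of such a two-branch block is shown below, assuming PyTorch Geometric for the GraphSAGE layer; the feature sizes and the concatenation-based fusion are illustrative guesses, not the paper's exact design.

```python
# Hedged sketch of a two-branch spatio-temporal block: GraphSAGE for the
# spatial branch and a 1-D convolution over time for the temporal branch.
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class SpatioTemporalBranches(nn.Module):
    def __init__(self, feat=8, hidden=32):
        super().__init__()
        self.spatial = SAGEConv(feat, hidden)            # spatial branch
        self.temporal = nn.Conv1d(feat, hidden, kernel_size=3, padding=1)

    def forward(self, x_seq, edge_index):
        # x_seq: (num_nodes, time_steps, feat) thickness features per node.
        s = self.spatial(x_seq[:, -1, :], edge_index)    # latest snapshot
        t = self.temporal(x_seq.transpose(1, 2)).mean(dim=2)  # temporal trend
        return torch.cat([s, t], dim=1)                  # fuse both branches
```

Keeping the two branches separate lets each specialize, which is the design rationale the abstract gives for the multi-branch layout.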
[CV-5] Local vs distributed representations: What is the right basis for interpretability?
链接: https://arxiv.org/abs/2411.03993
作者: Julien Colin,Lore Goetschalckx,Thomas Fel,Victor Boutin,Jay Gopal,Thomas Serre,Nuria Oliver
关键词-EN: maximally activate individual, activate individual neurons, deep neural networks, individual neurons, individual neurons tend
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Much of the research on the interpretability of deep neural networks has focused on studying the visual features that maximally activate individual neurons. However, recent work has cast doubt on the usefulness of such local representations for understanding the behavior of deep neural networks because individual neurons tend to respond to multiple unrelated visual patterns, a phenomenon referred to as “superposition”. A promising alternative to disentangle these complex patterns is learning sparsely distributed vector representations from entire network layers, as the resulting basis vectors seemingly encode single identifiable visual patterns consistently. Thus, one would expect the resulting code to align better with human perceivable visual patterns, but supporting evidence remains, at best, anecdotal. To fill this gap, we conducted three large-scale psychophysics experiments with data collected from a pool of 560 participants. Our findings provide (i) strong evidence that features obtained from sparse distributed representations are easier for human observers to interpret and (ii) that this effect is more pronounced in the deepest layers of a neural network. Complementary analyses also reveal that (iii) features derived from sparse distributed representations contribute more to the model’s decision. Overall, our results highlight that distributed representations constitute a superior basis for interpretability, underscoring a need for the field to move beyond the interpretation of local neural codes in favor of sparsely distributed ones.
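One common recipe for obtaining such sparsely distributed representations is a sparse autoencoder trained on a layer's activations with an L1 penalty on the codes; the sketch below illustrates that recipe under assumed hyperparameters (the paper does not prescribe this exact implementation).

```python
# Hedged sketch of learning a sparsely distributed basis from a layer's
# activations via a sparse autoencoder; hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, act_dim, dict_size):
        super().__init__()
        self.enc = nn.Linear(act_dim, dict_size)
        self.dec = nn.Linear(dict_size, act_dim)

    def forward(self, a):
        z = torch.relu(self.enc(a))        # nonnegative sparse codes
        return self.dec(z), z

def loss_fn(model, acts, l1_coef=1e-3):
    recon, z = model(acts)
    # Reconstruction fidelity plus an L1 penalty that encourages each
    # basis vector to capture a single recurring visual pattern.
    return (recon - acts).pow(2).mean() + l1_coef * z.abs().mean()
```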
[CV-6] ET-SEED: Efficient Trajectory-Level SE(3) Equivariant Diffusion Policy
链接: https://arxiv.org/abs/2411.03990
作者: Chenrui Tie,Yue Chen,Ruihai Wu,Boxuan Dong,Zeyi Li,Chongkai Gao,Hao Dong
关键词-EN: Imitation learning, manipulation tasks, proven effective, robotic manipulation tasks, Imitation
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accept to CoRL 2024 Workshop on X-Embodiment Robot Learning
点击查看摘要
Abstract:Imitation learning, e.g., diffusion policy, has been proven effective in various robotic manipulation tasks. However, extensive demonstrations are required for policy robustness and generalization. To reduce the demonstration reliance, we leverage spatial symmetry and propose ET-SEED, an efficient trajectory-level SE(3) equivariant diffusion model for generating action sequences in complex robot manipulation tasks. Further, previous equivariant diffusion models require per-step equivariance in the Markov process, making it difficult to learn a policy under such strong constraints. We theoretically extend equivariant Markov kernels and simplify the condition of the equivariant diffusion process, thereby significantly improving training efficiency for trajectory-level SE(3) equivariant diffusion policy in an end-to-end manner. We evaluate ET-SEED on representative robotic manipulation tasks, involving rigid, articulated, and deformable objects. Experiments demonstrate superior data efficiency and manipulation proficiency of our proposed method, as well as its ability to generalize to unseen configurations with only a few demonstrations. Website: this https URL
[CV-7] ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models
链接: https://arxiv.org/abs/2411.03982
作者: Ashutosh Srivastava,Tarun Ram Menta,Abhinav Java,Avadhoot Jadhav,Silky Singh,Surgan Jandial,Balaji Krishnamurthy
关键词-EN: high-quality photorealistic images, Diffusion models, revolutionized image editing, enabling the generation, generation of high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: First three authors contributed equally to this work
点击查看摘要
Abstract:Modern Text-to-Image (T2I) Diffusion models have revolutionized image editing by enabling the generation of high-quality photorealistic images. While the de facto method for performing edits with T2I models is through text instructions, this approach is non-trivial due to the complex many-to-many mapping between natural language and images. In this work, we address exemplar-based image editing – the task of transferring an edit from an exemplar pair to a content image(s). We propose ReEdit, a modular and efficient end-to-end framework that captures edits in both text and image modalities while ensuring the fidelity of the edited image. We validate the effectiveness of ReEdit through extensive comparisons with state-of-the-art baselines and sensitivity analyses of key design choices. Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively. Additionally, ReEdit boasts high practical applicability, as it does not require any task-specific optimization and is four times faster than the next best baseline.
[CV-8] HRDecoder: High-Resolution Decoder Network for Fundus Image Lesion Segmentation MICCAI2024
链接: https://arxiv.org/abs/2411.03976
作者: Ziyuan Ding,Yixiong Liang,Shichao Kan,Qing Liu
关键词-EN: GPU memory costs, incurs considerable GPU, diminishing performance gains, considerable GPU memory, High resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 3 figures, accepted by MICCAI 2024, the revised version
点击查看摘要
Abstract:High resolution is crucial for precise segmentation in fundus images, yet handling high-resolution inputs incurs considerable GPU memory costs, with diminishing performance gains as overhead increases. To address this issue while tackling the challenge of segmenting tiny objects, recent studies have explored local-global fusion methods. These methods preserve fine details using local regions and capture long-range context information from downscaled global images. However, the necessity of multiple forward passes inevitably incurs significant computational overhead, adversely affecting inference speed. In this paper, we propose HRDecoder, a simple High-Resolution Decoder network for fundus lesion segmentation. It integrates a high-resolution representation learning module to capture fine-grained local features and a high-resolution fusion module to fuse multi-scale predictions. Our method effectively improves the overall segmentation accuracy of fundus lesions while keeping memory and computational overhead reasonable and maintaining satisfactory inference speed. Experimental results on the IDRID and DDR datasets demonstrate the effectiveness of our method. Code is available at this https URL.
[CV-9] Face Reconstruction from Face Embeddings using Adapter to a Face Foundation Model
链接: https://arxiv.org/abs/2411.03960
作者: Hatef Otroshi Shahreza,Anjith George,Sébastien Marcel
关键词-EN: Face recognition, face recognition models, Face, face images, recognition systems extract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Face recognition systems extract embedding vectors from face images and use these embeddings to verify or identify individuals. Face reconstruction attack (also known as template inversion) refers to reconstructing face images from face embeddings and using the reconstructed face image to enter a face recognition system. In this paper, we propose to use a face foundation model to reconstruct face images from the embeddings of a blackbox face recognition model. The foundation model is trained with 42M images to generate face images from the facial embeddings of a fixed face recognition model. We propose to use an adapter to translate target embeddings into the embedding space of the foundation model. The generated images are evaluated on different face recognition models and different datasets, demonstrating the effectiveness of our method to translate embeddings of different face recognition models. We also evaluate the transferability of reconstructed face images when attacking different face recognition models. Our experimental results show that our reconstructed face images outperform previous reconstruction attacks against face recognition models.
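The adapter idea is simple to sketch: a small MLP trained to map the black-box model's embeddings into the foundation model's embedding space. The layer sizes below are assumptions, not the paper's architecture.

```python
# Illustrative sketch of an embedding-space adapter; dimensions are
# placeholders (512-d embeddings are common but not guaranteed here).
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    def __init__(self, src_dim=512, dst_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dst_dim),
        )

    def forward(self, e):           # e: target (black-box) face embedding
        return self.net(e)          # mapped into the generator's input space

# Training (sketch): for faces embedded under both models, minimize
# || adapter(e_blackbox) - e_foundation ||^2, then feed adapted embeddings
# to the frozen face-generation foundation model.
```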
[CV-10] Act in Collusion: A Persistent Distributed Multi-Target Backdoor in Federated Learning
链接: https://arxiv.org/abs/2411.03926
作者: Tao Liu,Wu Yang,Chen Xu,Jiguang Lv,Huanran Wang,Yuhang Zhang,Shuchun Xu,Dapeng Man
关键词-EN: protect data privacy, Federated learning, data privacy, backdoor attacks due, paradigm designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Federated learning, a novel paradigm designed to protect data privacy, is vulnerable to backdoor attacks due to its distributed nature. Current research often designs attacks based on a single attacker with a single backdoor, overlooking more realistic and complex threats in federated learning. We propose a more practical threat model for federated learning: the distributed multi-target backdoor. In this model, multiple attackers control different clients, embedding various triggers and targeting different classes, collaboratively implanting backdoors into the global model via central aggregation. Empirical validation shows that existing methods struggle to maintain the effectiveness of multiple backdoors in the global model. Our key insight is that similar backdoor triggers cause parameter conflicts and injecting new backdoors disrupts gradient directions, significantly weakening the performance of some backdoors. To solve this, we propose a Distributed Multi-Target Backdoor Attack (DMBA), ensuring efficiency and persistence of backdoors from different malicious clients. To avoid parameter conflicts, we design a multi-channel dispersed frequency trigger strategy to maximize trigger differences. To mitigate gradient interference, we introduce backdoor replay in local training to neutralize conflicting gradients. Extensive validation shows that 30 rounds after the attack, the Attack Success Rates of three different backdoors from various clients remain above 93%. The code will be made publicly available after the review period.
[CV-11] Self-supervised Representation Learning for Cell Event Recognition through Time Arrow Prediction
链接: https://arxiv.org/abs/2411.03924
作者: Cangxiong Chen,Vinay P. Namboodiri,Julia E. Sero
关键词-EN: data poses challenges, microscopy data poses, fundamental in bioimaging, spatio-temporal nature, data poses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:The spatio-temporal nature of live-cell microscopy data poses challenges in the analysis of cell states, which is fundamental in bioimaging. Deep-learning based segmentation or tracking methods rely on large amounts of high-quality annotations to work effectively. In this work, we explore an alternative solution: using feature maps obtained from self-supervised representation learning (SSRL) on time arrow prediction (TAP) for the downstream supervised task of cell event recognition. We demonstrate through extensive experiments and analysis that this approach can achieve better performance with limited annotation compared to models trained end to end with a fully supervised approach. Our analysis also provides insight into applications of SSRL using TAP in live-cell microscopy.
[CV-12] FedRISE: Rating Induced Sign Election of Gradients for Byzantine Tolerant Federated Aggregation
链接: https://arxiv.org/abs/2411.03861
作者: Joseph Geo Benjamin,Mothilal Asokan,Mohammad Yaqub,Karthik Nandakumar
关键词-EN: common defense strategies, robust aggregator mechanism, common defense, defense strategies, strategies against model
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: This is a work under submission/review process
点击查看摘要
Abstract:One of the most common defense strategies against model poisoning in federated learning is to employ a robust aggregator mechanism that makes the training more resilient. Many of the existing Byzantine robust aggregators provide theoretical guarantees and are empirically effective against certain categories of attacks. However, we observe that certain high-strength attacks can subvert the aggregator and collapse the training. In addition, most aggregators require identifying tolerant settings to converge. The impact of attacks becomes more pronounced when the number of Byzantine clients approaches a majority, and attacks become harder to evade if the attacker is omniscient, with access to data, honest updates, and aggregation methods. Motivated by these observations, we develop a robust aggregator called FedRISE for cross-silo FL that is consistent and less susceptible to poisoning updates by an omniscient attacker. The proposed method explicitly determines the optimal direction of each gradient through a sign-voting strategy that uses variance-reduced sparse gradients. We argue that vote weighting based on the cosine similarity of raw gradients is misleading, and we introduce a sign-based gradient valuation function that ignores the gradient magnitude. We compare our method against 8 robust aggregators under 6 poisoning attacks on 3 datasets and architectures. Our results show that existing robust aggregators collapse for at least some attacks under severe settings, while FedRISE demonstrates better robustness because of a stringent gradient inclusion formulation.
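A simplified NumPy illustration of coordinate-wise sign election follows. It captures the magnitude-ignoring spirit of the method (essentially a majority sign vote, in the style of signSGD) rather than FedRISE's exact valuation function with variance-reduced sparse gradients.

```python
# Hedged sketch of sign-voting aggregation: each client votes on the sign
# of every coordinate, so a few clients with huge (possibly poisoned)
# gradients cannot dominate the update. A simplification, not FedRISE.
import numpy as np

def sign_vote_aggregate(client_grads, lr=0.01):
    # client_grads: list of 1-D gradient vectors, one per client.
    G = np.stack(client_grads)             # (num_clients, num_params)
    votes = np.sign(G)                     # each client votes -1 / 0 / +1
    elected = np.sign(votes.sum(axis=0))   # majority sign per coordinate
    # Step with a fixed magnitude in the elected direction.
    return lr * elected
```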
[CV-13] An Edge Computing-Based Solution for Real-Time Leaf Disease Classification using Thermal Imaging
链接: https://arxiv.org/abs/2411.03835
作者: Públio Elon Correa da Silva,Jurandy Almeida
关键词-EN: improving food safety, improving crop health, crop health monitoring, technologies can transform, monitoring and management
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Deep learning (DL) technologies can transform agriculture by improving crop health monitoring and management, thus improving food safety. In this paper, we explore the potential of edge computing for real-time classification of leaf diseases using thermal imaging. We present a thermal image dataset for plant disease classification and evaluate deep learning models, including InceptionV3, MobileNetV1, MobileNetV2, and VGG-16, on resource-constrained devices like the Raspberry Pi 4B. Using pruning and quantization-aware training, these models achieve inference times up to 1.48x faster on Edge TPU Max for VGG16, and up to 2.13x faster with precision reduction on Intel NCS2 for MobileNetV1, compared to high-end GPUs like the RTX 3090, while maintaining state-of-the-art accuracy.
[CV-14] An Enhancement of Haar Cascade Algorithm Applied to Face Recognition for Gate Pass Security
链接: https://arxiv.org/abs/2411.03831
作者: Clarence A. Antipona,Romeo R. Magsino,Raymund M. Dioses,Khatalyn E. Mata
关键词-EN: Haar Cascade Algorithm, Haar Cascade, Enhanced Haar Cascade, Cascade Algorithm, Cascade
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:This study is focused on enhancing the Haar Cascade Algorithm to decrease the false positive and false negative rates in face matching and face detection, increasing the accuracy rate even under challenging conditions. The face recognition library was implemented with the Haar Cascade Algorithm, in which the 128-dimensional vectors representing the unique features of a face are encoded. A subprocess was applied in which the grayscale image from the Haar Cascade was converted to RGB to improve face encoding. Logical processing and face filtering were also used to decrease non-face detections. The Enhanced Haar Cascade Algorithm produced a 98.39% accuracy rate (21.39% increase), 63.59% precision rate, 98.30% recall rate, and 72.23% in F1 Score. In comparison, the Haar Cascade Algorithm achieved a 46.70% to 77.00% accuracy rate, 44.15% precision rate, 98.61% recall rate, and 47.01% in F1 Score. Both algorithms used the Confusion Matrix Test with 301,950 comparisons using the same dataset of 550 images. The 98.39% accuracy rate shows a significant decrease in false positive and false negative rates in facial recognition. Face matching and face detection are more accurate in images with complex backgrounds, lighting variations, and occlusions, or even those with similar attributes.
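The described pipeline composes well-known OpenCV and face_recognition calls; a minimal sketch follows, with the image path and the enrolled-encoding store as placeholders.

```python
# Sketch of the described pipeline: Haar cascade detection, then conversion
# of the grayscale crop back to RGB before 128-d encoding. Paths and the
# enrolled-encoding store are placeholders.
import cv2
import face_recognition

known_encodings = []  # enrolled 128-d vectors, loaded elsewhere in a real system

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("gate_camera_frame.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    crop = gray[y:y + h, x:x + w]
    rgb = cv2.cvtColor(crop, cv2.COLOR_GRAY2RGB)   # the RGB-conversion subprocess
    encs = face_recognition.face_encodings(rgb)    # 128-d feature vectors
    if encs:
        # Compare against enrolled encodings; a distance threshold filters
        # non-matches (the "face filtering" step in the abstract).
        matches = face_recognition.compare_faces(known_encodings, encs[0])
```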
[CV-15] Generalize or Detect? Towards Robust Semantic Segmentation Under Multiple Distribution Shifts NEURIPS2024
链接: https://arxiv.org/abs/2411.03829
作者: Zhitong Gao,Bingnan Li,Mathieu Salzmann,Xuming He
关键词-EN: detect anomaly classes, ideal segmentation model, open-world scenarios, ideal segmentation, anomaly classes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in NeurIPS 2024
点击查看摘要
Abstract:In open-world scenarios, where both novel classes and domains may exist, an ideal segmentation model should detect anomaly classes for safety and generalize to new domains. However, existing methods often struggle to distinguish between domain-level and semantic-level distribution shifts, leading to poor out-of-distribution (OOD) detection or domain generalization performance. In this work, we aim to equip the model to generalize effectively to covariate-shift regions while precisely identifying semantic-shift regions. To achieve this, we design a novel generative augmentation method to produce coherent images that incorporate both anomaly (or novel) objects and various covariate shifts at both image and object levels. Furthermore, we introduce a training strategy that recalibrates uncertainty specifically for semantic shifts and enhances the feature extractor to align features associated with domain shifts. We validate the effectiveness of our method across benchmarks featuring both semantic and domain shifts. Our method achieves state-of-the-art performance across all benchmarks for both OOD detection and domain generalization. Code is available at this https URL.
[CV-16] SA3DIP: Segment Any 3D Instance with Potential 3D Priors
链接: https://arxiv.org/abs/2411.03819
作者: Xi Yang,Xu Gu,Xingyilang Yin,Xinbo Gao
关键词-EN: sparked research, research into adapting, foundation models, instance segmentation, segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:The proliferation of 2D foundation models has sparked research into adapting them for open-world 3D instance segmentation. Recent methods introduce a paradigm that leverages superpoints as geometric primitives and incorporates 2D multi-view masks from the Segment Anything Model (SAM) as merging guidance, achieving outstanding zero-shot instance segmentation results. However, the limited use of 3D priors restricts the segmentation performance. Previous methods calculate the 3D superpoints solely based on estimated normals from spatial coordinates, resulting in under-segmentation for instances with similar geometry. Besides, the heavy reliance on SAM and hand-crafted algorithms in 2D space suffers from over-segmentation due to SAM’s inherent part-level segmentation tendency. To address these issues, we propose SA3DIP, a novel method for Segmenting Any 3D Instances via exploiting potential 3D Priors. Specifically, on one hand, we generate complementary 3D primitives based on both geometric and textural priors, which reduces the initial errors that accumulate in subsequent procedures. On the other hand, we introduce supplemental constraints from the 3D space by using a 3D detector to guide a further merging process. Furthermore, we notice a considerable portion of low-quality ground truth annotations in the ScanNetV2 benchmark, which affects fair evaluation. Thus, we present ScanNetV2-INS with complete ground truth labels and supplement additional instances for 3D class-agnostic instance segmentation. Experimental evaluations on various 2D-3D datasets demonstrate the effectiveness and robustness of our approach. Our code and proposed ScanNetV2-INS dataset are available HERE.
[CV-17] Harmformer: Harmonic Networks Meet Transformers for Continuous Roto-Translation Equivariance NEURIPS2024
链接: https://arxiv.org/abs/2411.03794
作者: Tomáš Karella,Adam Harmanec,Jan Kotera,Jan Blažek,Filip Šroubek
关键词-EN: CNNs exhibit inherent, faster learning, CNNs exhibit, leading to efficient, data usage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Appears in NeurIPS 2024 Workshop on Symmetry and Geometry in Neural Representations
点击查看摘要
Abstract:CNNs exhibit inherent equivariance to image translation, leading to efficient parameter and data usage, faster learning, and improved robustness. The concept of translation equivariant networks has been successfully extended to rotation transformation using group convolution for discrete rotation groups and harmonic functions for the continuous rotation group encompassing the full 360°. We explore the compatibility of the self-attention (SA) mechanism with full rotation equivariance, in contrast to previous studies that focused on discrete rotation. We introduce the Harmformer, a harmonic transformer with a convolutional stem that achieves equivariance for both translation and continuous rotation. Accompanied by an end-to-end equivariance proof, the Harmformer not only outperforms previous equivariant transformers, but also demonstrates inherent stability under any continuous rotation, even without seeing rotated samples during training.
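这里的“连续旋转等变性”可以用数值方式直观检验:先旋转输入再过网络,应与先过网络再旋转输出近似可交换。下面是一个极简的检验草图(假设模型把图像张量映射为同为图像形状的标量特征图;矢量值的谐波特征还需额外的相位校正;插值误差使该检验只能近似成立,这并非论文的官方测试代码):

```python
import torch
import torchvision.transforms.functional as TF

def equivariance_error(model, x, angle):
    """Relative gap between f(rot(x)) and rot(f(x)).

    Sketch only: assumes image-shaped scalar outputs; vector-valued
    harmonic features would also need a per-channel phase correction.
    """
    with torch.no_grad():
        a = model(TF.rotate(x, angle))   # rotate input, then apply model
        b = TF.rotate(model(x), angle)   # apply model, then rotate output
    return ((a - b).norm() / b.norm()).item()
```

对一个真正旋转等变的网络,该误差应随插值精度趋近于 0,且与训练时是否见过旋转样本无关。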
[CV-18] Deferred Poisoning: Making the Model More Vulnerable via Hessian Singularization
链接: https://arxiv.org/abs/2411.03752
作者: Yuhao He,Jinyu Tian,Xianwei Zheng,Li Dong,Yuanman Li,Leo Yu Zhang,Jiantao Zhou
关键词-EN: Recent studies, deep learning models, poisoning attack, studies have shown, shown that deep
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Recent studies have shown that deep learning models are very vulnerable to poisoning attacks. Many defense methods have been proposed to address this issue. However, traditional poisoning attacks are not as threatening as commonly believed. This is because they often cause differences in how the model performs on the training set compared to the validation set. Such inconsistency can alert defenders that their data has been poisoned, allowing them to take the necessary defensive actions. In this paper, we introduce a more threatening type of poisoning attack called the Deferred Poisoning Attack. This new attack allows the model to function normally during the training and validation phases but makes it very sensitive to evasion attacks or even natural noise. We achieve this by ensuring the poisoned model’s loss function has a similar value as a normally trained model at each input sample but with a large local curvature. A similar model loss ensures that there is no obvious inconsistency between the training and validation accuracy, demonstrating high stealthiness. On the other hand, the large curvature implies that a small perturbation may cause a significant increase in model loss, leading to substantial performance degradation, i.e., worse robustness. We fulfill this purpose by making the model have a singular Hessian at the optimal point via our proposed Singularization Regularization term. We have conducted both theoretical and empirical analyses of the proposed method and validated its effectiveness through experiments on image classification tasks. Furthermore, we have confirmed the hazards of this form of poisoning attack under more general scenarios using natural noise, offering a new perspective for research in the field of security.
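论文的核心直觉是:中毒后的模型在各样本处损失值与正常模型相近,但局部曲率极大,因而对微小扰动极度敏感。下面用随机方向的有限差分给出一个“局部尖锐度”诊断草图(这是我们为说明思路而写的简化探针,并非论文提出的 Singularization Regularization 项本身):

```python
import torch

def local_sharpness(model, criterion, x, y, eps=1e-2, n_probes=8):
    """Finite-difference proxy for the local curvature of the loss around x.

    Hypothetical diagnostic: a model poisoned in the sense above should
    score much higher here than a cleanly trained one, despite a similar
    base loss value.
    """
    with torch.no_grad():
        base = criterion(model(x), y).item()
        rises = []
        for _ in range(n_probes):
            d = torch.randn_like(x)
            d = eps * d / d.norm()       # small random perturbation direction
            rises.append(criterion(model(x + d), y).item() - base)
    return sum(rises) / len(rises)
```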
[CV-19] Homotopy Continuation Made Easy: Regression-based Online Simulation of Starting Problem-Solution Pairs
链接: https://arxiv.org/abs/2411.03745
作者: Xinyue Zhang,Zijia Dai,Wanting Xu,Laurent Kneip
关键词-EN: automatically generated polynomial, generated polynomial elimination, polynomial elimination templates, sparked great progress, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:While automatically generated polynomial elimination templates have sparked great progress in the field of 3D computer vision, there remain many problems for which the degree of the constraints or the number of unknowns leads to intractability. In recent years, homotopy continuation has been introduced as a plausible alternative. However, the method currently depends on expensive parallel tracking of all possible solutions in the complex domain, or a classification network for starting problem-solution pairs trained over a limited set of real-world examples. Our innovation consists of employing a regression network trained in simulation to directly predict a solution from input correspondences, followed by an online simulator that invents a consistent problem-solution pair. Subsequently, homotopy continuation is applied to track that single solution back to the original problem. We apply this elegant combination to generalized camera resectioning, and also introduce a new solution to the challenging generalized relative pose and scale problem. As demonstrated, the proposed method successfully compensates the raw error committed by the regressor alone, and leads to state-of-the-art efficiency and success rates while running on CPU resources, only.
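同伦延拓的基本机制可以用一个一维多项式例子说明:从已知根的起始问题 g 出发,沿 h(z, t) = (1-t)·γ·g(z) + t·f(z) 把 t 从 0 逐步推到 1,每步用牛顿迭代校正,最终落在目标问题 f 的某个根上。以下为概念性演示(γ 为随机复常数,即标准的 “gamma trick”,用于几乎必然地避开路径碰撞;这只是帮助理解的玩具实现,并非论文的求解器):

```python
import numpy as np

def homotopy_track(f, df, g, dg, z0, gamma=0.4 + 0.9j, steps=200, newton_iters=5):
    """Track a known root of g to a root of f along a straight-line homotopy."""
    z = complex(z0)
    for t in np.linspace(0.0, 1.0, steps + 1)[1:]:
        for _ in range(newton_iters):                 # Newton correction
            h = (1 - t) * gamma * g(z) + t * f(z)
            dh = (1 - t) * gamma * dg(z) + t * df(z)
            z -= h / dh
    return z

# Example: from g(z) = z^3 - 1 (known root z = 1) to f(z) = z^3 - 2z + 2
f, df = lambda z: z**3 - 2*z + 2, lambda z: 3*z**2 - 2
g, dg = lambda z: z**3 - 1,       lambda z: 3*z**2
root = homotopy_track(f, df, g, dg, z0=1.0)
print(root, abs(f(root)))                             # residual should be ~0
```

论文的创新点正是用回归网络直接预测目标问题的近似解,再在线构造一个与之严格一致的起始问题,从而只需跟踪这一条路径。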
[CV-20] NeurIPS 2023 Competition: Privacy Preserving Federated Learning Document VQA
链接: https://arxiv.org/abs/2411.03730
作者: Marlon Tobaben,Mohamed Ali Souibgui,Rubèn Tito,Khanh Nguyen,Raouf Kerkouche,Kangsoo Jung,Joonas Jälkö,Lei Kang,Andrey Barsky,Vincent Poulain d’Andecy,Aurélie Joseph,Aashiq Muhamed,Kevin Kuo,Virginia Smith,Yusuke Yamasaki,Takumi Fukami,Kenta Niwa,Iifan Tyou,Hiro Ishii,Rio Yokota,Ragul N,Rintu Kutum,Josep Llados,Ernest Valveny,Antti Honkela,Mario Fritz,Dimosthenis Karatzas
关键词-EN: Learning Document VQA, Preserving Federated Learning, Privacy Preserving Federated, Document VQA, develop provably private
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 27 pages, 6 figures
点击查看摘要
Abstract:The Privacy Preserving Federated Learning Document VQA (PFL-DocVQA) competition challenged the community to develop provably private and communication-efficient solutions in a federated setting for a real-life use case: invoice processing. The competition introduced a dataset of real invoice documents, along with associated questions and answers requiring information extraction and reasoning over the document images. Thereby, it brings together researchers and expertise from the document analysis, privacy, and federated learning communities. Participants fine-tuned a pre-trained, state-of-the-art Document Visual Question Answering model provided by the organizers for this new domain, mimicking a typical federated invoice processing setup. The base model is a multi-modal generative language model, and sensitive information could be exposed through either the visual or textual input modality. Participants proposed elegant solutions to reduce communication costs while maintaining a minimum utility threshold in track 1 and to protect all information from each document provider using differential privacy in track 2. The competition served as a new testbed for developing and testing private federated learning methods, simultaneously raising awareness about privacy within the document image analysis and recognition community. Ultimately, the competition analysis provides best practices and recommendations for successfully running privacy-focused federated learning challenges in the future.
[CV-21] Efficient Fourier Filtering Network with Contrastive Learning for UAV-based Unaligned Bi-modal Salient Object Detection
链接: https://arxiv.org/abs/2411.03728
作者: Pengfei Lyu,Pak-Hei Yeung,Xiufei Cheng,Xiaosheng Yu,Chengdong Wu,Jagath C. Rajapakse
关键词-EN: Unmanned aerial vehicle, salient object detection, segment salient objects, thermal image pairs, scene utilizing complementary
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures
点击查看摘要
Abstract:Unmanned aerial vehicle (UAV)-based bi-modal salient object detection (BSOD) aims to segment salient objects in a scene utilizing complementary cues in unaligned RGB and thermal image pairs. However, the high computational expense of existing UAV-based BSOD models limits their applicability to real-world UAV devices. To address this problem, we propose an efficient Fourier filter network with contrastive learning that achieves both real-time and accurate performance. Specifically, we first design a semantic contrastive alignment loss to align the two modalities at the semantic level, which facilitates mutual refinement in a parameter-free way. Second, inspired by the fast Fourier transform that obtains global relevance in linear complexity, we propose synchronized alignment fusion, which aligns and fuses bi-modal features in the channel and spatial dimensions by a hierarchical filtering mechanism. Our proposed model, AlignSal, reduces the number of parameters by 70.0%, decreases the floating point operations by 49.4%, and increases the inference speed by 152.5% compared to the cutting-edge BSOD model (i.e., MROS). Extensive experiments on the UAV RGB-T 2400 and three weakly aligned datasets demonstrate that AlignSal achieves both real-time inference speed and better performance and generalizability compared to sixteen state-of-the-art BSOD models across most evaluation metrics. In addition, our ablation studies further verify AlignSal’s potential in boosting the performance of existing aligned BSOD models on UAV-based unaligned data. The code is available at: this https URL.
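摘要中“借助快速傅里叶变换以线性复杂度获得全局相关性”的思路,可以用一个极简的频域融合模块示意:对两种模态特征做 rfft2,用可学习的逐频率权重做凸组合后再逆变换。这只是为说明思想而写的简化版本,并非 AlignSal 中 synchronized alignment fusion 的实际实现:

```python
import torch
import torch.nn as nn

class FourierFuse(nn.Module):
    """Toy frequency-domain fusion of two modality feature maps (sketch)."""
    def __init__(self, channels, height, width):
        super().__init__()
        # one real-valued mixing weight per channel and frequency bin
        self.logit = nn.Parameter(torch.zeros(channels, height, width // 2 + 1))

    def forward(self, rgb, thermal):
        Fr = torch.fft.rfft2(rgb, norm="ortho")       # (B, C, H, W//2+1), complex
        Ft = torch.fft.rfft2(thermal, norm="ortho")
        w = torch.sigmoid(self.logit)                  # broadcast over the batch
        fused = w * Fr + (1 - w) * Ft                  # convex per-frequency mix
        return torch.fft.irfft2(fused, s=rgb.shape[-2:], norm="ortho")

out = FourierFuse(64, 32, 32)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```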
[CV-22] PX2Tooth: Reconstructing the 3D Point Cloud Teeth from a Single Panoramic X-ray
链接: https://arxiv.org/abs/2411.03725
作者: Wen Ma,Huikai Wu,Zikai Xiao,Yang Feng,Jian Wu,Zuozhu Liu
关键词-EN: effectively reduce radiation, reduce radiation risks, anatomical structures, oral cavity, remains a critical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Ma W, Wu H, Xiao Z, et al. PX2Tooth: Reconstructing the 3D Point Cloud Teeth from a Single Panoramic X-Ray[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2024: 411-421
点击查看摘要
Abstract:Reconstructing the 3D anatomical structures of the oral cavity, which originally reside in the cone-beam CT (CBCT), from a single 2D Panoramic X-ray(PX) remains a critical yet challenging task, as it can effectively reduce radiation risks and treatment costs during the diagnostic in digital dentistry. However, current methods are either error-prone or only trained/evaluated on small-scale datasets (less than 50 cases), resulting in compromised trustworthiness. In this paper, we propose PX2Tooth, a novel approach to reconstruct 3D teeth using a single PX image with a two-stage framework. First, we design the PXSegNet to segment the permanent teeth from the PX images, providing clear positional, morphological, and categorical information for each tooth. Subsequently, we design a novel tooth generation network (TGNet) that learns to transform random point clouds into 3D teeth. TGNet integrates the segmented patch information and introduces a Prior Fusion Module (PFM) to enhance the generation quality, especially in the root apex region. Moreover, we construct a dataset comprising 499 pairs of CBCT and Panoramic X-rays. Extensive experiments demonstrate that PX2Tooth can achieve an Intersection over Union (IoU) of 0.793, significantly surpassing previous methods, underscoring the great potential of artificial intelligence in digital dentistry.
[CV-23] Estimation of Psychosocial Work Environment Exposures Through Video Object Detection. Proof of Concept Using CCTV Footage
链接: https://arxiv.org/abs/2411.03724
作者: Claus D. Hansen,Thuy Hai Le,David Campos
关键词-EN: computer vision algorithms, CCTV footage, paper examines, computer vision, estimate aspects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 9 figures, presented at IWOAR 9th International Workshop on Sensor-Based Activity Recognition and Artificial Intelligence, September 26-27, Potsdam, Germany
点击查看摘要
Abstract:This paper examines the use of computer vision algorithms to estimate aspects of the psychosocial work environment using CCTV footage. We present a proof of concept for a methodology that detects and tracks people in video footage and estimates interactions between customers and employees by estimating their poses and calculating the duration of their encounters. We propose a pipeline that combines existing object detection and tracking algorithms (YOLOv8 and DeepSORT) with a pose estimation algorithm (BlazePose) to estimate the number of customers and employees in the footage as well as the duration of their encounters. We use a simple rule-based approach to classify the interactions as positive, neutral or negative based on three different criteria: distance, duration and pose. The proposed methodology is tested on a small dataset of CCTV footage. While the data is quite limited, in particular with respect to the quality of the footage, we chose this case because it represents a typical setting where the method could be applied. The results show that the object detection and tracking part of the pipeline has reasonable performance on the dataset, with a high degree of recall and reasonable accuracy. At this stage, the pose estimation is still too limited to fully detect the type of interactions, due to difficulties in tracking employees in the footage. We conclude that the method is a promising alternative to self-reported measures of the psychosocial work environment and could be used in future studies to obtain external observations of the work environment.
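文中基于距离、时长、姿态三个准则的规则式交互分类,大致可以写成如下形式(阈值均为假设的占位值,并非论文标定的参数):

```python
from dataclasses import dataclass

@dataclass
class Encounter:
    distance_m: float   # mean customer-employee distance
    duration_s: float   # length of the encounter
    facing: bool        # pose-derived: are the two oriented toward each other?

def classify_encounter(e, near_m=1.5, short_s=10.0, long_s=120.0):
    """Toy rule-based labelling; thresholds are made-up placeholders."""
    if e.facing and e.distance_m <= near_m and short_s <= e.duration_s <= long_s:
        return "positive"   # engaged, reasonably long interaction
    if e.duration_s > long_s and not e.facing:
        return "negative"   # prolonged encounter without engagement
    return "neutral"

print(classify_encounter(Encounter(distance_m=1.0, duration_s=30.0, facing=True)))
```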
[CV-24] These Maps Are Made by Propagation: Adapting Deep Stereo Networks to Road Scenarios with Decisive Disparity Diffusion
链接: https://arxiv.org/abs/2411.03717
作者: Chuang-Wei Liu,Yikang Zhang,Qijun Chen,Ioannis Pitas,Rui Fan
关键词-EN: garnering significant attention, garnering significant, efficiency and accuracy, cost-effective solution, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 7 figures
点击查看摘要
Abstract:Stereo matching has emerged as a cost-effective solution for road surface 3D reconstruction, garnering significant attention towards improving both computational efficiency and accuracy. This article introduces decisive disparity diffusion (D3Stereo), marking the first exploration of dense deep feature matching that adapts pre-trained deep convolutional neural networks (DCNNs) to previously unseen road scenarios. A pyramid of cost volumes is initially created using various levels of learned representations. Subsequently, a novel recursive bilateral filtering algorithm is employed to aggregate these costs. A key innovation of D3Stereo lies in its alternating decisive disparity diffusion strategy, wherein intra-scale diffusion is employed to complete sparse disparity images, while inter-scale inheritance provides valuable prior information for higher resolutions. Extensive experiments conducted on our created UDTIRI-Stereo and Stereo-Road datasets underscore the effectiveness of D3Stereo strategy in adapting pre-trained DCNNs and its superior performance compared to all other explicit programming-based algorithms designed specifically for road surface 3D reconstruction. Additional experiments conducted on the Middlebury dataset with backbone DCNNs pre-trained on the ImageNet database further validate the versatility of D3Stereo strategy in tackling general stereo matching problems.
[CV-25] Explaining Human Activity Recognition with SHAP: Validating Insights with Perturbation and Quantitative Measures
链接: https://arxiv.org/abs/2411.03714
作者: Felix Tempel,Espen Alexander F. Ihlen,Lars Adde,Inga Strümke
关键词-EN: Human Activity Recognition, Human Activity, Graph Convolution Networks, body key points, Activity Recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In Human Activity Recognition (HAR), understanding the intricacy of body movements within high-risk applications is essential. This study uses SHapley Additive exPlanations (SHAP) to explain the decision-making process of Graph Convolution Networks (GCNs) when classifying activities with skeleton data. We employ SHAP to explain two real-world datasets: one for cerebral palsy (CP) classification and the widely used NTU RGB+D 60 action recognition dataset. To test the explanation, we introduce a novel perturbation approach that modifies the model’s edge importance matrix, allowing us to evaluate the impact of specific body key points on prediction outcomes. To assess the fidelity of our explanations, we employ informed perturbation, targeting body key points identified as important by SHAP and comparing them against random perturbation as a control condition. This perturbation enables a judgment on whether the body key points are truly influential or non-influential based on the SHAP values. Results on both datasets show that body key points identified as important through SHAP have the largest influence on the accuracy, specificity, and sensitivity metrics. Our findings highlight that SHAP can provide granular insights into the input feature contribution to the prediction outcome of GCNs in HAR tasks. This demonstrates the potential for more interpretable and trustworthy models in high-stakes applications like healthcare or rehabilitation.
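文中“知情扰动 vs. 随机扰动”的校验逻辑可以概括为:屏蔽 SHAP 认为最重要的 k 个关键点,并与随机屏蔽 k 个关键点的对照组比较指标下降幅度。以下是一个示意性框架(其中 eval_with_masked 是假想的回调,代表“将指定关键点在边重要性矩阵中的项置零后重新评估模型并返回精度”):

```python
import numpy as np

def perturbation_check(shap_per_keypoint, eval_with_masked, k=3, trials=20, seed=0):
    """Informed vs. random keypoint masking (sketch of the protocol above)."""
    rng = np.random.default_rng(seed)
    n = len(shap_per_keypoint)
    informed = np.argsort(np.abs(shap_per_keypoint))[-k:]   # top-k by |SHAP|
    acc_informed = eval_with_masked(informed)
    acc_random = np.mean([eval_with_masked(rng.choice(n, k, replace=False))
                          for _ in range(trials)])          # control condition
    # if SHAP is faithful, informed masking should hurt noticeably more
    return acc_informed, acc_random
```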
[CV-26] 3DGS-CD: 3D Gaussian Splatting-based Change Detection for Physical Object Rearrangement
链接: https://arxiv.org/abs/2411.03706
作者: Ziqi Lu,Jianbo Ye,John Leonard
关键词-EN: Gaussian Splatting, physical object rearrangements, detecting physical object, detecting physical, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:We present 3DGS-CD, the first 3D Gaussian Splatting (3DGS)-based method for detecting physical object rearrangements in 3D scenes. Our approach estimates 3D object-level changes by comparing two sets of unaligned images taken at different times. Leveraging 3DGS’s novel view rendering and EfficientSAM’s zero-shot segmentation capabilities, we detect 2D object-level changes, which are then associated and fused across views to estimate 3D changes. Our method can detect changes in cluttered environments using sparse post-change images within as little as 18s, using as few as a single new image. It does not rely on depth input, user instructions, object classes, or object models – An object is recognized simply if it has been re-arranged. Our approach is evaluated on both public and self-collected real-world datasets, achieving up to 14% higher accuracy and three orders of magnitude faster performance compared to the state-of-the-art radiance-field-based change detection method. This significant performance boost enables a broad range of downstream applications, where we highlight three key use cases: object reconstruction, robot workspace reset, and 3DGS model update. Our code and data will be made available at this https URL.
[CV-27] Graph-Based Multi-Modal Sensor Fusion for Autonomous Driving ICPR’24
链接: https://arxiv.org/abs/2411.03702
作者: Depanshu Sani,Saket Anand
关键词-EN: integrating multiple sensing, growing demand, demand for robust, mobile robotics, highlighted the importance
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: An extended abstract accepted at Young Researchers’ Symposium, ICVGIP '24. This extended abstract contains the following: 1. Short summary of our work, SAGA-KF, accepted at ICPR’24. 2. A proposal that was awarded the Qualcomm Innovation Fellowship’24
点击查看摘要
Abstract:The growing demand for robust scene understanding in mobile robotics and autonomous driving has highlighted the importance of integrating multiple sensing modalities. By combining data from diverse sensors like cameras and LIDARs, fusion techniques can overcome the limitations of individual sensors, enabling a more complete and accurate perception of the environment. We introduce a novel approach to multi-modal sensor fusion, focusing on developing a graph-based state representation that supports critical decision-making processes in autonomous driving. We present a Sensor-Agnostic Graph-Aware Kalman Filter [3], the first online state estimation technique designed to fuse multi-modal graphs derived from noisy multi-sensor data. The estimated graph-based state representations serve as a foundation for advanced applications like Multi-Object Tracking (MOT), offering a comprehensive framework for enhancing the situational awareness and safety of autonomous systems. We validate the effectiveness of our proposed framework through extensive experiments conducted on both synthetic and real-world driving datasets (nuScenes). Our results showcase an improvement in MOTA and a reduction in estimated position errors (MOTP) and identity switches (IDS) for tracked objects using the SAGA-KF. Furthermore, we highlight the capability of such a framework to develop methods that can leverage heterogeneous information (like semantic objects and geometric structures) from various sensing modalities, enabling a more holistic approach to scene understanding and enhancing the safety and effectiveness of autonomous systems.
[CV-28] OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction
链接: https://arxiv.org/abs/2411.03696
作者: Ji Zhang,Yiran Ding,Zixin Liu
关键词-EN: surrounding environment, autonomous driving, crucial for finely, finely representing, representing the surrounding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:3D semantic occupancy prediction is crucial for finely representing the surrounding environment, which is essential for ensuring the safety in autonomous driving. Existing fusion-based occupancy methods typically involve performing a 2D-to-3D view transformation on image features, followed by computationally intensive 3D operations to fuse these with LiDAR features, leading to high computational costs and reduced accuracy. Moreover, current research on occupancy prediction predominantly focuses on designing specific network architectures, often tailored to particular models, with limited attention given to the more fundamental aspect of semantic feature learning. This gap hinders the development of more transferable methods that could enhance the performance of various occupancy models. To address these challenges, we propose OccLoff, a framework that Learns to Optimize Feature Fusion for 3D occupancy prediction. Specifically, we introduce a sparse fusion encoder with entropy masks that directly fuses 3D and 2D features, improving model accuracy while reducing computational overhead. Additionally, we propose a transferable proxy-based loss function and an adaptive hard sample weighting algorithm, which enhance the performance of several state-of-the-art methods. Extensive evaluations on the nuScenes and SemanticKITTI benchmarks demonstrate the superiority of our framework, and ablation studies confirm the effectiveness of each proposed module.
[CV-29] AMNCutter: Affinity-Attention-Guided Multi-View Normalized Cutter for Unsupervised Surgical Instrument Segmentation WACV2025
链接: https://arxiv.org/abs/2411.03695
作者: Mingyu Sheng,Jianan Fan,Dongnan Liu,Ron Kikinis,Weidong Cai
关键词-EN: Surgical instrument segmentation, minimally invasive surgery, identifying surgical instruments, robotic-assisted minimally invasive, endoscopic video frames
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was accepted by the 2025 IEEE Winter Conference on Applications of Computer Vision (WACV)
点击查看摘要
Abstract:Surgical instrument segmentation (SIS) is pivotal for robotic-assisted minimally invasive surgery, assisting surgeons by identifying surgical instruments in endoscopic video frames. Recent unsupervised surgical instrument segmentation (USIS) methods primarily rely on pseudo-labels derived from low-level features such as color and optical flow, but these methods show limited effectiveness and generalizability in complex and unseen endoscopic scenarios. In this work, we propose a label-free unsupervised model featuring a novel module named Multi-View Normalized Cutter (m-NCutter). Different from previous USIS works, our model is trained using a graph-cutting loss function that leverages patch affinities for supervision, eliminating the need for pseudo-labels. The framework adaptively determines which affinities from which levels should be prioritized. Therefore, the low- and high-level features and their affinities are effectively integrated to train a label-free unsupervised model, showing superior effectiveness and generalization ability. We conduct comprehensive experiments across multiple SIS datasets to validate our approach’s state-of-the-art (SOTA) performance, robustness, and exceptional potential as a pre-trained model. Our code is released at this https URL.
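以 patch 亲和度作监督的图割损失,其经典可微形式是软化的 Normalized Cut(如 W-Net 所用)。下面给出该经典形式的草图,仅作为理解 m-NCutter 目标的简化参照,并非论文的多视角、多层级实现:

```python
import torch

def soft_ncut_loss(P, W, eps=1e-8):
    """Classical differentiable (soft) normalized-cut loss.

    P: (N, K) soft cluster assignments (rows sum to 1),
    W: (N, N) non-negative patch affinities.
    """
    K = P.shape[1]
    assoc = torch.einsum("ik,ij,jk->k", P, W, P)   # within-cluster affinity
    degree = W.sum(dim=1)                           # node degrees
    volume = torch.einsum("ik,i->k", P, degree)     # soft cluster volume
    return K - (assoc / volume.clamp_min(eps)).sum()
```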
[CV-30] Where Do We Stand with Implicit Neural Representations? A Technical and Performance Survey
链接: https://arxiv.org/abs/2411.03688
作者: Amer Essakine,Yanqi Cheng,Chun-Wun Cheng,Lipei Zhang,Zhongying Deng,Lei Zhu,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero
关键词-EN: Implicit Neural Representations, Implicit Neural, Neural Representations, offering exceptional flexibility, continuous implicit functions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Implicit Neural Representations (INRs) have emerged as a paradigm in knowledge representation, offering exceptional flexibility and performance across a diverse range of applications. INRs leverage multilayer perceptrons (MLPs) to model data as continuous implicit functions, providing critical advantages such as resolution independence, memory efficiency, and generalisation beyond discretised data structures. Their ability to solve complex inverse problems makes them particularly effective for tasks including audio reconstruction, image representation, 3D object reconstruction, and high-dimensional data synthesis. This survey provides a comprehensive review of state-of-the-art INR methods, introducing a clear taxonomy that categorises them into four key areas: activation functions, position encoding, combined strategies, and network structure optimisation. We rigorously analyse their critical properties, such as full differentiability, smoothness, compactness, and adaptability to varying resolutions while also examining their strengths and limitations in addressing locality biases and capturing fine details. Our experimental comparison offers new insights into the trade-offs between different approaches, showcasing the capabilities and challenges of the latest INR techniques across various tasks. In addition to identifying areas where current methods excel, we highlight key limitations and potential avenues for improvement, such as developing more expressive activation functions, enhancing positional encoding mechanisms, and improving scalability for complex, high-dimensional data. This survey serves as a roadmap for researchers, offering practical guidance for future exploration in the field of INRs. We aim to foster new methodologies by outlining promising research directions for INRs and applications.
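综述归纳的第一类要素(激活函数)可用最经典的周期激活 INR(SIREN)来体会:坐标经正弦激活的 MLP 直接映射为信号值。以下草图采用 SIREN 论文的初始化方案,超参数为常用默认值,仅供参考:

```python
import torch
import torch.nn as nn

class Siren(nn.Module):
    """Minimal coordinate network with sine activations (SIREN-style)."""
    def __init__(self, in_dim=2, hidden=256, out_dim=3, layers=3, w0=30.0):
        super().__init__()
        self.w0 = w0
        dims = [in_dim] + [hidden] * layers + [out_dim]
        self.linears = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims, dims[1:]))
        with torch.no_grad():                       # SIREN initialization scheme
            for i, lin in enumerate(self.linears):
                bound = 1 / lin.in_features if i == 0 \
                        else (6 / lin.in_features) ** 0.5 / w0
                lin.weight.uniform_(-bound, bound)

    def forward(self, coords):                      # coords in [-1, 1]^in_dim
        x = coords
        for lin in self.linears[:-1]:
            x = torch.sin(self.w0 * lin(x))
        return self.linears[-1](x)                  # e.g. RGB at each coordinate

# Fitting an image: regress model(pixel_coords) onto pixel values with MSE.
```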
[CV-31] Structure Consistent Gaussian Splatting with Matching Prior for Few-shot Novel View Synthesis NEURIPS2024
链接: https://arxiv.org/abs/2411.03637
作者: Rui Peng,Wangze Xu,Luyang Tang,Liwei Liao,Jianbo Jiao,Ronggang Wang
关键词-EN: Neural Radiance Fields, Radiance Fields, Neural Radiance, suffer significant degradation, Consistent Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024 Accepted
点击查看摘要
Abstract:Despite the substantial progress of novel view synthesis, existing methods, either based on the Neural Radiance Fields (NeRF) or more recently 3D Gaussian Splatting (3DGS), suffer significant degradation when the input becomes sparse. Numerous efforts have been introduced to alleviate this problem, but they still struggle to synthesize satisfactory results efficiently, especially in large scenes. In this paper, we propose SCGaussian, a Structure Consistent Gaussian Splatting method using matching priors to learn 3D consistent scene structure. Considering the high interdependence of Gaussian attributes, we optimize the scene structure in two respects: rendering geometry and, more importantly, the positions of Gaussian primitives, which are hard to constrain directly in the vanilla 3DGS due to their unstructured nature. To achieve this, we present a hybrid Gaussian representation. Besides the ordinary non-structure Gaussian primitives, our model also consists of ray-based Gaussian primitives that are bound to matching rays and whose positions are optimized only along those rays. Thus, we can utilize the matching correspondence to directly enforce the position of these Gaussian primitives to converge to the surface points where rays intersect. Extensive experiments on forward-facing, surrounding, and complex large scenes show the effectiveness of our approach with state-of-the-art performance and high efficiency. Code is available at this https URL.
[CV-32] LCP-Fusion: A Neural Implicit SLAM with Enhanced Local Constraints and Computable Prior IROS2024
链接: https://arxiv.org/abs/2411.03610
作者: Jiahui Wang,Yinan Deng,Yi Yang,Yufeng Yue
关键词-EN: dense Simultaneous Localization, shown impressive progress, Recently the dense, dense Simultaneous, Simultaneous Localization
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
点击查看摘要
Abstract:Recently, dense Simultaneous Localization and Mapping (SLAM) based on neural implicit representation has shown impressive progress in hole filling and high-fidelity mapping. Nevertheless, existing methods either heavily rely on known scene bounds or suffer inconsistent reconstruction due to drift in potential loop-closure regions, or both, which can be attributed to the inflexible representation and lack of local constraints. In this paper, we present LCP-Fusion, a neural implicit SLAM system with enhanced local constraints and computable prior, which takes the sparse voxel octree structure containing feature grids and SDF priors as hybrid scene representation, enabling scalability and robustness during mapping and tracking. To enhance the local constraints, we propose a novel sliding window selection strategy based on visual overlap to address the loop-closure, and a practical warping loss to constrain relative poses. Moreover, we estimate SDF priors as coarse initialization for implicit features, which brings additional explicit constraints and robustness, especially when a light but efficient adaptive early ending is adopted. Experiments demonstrate that our method achieves better localization accuracy and reconstruction consistency than existing RGB-D implicit SLAM, especially in challenging real scenes (ScanNet) as well as self-captured scenes with unknown scene bounds. The code is available at this https URL.
[CV-33] Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data NEURIPS2024
链接: https://arxiv.org/abs/2411.03561
作者: Seunggeun Chi,Pin-Hao Huang,Enna Sachdeva,Hengbo Ma,Karthik Ramani,Kwonjoon Lee
关键词-EN: camera wearer, egocentric videos, wearer from egocentric, hand, pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024
点击查看摘要
Abstract:We study the problem of estimating the body movements of a camera wearer from egocentric videos. Current methods for ego-body pose estimation rely on temporally dense sensor data, such as IMU measurements from spatially sparse body parts like the head and hands. However, we propose that even temporally sparse observations, such as hand poses captured intermittently from egocentric videos during natural or periodic hand movements, can effectively constrain overall body motion. Naively applying diffusion models to generate full-body pose from head pose and sparse hand pose leads to suboptimal results. To overcome this, we develop a two-stage approach that decomposes the problem into temporal completion and spatial completion. First, our method employs masked autoencoders to impute hand trajectories by leveraging the spatiotemporal correlations between the head pose sequence and intermittent hand poses, providing uncertainty estimates. Subsequently, we employ conditional diffusion models to generate plausible full-body motions based on these temporally dense trajectories of the head and hands, guided by the uncertainty estimates from the imputation. The effectiveness of our method was rigorously tested and validated through comprehensive experiments conducted on various HMD setups with the AMASS and Ego-Exo4D datasets.
[CV-34] Object and Contact Point Tracking in Demonstrations Using 3D Gaussian Splatting
链接: https://arxiv.org/abs/2411.03555
作者: Michael Büttner,Jonathan Francis,Helge Rhodin,Andrew Melnik
关键词-EN: enhance Interactive Imitation, Interactive Imitation Learning, extracting touch interaction, touch interaction points, Interactive Imitation
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: CoRL 2024, Workshop on Lifelong Learning for Home Robots, Munich, Germany
点击查看摘要
Abstract:This paper introduces a method to enhance Interactive Imitation Learning (IIL) by extracting touch interaction points and tracking object movement from video demonstrations. The approach extends current IIL systems by providing robots with detailed knowledge of both where and how to interact with objects, particularly complex articulated ones like doors and drawers. By leveraging cutting-edge techniques such as 3D Gaussian Splatting and FoundationPose for tracking, this method allows robots to better understand and manipulate objects in dynamic environments. The research lays the foundation for more effective task learning and execution in autonomous robotic systems.
[CV-35] Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
链接: https://arxiv.org/abs/2411.03554
作者: Yingzi Ma,Jiongxiao Wang,Fei Wang,Siyuan Ma,Jiazhao Li,Xiujun Li,Furong Huang,Lichao Sun,Bo Li,Yejin Choi,Muhao Chen,Chaowei Xiao
关键词-EN: Vision Language Models, VLM unlearning, forgetting specific information, Machine unlearning, VLM unlearning algorithms
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Machine unlearning has emerged as an effective strategy for forgetting specific information in the training data. However, with the increasing integration of visual data, privacy concerns in Vision Language Models (VLMs) remain underexplored. To address this, we introduce the Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms under the Right to be Forgotten setting. Specifically, we formulate the VLM unlearning task via constructing the Fictitious Facial Identity VQA dataset and apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels. In terms of evaluation, since VLMs support various ways of asking questions with the same semantic meaning, we also provide robust evaluation metrics, including membership inference attacks and carefully designed adversarial privacy attacks, to evaluate the performance of algorithms. Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance, with significant trade-offs between model utility and forget quality. Furthermore, our findings also highlight the importance of privacy attacks for robust evaluations. We hope FIUBench will drive progress in developing more effective VLM unlearning algorithms.
[CV-36] Beyond Complete Shapes: A quantitative Evaluation of 3D Shape Matching Algorithms
链接: https://arxiv.org/abs/2411.03511
作者: Viktoria Ehm,Nafie El Amrani,Yizheng Xie,Lennart Bastian,Maolin Gao,Weikang Wang,Lu Sang,Dongliang Cao,Zorah Lähner,Daniel Cremers,Florian Bernard
关键词-EN: Finding correspondences, computer vision, long-standing problem, problem in computer, shape matching
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. While approaches based on machine learning dominate modern 3D shape matching, almost all existing (learning-based) methods require that at least one of the involved shapes is complete. In contrast, the most challenging and arguably most practically relevant setting of matching partially observed shapes is currently underexplored. One important factor is that existing datasets contain only a small number of shapes (typically below 100), which are unable to serve data-hungry machine learning approaches, particularly in the unsupervised regime. In addition, the type of partiality present in existing datasets is often artificial and far from realistic. To address these limitations and to encourage research on these relevant settings, we provide a generic and flexible framework for the procedural generation of challenging partial shape matching scenarios. Our framework allows for a virtually infinite generation of partial shape matching instances from a finite set of shapes with complete geometry. Further, we manually create cross-dataset correspondences between seven existing (complete geometry) shape matching datasets, leading to a total of 2543 shapes. Based on this, we propose several challenging partial benchmark settings, for which we evaluate respective state-of-the-art methods as baselines.
[CV-37] SynthSet: Generative Diffusion Model for Semantic Segmentation in Precision Agriculture
链接: https://arxiv.org/abs/2411.03505
作者: Andrew Heschl,Mauricio Murillo,Keyhan Najafian,Farhad Maleki
关键词-EN: Generative Adversarial Networks, Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, synthetic annotated data, generating synthetic annotated
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:This paper introduces a methodology for generating synthetic annotated data to address data scarcity in semantic segmentation tasks within the precision agriculture domain. Utilizing Denoising Diffusion Probabilistic Models (DDPMs) and Generative Adversarial Networks (GANs), we propose a dual diffusion model architecture for synthesizing realistic annotated agricultural data, without any human intervention. We employ super-resolution to enhance the phenotypic characteristics of the synthesized images and their coherence with the corresponding generated masks. We showcase the utility of the proposed method for wheat head segmentation. The high quality of synthesized data underscores the effectiveness of the proposed methodology in generating image-mask pairs. Furthermore, models trained on our generated data exhibit promising performance when tested on an external, diverse dataset of real wheat fields. The results show the efficacy of the proposed methodology for addressing data scarcity for semantic segmentation tasks. Moreover, the proposed approach can be readily adapted for various segmentation tasks in precision agriculture and beyond.
[CV-38] An Application-Agnostic Automatic Target Recognition System Using Vision Language Models
链接: https://arxiv.org/abs/2411.03491
作者: Anthony Palladino,Dana Gajewski,Abigail Aronica,Patryk Deptula,Alexander Hamme,Seiyoung C. Lee,Jeff Muri,Todd Nelling,Michael A. Riley,Brian Wong,Margaret Duff
关键词-EN: Automatic Target Recognition, open-vocabulary object detection, Target Recognition, Automatic Target, classification models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to the Thirty-Seventh Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-25)
点击查看摘要
Abstract:We present a novel Automatic Target Recognition (ATR) system using open-vocabulary object detection and classification models. A primary advantage of this approach is that target classes can be defined just before runtime by a non-technical end user, using either a few natural language text descriptions of the target, or a few image exemplars, or both. Nuances in the desired targets can be expressed in natural language, which is useful for unique targets with little or no training data. We also implemented a novel combination of several techniques to improve performance, such as leveraging the additional information in the sequence of overlapping frames to perform tubelet identification (i.e., sequential bounding box matching), bounding box re-scoring, and tubelet linking. Additionally, we developed a technique to visualize the aggregate output of many overlapping frames as a mosaic of the area scanned during the aerial surveillance or reconnaissance, and a kernel density estimate (or heatmap) of the detected targets. We initially applied this ATR system to the use case of detecting and clearing unexploded ordnance on airfield runways, and we are currently extending our research to other real-world applications.
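摘要中的 tubelet 识别,即跨帧的顺序包围框匹配,可以用基于 IoU 的贪心链接来示意(通用简化草图,并非该系统的实际实现,也不含其 re-scoring 步骤):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tubelets(frames, thr=0.5):
    """Greedy sequential box matching. `frames` is a list of per-frame box
    lists; returns tubelets as lists of (frame_idx, box)."""
    tubelets = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tube in tubelets:
            last_t, last_box = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue                  # tube already ended, or no boxes left
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= thr:
                tube.append((t, best))
                unmatched.remove(best)
        tubelets.extend([[(t, b)] for b in unmatched])   # start new tubelets
    return tubelets
```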
[CV-39] Rainfall regression from C-band Synthetic Aperture Radar using Multi-Task Generative Adversarial Networks
链接: https://arxiv.org/abs/2411.03480
作者: Aurélien Colin,Romain Husson
关键词-EN: Synthetic Aperture Radar, Synthetic Aperture, estimate precipitation rates, rates from Synthetic, Aperture Radar
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 36 pages, 13 figures
点击查看摘要
Abstract:This paper introduces a data-driven approach to estimate precipitation rates from Synthetic Aperture Radar (SAR) at a spatial resolution of 200 meters per pixel. It addresses previous challenges related to the collocation of SAR and weather radar data, specifically the misalignment in collocations and the scarcity of rainfall examples under strong wind. To tackle these challenges, the paper proposes a multi-objective formulation, introducing patch-level components and an adversarial component. It exploits the full NEXRAD archive to look for potential collocations with Sentinel-1 data. With additional enhancements to the training procedure and the incorporation of additional inputs, the resulting model demonstrates improved accuracy in rainfall estimates and the ability to extend its performance to high-wind scenarios of up to 15 m/s.
[CV-40] Self Supervised Networks for Learning Latent Space Representations of Human Body Scans and Motions
链接: https://arxiv.org/abs/2411.03475
作者: Emmanuel Hartman,Nicolas Charon,Martin Bauer
关键词-EN: paper introduces self-supervised, introduces self-supervised neural, human body analysis, self-supervised neural network, latent space
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 11 figures, 6 tables
点击查看摘要
Abstract:This paper introduces self-supervised neural network models to tackle several fundamental problems in the field of 3D human body analysis and processing. First, we propose VariShaPE (Varifold Shape Parameter Estimator), a novel architecture for the retrieval of latent space representations of body shapes and poses. This network offers a fast and robust method to estimate the embedding of arbitrary unregistered meshes into the latent space. Second, we complement the estimation of latent codes with MoGeN (Motion Geometry Network) a framework that learns the geometry on the latent space itself. This is achieved by lifting the body pose parameter space into a higher dimensional Euclidean space in which body motion mini-sequences from a training set of 4D data can be approximated by simple linear interpolation. Using the SMPL latent space representation we illustrate how the combination of these network models, once trained, can be used to perform a variety of tasks with very limited computational cost. This includes operations such as motion interpolation, extrapolation and transfer as well as random shape and pose generation.
[CV-41] Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding WACV2025
链接: https://arxiv.org/abs/2411.03405
作者: Sombit Dey,Ozan Unal,Christos Sakaridis,Luc Van Gool
关键词-EN: accompanying language description, visual grounding consists, visual grounding, referred instance, consists of identifying
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV 2025
点击查看摘要
Abstract:3D visual grounding consists of identifying the instance in a 3D scene which is referred by an accompanying language description. While several architectures have been proposed within the commonly employed grounding-by-selection framework, the utilized losses are comparatively under-explored. In particular, most methods rely on a basic supervised cross-entropy loss on the predicted distribution over candidate instances, which fails to model both spatial relations between instances and the internal fine-grained word-level structure of the verbal referral. Sparse attempts to additionally supervise verbal embeddings globally by learning the class of the referred instance from the description or employing verbo-visual contrast to better separate instance embeddings do not fundamentally lift the aforementioned limitations. Responding to these shortcomings, we introduce two novel losses for 3D visual grounding: a visual-level offset loss on regressed vector offsets from each instance to the ground-truth referred instance and a language-related span loss on predictions for the word-level span of the referred instance in the description. In addition, we equip the verbo-visual fusion module of our new 3D visual grounding architecture AsphaltNet with a top-down bidirectional attentive fusion block, which enables the supervisory signals from our two losses to propagate to the respective converse branches of the network and thus aid the latter to learn context-aware instance embeddings and grounding-aware verbal embeddings. AsphaltNet proposes novel auxiliary losses to aid 3D visual grounding with competitive results compared to the state-of-the-art on the ReferIt3D benchmark.
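两个新损失的大致形态可以用如下草图说明:offset loss 让每个实例回归一个指向被指称实例中心的向量,span loss 则对描述文本做词级起止位置分类。张量形状与函数名均为示意性假设,并非 AsphaltNet 的真实接口:

```python
import torch
import torch.nn.functional as F

def grounding_aux_losses(pred_offsets, inst_centers, gt_center,
                         start_logits, end_logits, gt_start, gt_end):
    """Sketch of the two auxiliary losses described above (illustrative shapes).

    pred_offsets, inst_centers: (N, 3); gt_center: (3,)
    start_logits, end_logits: (L,) over description tokens
    gt_start, gt_end: LongTensors of shape (1,), e.g. torch.tensor([s])
    """
    # every instance regresses a vector pointing at the referred instance
    offset_loss = F.smooth_l1_loss(pred_offsets, gt_center.unsqueeze(0) - inst_centers)
    # word-level span of the referred instance in the description
    span_loss = (F.cross_entropy(start_logits.unsqueeze(0), gt_start) +
                 F.cross_entropy(end_logits.unsqueeze(0), gt_end))
    return offset_loss, span_loss
```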
[CV-42] Enhancing Maritime Situational Awareness through End-to-End Onboard Raw Data Analysis
链接: https://arxiv.org/abs/2411.03403
作者: Roberto Del Prete,Manuel Salvoldi,Domenico Barretta,Nicolas Longépé,Gabriele Meoni,Arnon Karnieli,Maria Daniela Graziano,Alfredo Renga
关键词-EN: efficient rapid response, rapid response, crucial for time-sensitive, timely and efficient, efficient rapid
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 38 pages
点击查看摘要
Abstract:Satellite-based onboard data processing is crucial for time-sensitive applications requiring timely and efficient rapid response. Advances in edge artificial intelligence are shifting computational power from ground-based centers to on-orbit platforms, transforming the “sensing-communication-decision-feedback” cycle and reducing latency from acquisition to delivery. The current research presents a framework addressing the strict bandwidth, energy, and latency constraints of small satellites, focusing on maritime monitoring. The study contributes three main innovations. Firstly, it investigates the application of deep learning techniques for direct ship detection and classification from raw satellite imagery. By simplifying the onboard processing chain, our approach facilitates direct analyses without requiring computationally intensive steps such as calibration and ortho-rectification. Secondly, to address the scarcity of raw satellite data, we introduce two novel datasets, VDS2Raw and VDV2Raw, which are derived from raw data from Sentinel-2 and Vegetation and Environment Monitoring New Micro Satellite (VENuS) missions, respectively, and enriched with Automatic Identification System (AIS) records. Thirdly, we characterize the tasks’ optimal single and multiple spectral band combinations through statistical and feature-based analyses validated on both datasets. In sum, we demonstrate the feasibility of the proposed method through a proof-of-concept on CubeSat-like hardware, confirming the models’ potential for operational satellite-based maritime monitoring.
[CV-43] Synomaly Noise and Multi-Stage Diffusion: A Novel Approach for Unsupervised Anomaly Detection in Ultrasound Imaging
链接: https://arxiv.org/abs/2411.04004
作者: Yuan Bi,Lucie Huang,Ricarda Clarenbach,Reza Ghotbi,Angelos Karlas,Nassir Navab,Zhongliang Jiang
关键词-EN: routine clinical practice, clinical practice due, multi-stage diffusion process, unsupervised anomaly detection, imaging is widely
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Ultrasound (US) imaging is widely used in routine clinical practice due to its advantages of being radiation-free, cost-effective, and portable. However, the low reproducibility and quality of US images, combined with the scarcity of expert-level annotation, make the training of fully supervised segmentation models challenging. To address these issues, we propose a novel unsupervised anomaly detection framework based on a diffusion model that incorporates a synthetic anomaly (Synomaly) noise function and a multi-stage diffusion process. Synomaly noise introduces synthetic anomalies into healthy images during training, allowing the model to effectively learn anomaly removal. The multi-stage diffusion process is introduced to progressively denoise images, preserving fine details while improving the quality of anomaly-free reconstructions. The generated high-fidelity counterfactual healthy images can further enhance the interpretability of the segmentation models, as well as provide a reliable baseline for evaluating the extent of anomalies and supporting clinical decision-making. Notably, the unsupervised anomaly detection model is trained purely on healthy images, eliminating the need for anomalous training samples and pixel-level annotations. We validate the proposed approach on carotid US, brain MRI, and liver CT datasets. The experimental results demonstrate that the proposed framework outperforms existing state-of-the-art unsupervised anomaly detection methods, achieving performance comparable to fully supervised segmentation models in the US dataset. Additionally, ablation studies underline the importance of hyperparameter selection for Synomaly noise and the effectiveness of the multi-stage diffusion process in enhancing model performance.
[CV-44] Zero-shot Dynamic MRI Reconstruction with Global-to-local Diffusion Model
链接: https://arxiv.org/abs/2411.03723
作者: Yu Guan,Kunlong Zhang,Qi Qi,Dong Wang,Ziwen Ke,Shaoyu Wang,Dong Liang,Qiegen Liu
关键词-EN: magnetic resonance imaging, recently demonstrated considerable, demonstrated considerable advancement, resonance imaging, dynamic MRI
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 9 figures
点击查看摘要
Abstract:Diffusion models have recently demonstrated considerable advancement in the generation and reconstruction of magnetic resonance imaging (MRI) data. These models exhibit great potential in handling unsampled data and reducing noise, highlighting their promise as generative models. However, their application in dynamic MRI remains relatively underexplored. This is primarily due to the substantial amount of fully-sampled data typically required for training, which is difficult to obtain in dynamic MRI due to its spatio-temporal complexity and high acquisition costs. To address this challenge, we propose a dynamic MRI reconstruction method based on a time-interleaved acquisition scheme, termed the Global-to-local Diffusion Model. Specifically, fully encoded full-resolution reference data are constructed by merging under-sampled k-space data from adjacent time frames, generating two distinct bulk training datasets for global and local models. The global-to-local diffusion framework alternately optimizes global information and local image details, enabling zero-shot reconstruction. Extensive experiments demonstrate that the proposed method performs well in terms of noise reduction and detail preservation, achieving reconstruction quality comparable to that of supervised approaches.
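时间交错采样下“合并相邻帧欠采样 k 空间以构造全编码参考数据”这一步,可以用如下草图示意(重叠采样点取平均;这是对合并思路的一般化演示,具体构造以论文为准):

```python
import numpy as np

def merge_adjacent_kspace(frames_k, masks):
    """Pool sampled k-space lines from adjacent time frames into one reference.

    frames_k: (T, H, W) complex under-sampled k-space per frame,
    masks:    (T, H, W) binary sampling masks.
    """
    acc = (frames_k * masks).sum(axis=0)
    cnt = masks.sum(axis=0)
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)  # average overlaps
```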
[CV-45] ADMIRE: a locally adaptive single-image non-uniformity correction and denoising algorithm: application to uncooled IR camera
链接: https://arxiv.org/abs/2411.03615
作者: Yohann Tendero,Jerome Gilles
关键词-EN: uncooled infrared-type images, uncooled infrared-type, infrared-type images, Abstract, method
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:We propose a new way to correct for the non-uniformity (NU) and the noise in uncooled infrared-type images. This method works on static images, and needs no registration, no camera motion, and no model for the non-uniformity. The proposed method uses a hybrid scheme including an automatic locally-adaptive contrast adjustment and a state-of-the-art image denoising method. It can correct a fully non-linear NU and the noise efficiently using only one image. We compared it with total variation on real raw and simulated NU infrared images. The strength of this approach lies in its simplicity and low computational cost. It needs no test pattern or calibration and produces no “ghost artefacts”.
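为了直观理解单幅图像非均匀性校正的问题设定,下面给出一个远比 ADMIRE 简单的列条纹去除基线:用每列的中位数与其平滑趋势之差估计列偏置并扣除。它只能处理线性的列状固定模式噪声,仅用于说明问题,完全不是论文的局部自适应方法:

```python
import numpy as np

def column_destripe(img, win=31):
    """Naive single-image destriping for column-structured fixed-pattern NU."""
    col_level = np.median(img, axis=0)                  # per-column level, (W,)
    kernel = np.ones(win) / win
    trend = np.convolve(col_level, kernel, mode="same") # slow scene trend
    bias = col_level - trend                            # residual = column bias
    return img - bias[None, :]
```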
[CV-46] Enhancing Weakly Supervised Semantic Segmentation for Fibrosis via Controllable Image Generation
链接: https://arxiv.org/abs/2411.03551
作者: Zhiling Yue,Yingying Fang,Liutao Yang,Nikhil Baid,Simon Walsh,Guang Yang
关键词-EN: Fibrotic Lung Disease, severe condition marked, Fibrotic Lung, Lung Disease, lung stiffening
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Fibrotic Lung Disease (FLD) is a severe condition marked by lung stiffening and scarring, leading to respiratory decline. High-resolution computed tomography (HRCT) is critical for diagnosing and monitoring FLD; however, fibrosis appears as irregular, diffuse patterns with unclear boundaries, leading to high inter-observer variability and time-intensive manual annotation. To tackle this challenge, we propose DiffSeg, a novel weakly supervised semantic segmentation (WSSS) method that uses image-level annotations to generate pixel-level fibrosis segmentation, reducing the need for fine-grained manual labeling. Additionally, our DiffSeg incorporates a diffusion-based generative model to synthesize HRCT images with different levels of fibrosis from healthy slices, enabling the generation of fibrosis-injected slices paired with their fibrosis locations. Experiments indicate that our method significantly improves the accuracy of pseudo masks generated by existing WSSS methods, greatly reducing the complexity of manual labeling and enhancing the consistency of the generated masks.
[CV-47] TopoTxR: A topology-guided deep convolutional network for breast parenchyma learning on DCE-MRIs
链接: https://arxiv.org/abs/2411.03464
作者: Fan Wang,Zhilin Zou,Nicole Sakla,Luke Partyka,Nil Rawal,Gagandeep Singh,Wei Zhao,Haibin Ling,Chuan Huang,Prateek Prasanna,Chao Chen
关键词-EN: dynamic contrast-enhanced magnetic, contrast-enhanced magnetic resonance, challenging task owing, magnetic resonance imaging, breast parenchymal structures
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 8 figures, 8 tables, accepted by Medical Image Analysis ( this https URL )
点击查看摘要
Abstract:Characterization of breast parenchyma in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is a challenging task owing to the complexity of underlying tissue structures. Existing quantitative approaches, like radiomics and deep learning models, lack explicit quantification of intricate and subtle parenchymal structures, including fibroglandular tissue. To address this, we propose a novel topological approach that explicitly extracts multi-scale topological structures to better approximate breast parenchymal structures, and then incorporates these structures into a deep-learning-based prediction model via an attention mechanism. Our topology-informed deep learning model, TopoTxR, leverages topology to provide enhanced insights into tissues critical for disease pathophysiology and treatment response. We empirically validate TopoTxR using the VICTRE phantom breast dataset, showing that the topological structures extracted by our model effectively approximate the breast parenchymal structures. We further demonstrate TopoTxR’s efficacy in predicting response to neoadjuvant chemotherapy. Our qualitative and quantitative analyses suggest differential topological behavior of breast tissue in treatment-naïve imaging, in patients who respond favorably to therapy, i.e., achieve pathological complete response (pCR), versus those who do not. In a comparative analysis with several baselines on the publicly available I-SPY 1 dataset (N=161, including 47 patients with pCR and 114 without) and the Rutgers proprietary dataset (N=120, with 69 patients achieving pCR and 51 not), TopoTxR demonstrates a notable improvement, achieving a 2.6% increase in accuracy and a 4.6% enhancement in AUC compared to the state-of-the-art method.
[CV-48] BOston Neonatal Brain Injury Data for Hypoxic Ischemic Encephalopathy (BONBID-HIE): II. 2-year Neurocognitive Outcome and NICU Outcome MICCAI2024
链接: https://arxiv.org/abs/2411.03456
作者: Rina Bao,Yangming Ou
关键词-EN: Hypoxic Ischemic Encephalopathy, Hypoxic Ischemic, Ischemic Encephalopathy, affects approximately, newborns globally
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Data description for BONBID-HIE 2024 Challenge on MICCAI 2024
点击查看摘要
Abstract:Hypoxic Ischemic Encephalopathy (HIE) affects approximately 1-5/1000 newborns globally and leads to adverse neurocognitive outcomes in 30% to 50% of cases by two years of age. Despite therapeutic advances with Therapeutic Hypothermia (TH), prognosis remains challenging, highlighting the need for improved biomarkers. This paper introduces the second release of the Boston Neonatal Brain Injury Dataset for Hypoxic-Ischemic Encephalopathy (BONBID-HIE), an open-source, comprehensive MRI and clinical dataset featuring 237 patients, including NICU outcomes and 2-year neurocognitive outcomes from Massachusetts General Hospital and Boston Children’s Hospital.
[CV-49] Interpretable Embeddings for Segmentation-Free Single-Cell Analysis in Multiplex Imaging
链接: https://arxiv.org/abs/2411.03341
作者: Simon Gutwein,Daria Lazic,Thomas Walter,Sabine Taschner-Mandl,Roxane Licandro
关键词-EN: providing valuable insights, multiple biological markers, Multiplex Imaging, subcellular resolution, providing valuable
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 Pages, 5 Figures, Submitted to ISBI 2025
点击查看摘要
Abstract:Multiplex Imaging (MI) enables the simultaneous visualization of multiple biological markers in separate imaging channels at subcellular resolution, providing valuable insights into cell-type heterogeneity and spatial organization. However, current computational pipelines rely on cell segmentation algorithms, which require laborious fine-tuning and can introduce downstream errors due to inaccurate single-cell representations. We propose a segmentation-free deep learning approach that leverages grouped convolutions to learn interpretable embedded features from each imaging channel, enabling robust cell-type identification without manual feature selection. Validated on an Imaging Mass Cytometry dataset of 1.8 million cells from neuroblastoma patients, our method enables the accurate identification of known cell types, showcasing its scalability and suitability for high-dimensional MI data.
机器学习
[LG-0] Weighted Sobolev Approximation Rates for Neural Networks on Unbounded Domains
链接: https://arxiv.org/abs/2411.04108
作者: Ahmed Abdeljawad,Thomas Dittrich
关键词-EN: spectral Barron space, weighted Sobolev spaces, spectral Barron, Barron space, Sobolev spaces
类目: Machine Learning (cs.LG); Functional Analysis (math.FA); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In this work, we consider the approximation capabilities of shallow neural networks in weighted Sobolev spaces for functions in the spectral Barron space. The existing literature already covers several cases in which the spectral Barron space can be approximated well, i.e., without curse of dimensionality, by shallow networks with several different classes of activation functions. The limitations of the existing results mostly concern the error measures considered: the results are restricted to Sobolev spaces over a bounded domain. We here treat two cases that extend the existing results, namely the case of a bounded domain with Muckenhoupt weights, and the case where the domain is allowed to be unbounded and the weights are required to decay. We first present embedding results for the more general weighted Fourier-Lebesgue spaces in the weighted Sobolev spaces, and then we establish asymptotic approximation rates for shallow neural networks that come without curse of dimensionality.
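为便于理解摘要中反复出现的"谱 Barron 空间",下面给出文献中常见的一种定义作为示意;记号与光滑度参数 s 均为本文假设的写法,具体定义以论文原文为准:

```latex
% 谱 Barron 空间的一种常见定义(示意,记号为假设)
\mathcal{B}^{s}(\mathbb{R}^{d})
  := \Big\{ f \;:\; \|f\|_{\mathcal{B}^{s}}
  = \int_{\mathbb{R}^{d}} \big(1+|\xi|\big)^{s}\, \big|\hat{f}(\xi)\big| \,\mathrm{d}\xi
  < \infty \Big\}
```

经典结论是:浅层网络用 n 个神经元逼近此类函数时,L^2 误差可达 O(n^{-1/2}) 量级,且该速率与维数 d 无关(即"无维数灾难");该文把误差度量推广到带 Muckenhoupt 权或衰减权的加权 Sobolev 范数。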
[LG-1] A Comparative Study of Deep Reinforcement Learning for Crop Production Management
链接: https://arxiv.org/abs/2411.04106
作者: Joseph Balderas,Dong Chen,Yanbo Huang,Li Wang,Ren-Cang Li
关键词-EN: field environmental impact, stochastic processes involved, remains challenging due, crop management, field environmental
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 10 pages
点击查看摘要
Abstract:Crop production management is essential for optimizing yield and minimizing environmental impact on crop fields, yet it remains challenging due to the complex and stochastic processes involved. Recently, researchers have turned to machine learning to address these complexities. Specifically, reinforcement learning (RL), a cutting-edge approach designed to learn optimal decision-making strategies through trial and error in dynamic environments, has emerged as a promising tool for developing adaptive crop management policies. RL models aim to optimize long-term rewards by continuously interacting with the environment, making them well-suited for tackling the uncertainties and variability inherent in crop management. Studies have shown that RL can generate crop management policies that compete with, and even outperform, expert-designed policies within simulation-based crop models. In the gym-DSSAT crop model environment, one of the most widely used simulators for crop management, proximal policy optimization (PPO) and deep Q-networks (DQN) have shown promising results. However, these methods have not yet been systematically evaluated under identical conditions. In this study, we evaluated PPO and DQN against static baseline policies across three RL tasks provided by the gym-DSSAT environment: fertilization, irrigation, and mixed management. To ensure a fair comparison, we used consistent default parameters, identical reward functions, and the same environment settings. Our results indicate that PPO outperforms DQN in fertilization and irrigation tasks, while DQN excels in the mixed management task. This comparative analysis provides critical insights into the strengths and limitations of each approach, advancing the development of more effective RL-based crop management strategies.
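下面用一个极简的评测循环示意摘要所强调的"公平对比":同一环境、同一奖励函数、同一批随机种子。其中环境 ID "DSSAT-Fertilization-v0" 为假设写法,gym-DSSAT 的真实接口以其官方文档为准:

```python
# 示意:在相同设置下评测不同策略(环境 ID 为假设,非 gym-DSSAT 官方名称)
import gymnasium as gym
import numpy as np

def evaluate(policy_fn, env_id="DSSAT-Fertilization-v0", episodes=10, seed=0):
    """policy_fn: 观测 -> 动作;返回固定种子下的平均回报。"""
    env = gym.make(env_id)
    returns = []
    for ep in range(episodes):
        obs, _ = env.reset(seed=seed + ep)      # 各策略共用同一批种子
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy_fn(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns))

# 用法示意(ppo_policy、dqn_policy 为假设已训练好的策略):
# print(evaluate(ppo_policy), evaluate(dqn_policy))
```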
[LG-2] Interpretable and Efficient Data-driven Discovery and Control of Distributed Systems
链接: https://arxiv.org/abs/2411.04098
作者: Florian Wolf,Nicolò Botteghi,Urban Fasel,Andrea Manzoni
关键词-EN: Effectively controlling systems, Sciences and Engineering, Applied Sciences, Partial Differential Equations, Effectively controlling
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Effectively controlling systems governed by Partial Differential Equations (PDEs) is crucial in several fields of Applied Sciences and Engineering. These systems usually pose significant challenges for conventional control schemes due to their nonlinear dynamics, partial observability, high-dimensionality once discretized, distributed nature, and the requirement for low-latency feedback control. Reinforcement Learning (RL), particularly Deep RL (DRL), has recently emerged as a promising control paradigm for such systems, demonstrating exceptional capabilities in managing high-dimensional, nonlinear dynamics. However, DRL faces challenges including sample inefficiency, robustness issues, and an overall lack of interpretability. To address these issues, we propose a data-efficient, interpretable, and scalable Dyna-style Model-Based RL framework for PDE control, combining the Sparse Identification of Nonlinear Dynamics with Control (SINDy-C) algorithm and an autoencoder (AE) framework for dimensionality reduction of the PDE states and actions. This novel approach enables fast rollouts, reducing the need for extensive environment interactions, and provides an interpretable latent space representation of the PDE forward dynamics. We validate our method on two PDE problems describing fluid flows - namely, the 1D Burgers equation and 2D Navier-Stokes equations - comparing it against a model-free baseline, and carrying out an extensive analysis of the learned dynamics.
[LG-3] Problem Space Transformations for Generalisation in Behavioural Cloning
链接: https://arxiv.org/abs/2411.04056
作者: Kiran Doshi,Marco Bagatella,Stelian Coros
关键词-EN: driven significant progress, networks has driven, driven significant, significant progress, robotic manipulation
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The combination of behavioural cloning and neural networks has driven significant progress in robotic manipulation. As these algorithms may require a large number of demonstrations for each task of interest, they remain fundamentally inefficient in complex scenarios. This issue is aggravated when the system is treated as a black-box, ignoring its physical properties. This work characterises widespread properties of robotic manipulation, such as pose equivariance and locality. We empirically demonstrate that transformations arising from each of these properties allow neural policies trained with behavioural cloning to better generalise to out-of-distribution problem instances.
[LG-4] Stepping Forward on the Last Mile
链接: https://arxiv.org/abs/2411.04036
作者: Chen Feng,Shaojie Zhuo,Xiaopeng Zhang,Ramchalam Kinattinkara Ramakrishnan,Zhaocong Yuan,Andrew Zou Li
关键词-EN: Continuously adapting pre-trained, Continuously adapting, resource constrained edge, constrained edge devices, adapting pre-trained models
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Continuously adapting pre-trained models to local data on resource-constrained edge devices is the "last mile" for model deployment. However, as models increase in size and depth, backpropagation requires a large amount of memory, which becomes prohibitive for edge devices. In addition, most existing low-power neural processing engines (e.g., NPUs, DSPs, MCUs, etc.) are designed as fixed-point inference accelerators, without training capabilities. Forward gradients, solely based on directional derivatives computed from two forward calls, have been recently used for model training, with substantial savings in computation and memory. However, the performance of quantized training with fixed-point forward gradients remains unclear. In this paper, we investigate the feasibility of on-device training using fixed-point forward gradients, by conducting comprehensive experiments across a variety of deep learning benchmark tasks in both vision and audio domains. We propose a series of algorithm enhancements that further reduce the memory footprint and the accuracy gap compared to backpropagation. We further explore empirically how training with forward gradients navigates the loss landscape. Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach.
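摘要中的"前向梯度"只需两次前向计算即可得到方向导数,无需反向传播。下面是一个基于有限差分的 NumPy 极简草图,仅用于说明思想,与论文的定点量化实现无关:

```python
# 示意:前向梯度——两次前向计算估计方向导数,再乘回扰动方向
import numpy as np

def forward_gradient(loss_fn, theta, eps=1e-4, rng=None):
    """loss_fn: 参数向量 -> 标量损失;返回 (方向导数) * v 作为梯度估计。"""
    rng = rng or np.random.default_rng(0)
    v = rng.standard_normal(theta.shape)                    # 随机扰动方向
    d = (loss_fn(theta + eps * v) - loss_fn(theta)) / eps   # 两次前向计算
    return d * v    # 对 v ~ N(0, I) 有 E[(g·v)v] = g,即无偏估计

# 用法示意:theta -= lr * forward_gradient(loss, theta)
```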
[LG-5] Multi-Scale and Multimodal Species Distribution Modeling ECCV2024
链接: https://arxiv.org/abs/2411.04016
作者: Nina van Tiel,Robin Zbinden,Emanuele Dalsasso,Benjamin Kellenberger,Loïc Pellissier,Devis Tuia
关键词-EN: Species distribution models, relating occurrence data, Species distribution, aim to predict, relating occurrence
类目: Machine Learning (cs.LG)
*备注: Published at the CV4Ecology workshop at ECCV 2024 ( this https URL )
点击查看摘要
Abstract:Species distribution models (SDMs) aim to predict the distribution of species by relating occurrence data with environmental variables. Recent applications of deep learning to SDMs have enabled new avenues, specifically the inclusion of spatial data (environmental rasters, satellite images) as model predictors, allowing the model to consider the spatial context around each species’ observations. However, the appropriate spatial extent of the images is not straightforward to determine and may affect the performance of the model, as scale is recognized as an important factor in SDMs. We develop a modular structure for SDMs that allows us to test the effect of scale in both single- and multi-scale settings. Furthermore, our model enables different scales to be considered for different modalities, using a late fusion approach. Results on the GeoLifeCLEF 2023 benchmark indicate that considering multimodal data and learning multi-scale representations leads to more accurate models.
[LG-6] kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers
链接: https://arxiv.org/abs/2411.04013
作者: Themistoklis Haris
关键词-EN: long sequences due, Transformers face challenges, face challenges, challenges with long, long sequences
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 30 pages, 12 figures
点击查看摘要
Abstract:Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like k-Nearest-Neighbor (kNN) attention have been introduced [Roy, Saffar, Vaswani, Grangier, 2021], enabling each token to attend to only its k closest tokens. While kNN attention has shown empirical success in making Transformers more efficient, its exact approximation guarantees have not been theoretically analyzed. In this work, we establish a theoretical framework for kNN attention, reformulating self-attention as expectations over softmax distributions and leveraging lazy Gumbel sampling [Mussmann, Levy, Ermon, 2017] with kNN indices for efficient approximation. Building on this framework, we also propose novel sub-quadratic algorithms that approximate self-attention gradients by leveraging efficient sampling techniques, such as Markov Chain-based estimation. Finally, we demonstrate the practical effectiveness of these algorithms through empirical experiments, showcasing their benefits in both training and inference.
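下面用 NumPy 给出 kNN 注意力的一个极简示意:每个 query 只在得分最高的 k 个 key 上做 softmax。注意这里为演示方便仍先计算了稠密得分,实际实现依赖近邻索引来获得加速;论文关注的是这一近似的理论保证:

```python
# 示意:kNN 注意力——每个 query 仅对其 top-k 的 key 归一化并聚合 value
import numpy as np

def knn_attention(Q, K, V, k=8):
    """Q:(n,d), K:(m,d), V:(m,dv);要求 k <= m。"""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # 缩放点积 (n, m)
    idx = np.argpartition(-scores, k - 1, axis=-1)[:, :k]    # 每行 top-k 下标
    top = np.take_along_axis(scores, idx, axis=-1)           # (n, k)
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                       # 仅在 top-k 上 softmax
    return np.einsum('nk,nkd->nd', w, V[idx])                # 加权聚合 value
```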
[LG-7] Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning NEURIPS2024
链接: https://arxiv.org/abs/2411.03978
作者: Jiawei Yao,Qi Qian,Juhua Hu
关键词-EN: Multiple clustering, Multiple clustering aims, aims to discover, discover various latent, latent structures
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:Multiple clustering aims to discover various latent structures of data from different aspects. Deep multiple clustering methods have achieved remarkable performance by exploiting complex patterns and relationships in data. However, existing works struggle to flexibly adapt to diverse user-specific needs in data grouping, which may require manual understanding of each clustering. To address these limitations, we introduce Multi-Sub, a novel end-to-end multiple clustering approach that incorporates a multi-modal subspace proxy learning framework in this work. Utilizing the synergistic capabilities of CLIP and GPT-4, Multi-Sub aligns textual prompts expressing user preferences with their corresponding visual representations. This is achieved by automatically generating proxy words from large language models that act as subspace bases, thus allowing for the customized representation of data in terms specific to the user’s interests. Our method consistently outperforms existing baselines across a broad set of datasets in visual multiple clustering tasks. Our code is available at this https URL.
[LG-8] GUIDE-VAE: Advancing Data Generation with User Information and Pattern Dictionaries
链接: https://arxiv.org/abs/2411.03936
作者: Kutay Bölat,Simon Tindemans
关键词-EN: science and engineering, prominent in science, Machine Learning, data, GUIDE-VAE
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Generative modelling of multi-user datasets has become prominent in science and engineering. Generating a data point for a given user requires employing user information, and conventional generative models, including variational autoencoders (VAEs), often ignore that. This paper introduces GUIDE-VAE, a novel conditional generative model that leverages user embeddings to generate user-guided data. By allowing the model to benefit from shared patterns across users, GUIDE-VAE enhances performance in multi-user settings, even under significant data imbalance. In addition to integrating user information, GUIDE-VAE incorporates a pattern dictionary-based covariance composition (PDCC) to improve the realism of generated samples by capturing complex feature dependencies. While user embeddings drive performance gains, PDCC addresses common issues such as noise and over-smoothing typically seen in VAEs. The proposed GUIDE-VAE was evaluated on a multi-user smart meter dataset characterized by substantial data imbalance across users. Quantitative results show that GUIDE-VAE performs effectively in both synthetic data generation and missing record imputation tasks, while qualitative evaluations reveal that GUIDE-VAE produces more plausible and less noisy data. These results establish GUIDE-VAE as a promising tool for controlled, realistic data generation in multi-user datasets, with potential applications across various domains requiring user-informed modelling.
[LG-9] Quantum Algorithm for Sparse Online Learning with Truncated Gradient Descent
链接: https://arxiv.org/abs/2411.03925
作者: Debbie Lim,Yixian Qiu,Patrick Rebentrost,Qisheng Wang
关键词-EN: Support Vector Machine, computer science community, Vector Machine, Support Vector, science community
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 31 pages, 1 table, 4 algorithms
点击查看摘要
Abstract:Logistic regression, the Support Vector Machine (SVM), and least squares are well-studied methods in the statistical and computer science community, with various practical applications. High-dimensional data arriving on a real-time basis makes the design of online learning algorithms that produce sparse solutions essential. The seminal work of Langford, Li, and Zhang (2009) developed a method to obtain sparsity via truncated gradient descent, showing a near-optimal online regret bound. Based on this method, we develop a quantum sparse online learning algorithm for logistic regression, the SVM, and least squares. Given efficient quantum access to the inputs, we show that a quadratic speedup in the time complexity with respect to the dimension of the problem is achievable, while maintaining a regret of O(1/√T), where T is the number of iterations.
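论文所基于的 Langford, Li, and Zhang (2009) 截断梯度法,其经典(非量子)形式大致如下:每步在线梯度更新后,把绝对值不超过阈值的权重分量向 0 收缩。此处为单步截断的简化草图(原文每 K 步截断一次),变量名为假设:

```python
# 示意:截断梯度下降——在线逻辑回归 + 向 0 收缩以诱导稀疏(经典版,非量子算法)
import numpy as np

def truncate(w, alpha, theta):
    """绝对值不超过 theta 的分量向 0 收缩 alpha,其余分量不动。"""
    out = w.copy()
    small = np.abs(w) <= theta
    out[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - alpha, 0.0)
    return out

def sparse_online_step(w, x, y, eta=0.1, g=0.01, theta=1.0):
    """一步在线更新;y 取 0/1,g 控制稀疏强度。"""
    p = 1.0 / (1.0 + np.exp(-w @ x))        # sigmoid 预测
    w = w - eta * (p - y) * x               # 在线梯度步
    return truncate(w, eta * g, theta)      # 截断步
```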
[LG-10] Game-Theoretic Machine Unlearning: Mitigating Extra Privacy Leakage
链接: https://arxiv.org/abs/2411.03914
作者: Hengzhu Liu,Tianqing Zhu,Lefeng Zhang,Ping Xiong
关键词-EN: providers encounter increasing, encounter increasing privacy, data providers encounter, machine learning technologies, providers encounter
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:With the extensive use of machine learning technologies, data providers encounter increasing privacy risks. Recent legislation, such as GDPR, obligates organizations to remove requested data and its influence from a trained model. Machine unlearning is an emerging technique designed to enable machine learning models to erase users' private information. Although several efficient machine unlearning schemes have been proposed, these methods still have limitations. First, removing the contributions of partial data may lead to model performance degradation. Second, discrepancies between the original and generated unlearned models can be exploited by attackers to obtain the target sample's information, resulting in additional privacy leakage risks. To address the above challenges, we propose a game-theoretic machine unlearning algorithm that simulates the competitive relationship between unlearning performance and privacy protection. This algorithm comprises unlearning and privacy modules. The unlearning module possesses a loss function composed of model distance and classification error, which is used to derive the optimal strategy. The privacy module aims to make it difficult for an attacker to infer membership information from the unlearned data, thereby reducing the privacy leakage risk during the unlearning process. Additionally, experimental results on real-world datasets demonstrate this game-theoretic unlearning algorithm's effectiveness and its ability to generate an unlearned model with performance similar to that of the retrained one while mitigating extra privacy leakage risks.
[LG-11] Retentive Neural Quantum States: Efficient Ansätze for Ab Initio Quantum Chemistry
链接: https://arxiv.org/abs/2411.03900
作者: Oliver Knitter,Dan Zhao,James Stokes,Martin Ganahl,Stefan Leichenauer,Shravan Veerapaneni
关键词-EN: Monte Carlo methods, variational Monte Carlo, Monte Carlo, quantum-inspired deep learning, Neural-network quantum states
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Quantum Physics (quant-ph)
*备注: 16 pages, 1 figure, to be submitted for peer-reviewed publication
点击查看摘要
Abstract:Neural-network quantum states (NQS) have emerged as a powerful application of quantum-inspired deep learning for variational Monte Carlo methods, offering a competitive alternative to existing techniques for identifying ground states of quantum problems. A significant advancement toward improving the practical scalability of NQS has been the incorporation of autoregressive models, most recently transformers, as variational ansätze. Transformers learn sequence information with greater expressiveness than recurrent models, but at the cost of increased time complexity with respect to sequence length. We explore the use of the retentive network (RetNet), a recurrent alternative to transformers, as an ansatz for solving electronic ground state problems in ab initio quantum chemistry. Unlike transformers, RetNets overcome this time complexity bottleneck by processing data in parallel during training, and recurrently during inference. We give a simple computational cost estimate of the RetNet and directly compare it with similar estimates for transformers, establishing a clear threshold ratio of problem-to-model size past which the RetNet's time complexity outperforms that of the transformer. Though this efficiency can come at the expense of decreased expressiveness relative to the transformer, we overcome this gap through training strategies that leverage the autoregressive structure of the model – namely, variational neural annealing. Our findings support the RetNet as a means of improving the time complexity of NQS without sacrificing accuracy. We provide further evidence that the ablative improvements of neural annealing extend beyond the RetNet architecture, suggesting it would serve as an effective general training strategy for autoregressive NQS.
[LG-12] Calibrating for the Future:Enhancing Calorimeter Longevity with Deep Learning
链接: https://arxiv.org/abs/2411.03891
作者: S. Ali,A.S. Ryzhikov,D.A. Derkach,F.D. Ratnikov,V.O. Bocharnikov
关键词-EN: realm of high-energy, high-energy physics, Wasserstein GAN inspired, particle physics experiments, longevity of calorimeters
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the realm of high-energy physics, the longevity of calorimeters is paramount. Our research introduces a deep learning strategy to refine the calibration process of calorimeters used in particle physics experiments. We develop a Wasserstein GAN-inspired methodology that adeptly calibrates the misalignment in calorimeter data due to aging or other factors. Leveraging the Wasserstein distance for loss calculation, this innovative approach requires a significantly lower number of events and resources to achieve high precision, minimizing absolute errors effectively. Our work extends the operational lifespan of calorimeters, thereby ensuring the accuracy and reliability of data in the long term, and is particularly beneficial for experiments where data integrity is crucial for scientific discovery.
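摘要中"以 Wasserstein 距离为损失来标定老化响应"的思路,可用如下一维玩具例子示意:在若干候选标定因子中,选使标定后分布与参考分布的 Wasserstein 距离最小者。数据与单因子标定方式均为假设,非论文实现:

```python
# 示意:最小化一维 Wasserstein 距离来拟合标定因子(玩具数据)
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(100.0, 5.0, 5000)     # 参考(老化前)响应
aged = rng.normal(92.0, 5.0, 5000)           # 老化后整体偏低的响应

scales = np.linspace(0.9, 1.3, 401)
losses = [wasserstein_distance(reference, s * aged) for s in scales]
best = scales[int(np.argmin(losses))]
print(f"标定因子 ~ {best:.3f}")              # 期望约 100/92 = 1.087
```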
[LG-13] EXPLORA: Efficient Exemplar Subset Selection for Complex Reasoning
链接: https://arxiv.org/abs/2411.03877
作者: Kiran Purohit,Venktesh V,Raghuram Devalla,Krishna Mohan Yerragorla,Sourangshu Bhattacharya,Avishek Anand
关键词-EN: Answering reasoning-based complex, Answering reasoning-based, reasoning-based complex questions, including tables, hybrid sources
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Answering reasoning-based complex questions over text and hybrid sources, including tables, is a challenging task. Recent advances in large language models (LLMs) have enabled in-context learning (ICL), allowing LLMs to acquire proficiency in a specific task using only a few demonstration samples (exemplars). A critical challenge in ICL is the selection of optimal exemplars, which can be either task-specific (static) or test-example-specific (dynamic). Static exemplars provide faster inference times and increased robustness across a distribution of test examples. In this paper, we propose an algorithm for static exemplar subset selection for complex reasoning tasks. We introduce EXPLORA, a novel exploration method designed to estimate the parameters of the scoring function, which evaluates exemplar subsets without incorporating confidence information. EXPLORA significantly reduces the number of LLM calls to ~11% of those required by state-of-the-art methods and achieves a substantial performance improvement of 12.24%. We open-source our code and data (this https URL).
[LG-14] Large Generative Model-assisted Talking-face Semantic Communication System
链接: https://arxiv.org/abs/2411.03876
作者: Feibo Jiang,Siwei Tu,Li Dong,Cunhua Pan,Jiangzhou Wang,Xiaohu You
关键词-EN: generative Artificial Intelligence, Artificial Intelligence, Talking-face Semantic Communication, generative Artificial, Semantic Communication
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rapid development of generative Artificial Intelligence (AI) continually unveils the potential of Semantic Communication (SemCom). However, current talking-face SemCom systems still encounter challenges such as low bandwidth utilization, semantic ambiguity, and diminished Quality of Experience (QoE). This study introduces a Large Generative Model-assisted Talking-face Semantic Communication (LGM-TSC) System tailored for the talking-face video communication. Firstly, we introduce a Generative Semantic Extractor (GSE) at the transmitter based on the FunASR model to convert semantically sparse talking-face videos into texts with high information density. Secondly, we establish a private Knowledge Base (KB) based on the Large Language Model (LLM) for semantic disambiguation and correction, complemented by a joint knowledge base-semantic-channel coding scheme. Finally, at the receiver, we propose a Generative Semantic Reconstructor (GSR) that utilizes BERT-VITS2 and SadTalker models to transform text back into a high-QoE talking-face video matching the user’s timbre. Simulation results demonstrate the feasibility and effectiveness of the proposed LGM-TSC system.
[LG-15] Efficient Message Passing Architecture for GCN Training on HBM-based FPGAs with Orthogonal Topology On-Chip Networks
链接: https://arxiv.org/abs/2411.03857
作者: Qizhe Wu,Letian Zhao,Yuchen Gui,Huawen Liang,Xiaotian Wang
关键词-EN: deep learning models, Graph Convolutional Networks, Convolutional Networks, Graph Convolutional, deep learning
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: This paper has been accepted for 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays(FPGA’24) as poster
点击查看摘要
Abstract:Graph Convolutional Networks (GCNs) are state-of-the-art deep learning models for representation learning on graphs. However, the efficient training of GCNs is hampered by constraints in memory capacity and bandwidth, compounded by the irregular data flow that results in communication bottlenecks. To address these challenges, we propose a message-passing architecture that leverages NUMA-based memory access properties and employs a parallel multicast routing algorithm based on a 4-D hypercube network within the accelerator for efficient message passing in graphs. Additionally, we have re-engineered the backpropagation algorithm specific to GCNs within our proposed accelerator. This redesign strategically mitigates the memory demands prevalent during the training phase and diminishes the computational overhead associated with the transposition of extensive matrices. Compared to the state-of-the-art HP-GNN architecture, we achieved a performance improvement of 1.03×~1.81×.
[LG-16] Flexible task abstractions emerge in linear networks with fast and bounded units
链接: https://arxiv.org/abs/2411.03840
作者: Kai Sandbrink,Jan P. Bauer,Alexandra M. Proca,Andrew M. Saxe,Christopher Summerfield,Ali Hummos
关键词-EN: data distribution shifts, dynamic environments changing, environments changing, changing at arbitrary, task
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:Animals survive in dynamic environments changing at arbitrary timescales, but such data distribution shifts are a challenge to neural networks. To adapt to change, neural systems may change a large number of parameters, which is a slow process involving forgetting past information. In contrast, animals leverage distribution changes to segment their stream of experience into tasks and associate them with internal task abstracts. Animals can then respond flexibly by selecting the appropriate task abstraction. However, how such flexible task abstractions may arise in neural systems remains unknown. Here, we analyze a linear gated network where the weights and gates are jointly optimized via gradient descent, but with neuron-like constraints on the gates including a faster timescale, nonnegativity, and bounded activity. We observe that the weights self-organize into modules specialized for tasks or sub-tasks encountered, while the gates layer forms unique representations that switch the appropriate weight modules (task abstractions). We analytically reduce the learning dynamics to an effective eigenspace, revealing a virtuous cycle: fast adapting gates drive weight specialization by protecting previous knowledge, while weight specialization in turn increases the update rate of the gating layer. Task switching in the gating layer accelerates as a function of curriculum block size and task training, mirroring key findings in cognitive neuroscience. We show that the discovered task abstractions support generalization through both task and subtask composition, and we extend our findings to a non-linear network switching between two tasks. Overall, our work offers a theory of cognitive flexibility in animals as arising from joint gradient descent on synaptic and neural gating in a neural network architecture.
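摘要描述的机制可以用一个玩具草图体会:权重与门控联合做梯度下降,但门控学习率更大(快时标)且被约束为非负有界。维度与学习率均为假设,仅演示这一两时标结构:

```python
# 示意:慢时标权重 + 快时标、非负有界门控的联合梯度下降(玩具线性门控网络)
import numpy as np

rng = np.random.default_rng(0)
d, n_modules = 8, 2
W = rng.standard_normal((n_modules, d)) * 0.1    # 慢变量:每个模块一组权重
g = np.ones(n_modules) / n_modules               # 快变量:门控

def step(x, y, W, g, lr_w=0.01, lr_g=0.5):
    pred = g @ (W @ x)                   # 门控加权的线性预测
    err = pred - y
    W -= lr_w * err * np.outer(g, x)     # 慢时标:权重更新
    g -= lr_g * err * (W @ x)            # 快时标:门控更新(学习率更大)
    g = np.clip(g, 0.0, 1.0)             # 非负且有界的神经元式约束
    return W, g
```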
[LG-17] Hybrid Transfer Reinforcement Learning: Provable Sample Efficiency from Shifted-Dynamics Data
链接: https://arxiv.org/abs/2411.03810
作者: Chengrui Qu,Laixi Shi,Kishan Panaganti,Pengcheng You,Adam Wierman
关键词-EN: typically requires high-stakes, Online Reinforcement learning, Reinforcement learning, requires high-stakes online, high-stakes online interaction
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Online Reinforcement learning (RL) typically requires high-stakes online interaction data to learn a policy for a target task. This prompts interest in leveraging historical data to improve sample efficiency. The historical data may come from outdated or related source environments with different dynamics. It remains unclear how to effectively use such data in the target task to provably enhance learning and sample efficiency. To address this, we propose a hybrid transfer RL (HTRL) setting, where an agent learns in a target environment while accessing offline data from a source environment with shifted dynamics. We show that – without information on the dynamics shift – general shifted-dynamics data, even with subtle shifts, does not reduce sample complexity in the target environment. However, with prior information on the degree of the dynamics shift, we design HySRL, a transfer algorithm that achieves problem-dependent sample complexity and outperforms pure online RL. Finally, our experimental results demonstrate that HySRL surpasses state-of-the-art online RL baseline.
[LG-18] On the Decomposition of Differential Game
链接: https://arxiv.org/abs/2411.03802
作者: Nanxiang Zhou,Jing Dong,Yutian Li,Baoxiang Wang
关键词-EN: potential part, vector potential part, potential, scalar potential part, differential games
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:To understand the complexity of the dynamics of learning in differential games, we decompose the game into components where the dynamics are well understood. One of the possible tools is Helmholtz's theorem, which can decompose a vector field into a potential and a harmonic component. This has been shown to be effective in finite and normal-form games. However, applying Helmholtz's theorem by connecting it with the Hodge theorem on R^n (which is the strategy space of differential games) is non-trivial due to the non-compactness of R^n. Bridging the dynamic-strategic disconnect through the Hodge/Helmholtz theorem in differential games was then left as an open problem (Letcher et al., 2019). In this work, we provide two decompositions of differential games to answer this question: the first as an exact scalar potential part, a near vector potential part, and a non-strategic part; the second as a near scalar potential part, an exact vector potential part, and a non-strategic part. We show that scalar potential games coincide with the potential games proposed by Monderer and Shapley (1996), where the gradient descent dynamic can successfully find the Nash equilibrium. For the vector potential game, we show that the individual gradient field is divergence-free, in which case the gradient descent dynamic may either be divergent or recurrent.
[LG-19] he N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation
链接: https://arxiv.org/abs/2411.03786
作者: Lawrence Stewart(SIERRA),Matthew Trager,Sujan Kumar Gonugondla,Stefano Soatto(UCLA-CS)
关键词-EN: Speculative decoding aims, negligible-cost draft strategies, URL this work, Speculative decoding, http URL
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely N-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-k predictions for small k. Based on this, we show that combinations of simple strategies can achieve significant inference speedups over different tasks. The overall performance is comparable to more complex methods, yet does not require expensive preprocessing or modification of the base model, and allows for seamless 'plug-and-play' integration into pipelines.
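摘要中"从上下文得到 N-gram、以可忽略代价起草 token"的思路可用如下草图示意,仅演示起草部分(基座模型的并行验证与多策略组合见原文),函数名与取后继的具体策略均为假设:

```python
# 示意:免学习的 N-gram 草稿策略(验证步骤由基座模型并行完成,此处从略)
from collections import defaultdict

def build_ngram_table(tokens, n=2):
    """统计上下文中每个 (n-1)-gram 之后出现过的下一个 token。"""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        table[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
    return table

def draft(tokens, table, n=2, length=4):
    """反复用 N-gram 表猜测后续 token;查不到就提前停止。"""
    out = list(tokens)
    for _ in range(length):
        key = tuple(out[-(n - 1):])
        if key not in table:
            break
        out.append(table[key][-1])       # 取最近一次出现的后继(一种简单策略)
    return out[len(tokens):]
```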
[LG-20] A Bayesian Approach to Data Point Selection
链接: https://arxiv.org/abs/2411.03768
作者: Xinnuo Xu,Minyoung Kim,Royson Lee,Brais Martinez,Timothy Hospedales
关键词-EN: Data point selection, uncurated training data, training data compared, deep learning due, acquiring uncurated training
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Data point selection (DPS) is becoming a critical topic in deep learning due to the ease of acquiring uncurated training data compared to the difficulty of obtaining curated or processed data. Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation, which is demanding in terms of memory and computation, and exhibits some theoretical defects regarding minibatches. Thus, we propose a novel Bayesian approach to DPS. We view the DPS problem as posterior inference in a novel Bayesian model where the posterior distributions of the instance-wise weights and the main neural network parameters are inferred under a reasonable prior and likelihood model. We employ stochastic gradient Langevin MCMC sampling to learn the main network and instance-wise weights jointly, ensuring convergence even with minibatches. Our update equation is comparable to the widely used SGD and much more efficient than existing BLO-based methods. Through controlled experiments in both the vision and language domains, we present the proof-of-concept. Additionally, we demonstrate that our method scales effectively to large language models and facilitates automated per-task optimization for instruction fine-tuning datasets.
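摘要提到用随机梯度朗之万动力学 (SGLD) 对主网络参数与逐样本权重做联合后验采样。其单步更新的标准形式如下(示意;步长与噪声尺度为标准 SGLD 写法,非论文的具体超参):

```python
# 示意:SGLD 单步更新——沿对数后验梯度上升,并加入与步长匹配的高斯噪声
import numpy as np

def sgld_step(theta, grad_log_post, eta=1e-4, rng=None):
    """theta: 参数;grad_log_post: 对数后验的(小批量)梯度估计。"""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(theta.shape)
    return theta + 0.5 * eta * grad_log_post + np.sqrt(eta) * noise
```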
[LG-21] Variational Inference on the Boolean Hypercube with the Quantum Entropy
链接: https://arxiv.org/abs/2411.03759
作者: Eliot Beyler(SIERRA),Francis Bach(SIERRA)
关键词-EN: pairwise Markov random, Markov random fields, Boolean hypercube, pairwise Markov, Markov random
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In this paper, we derive variational inference upper-bounds on the log-partition function of pairwise Markov random fields on the Boolean hypercube, based on quantum relaxations of the Kullback-Leibler divergence. We then propose an efficient algorithm to compute these bounds based on primal-dual optimization. An improvement of these bounds through the use of "hierarchies," similar to sum-of-squares (SoS) hierarchies, is proposed, and we present a greedy algorithm to select among these relaxations. We carry out extensive numerical experiments and compare with state-of-the-art methods for this inference problem.
[LG-22] Symbolic regression via MDLformer-guided search: from minimizing prediction error to minimizing description length
链接: https://arxiv.org/abs/2411.03753
作者: Zihan Yu,Jingtao Ding,Yong Li
关键词-EN: task discovering, heuristical search, search, symbolic regression method, target formula
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Symbolic regression, the task of discovering the formula that best fits the given data, is typically based on heuristic search. These methods usually update candidate formulas iteratively to obtain new ones with lower prediction errors. However, since formulas with similar function shapes may have completely different symbolic forms, the prediction error does not decrease monotonically as the search approaches the target formula, causing the low recovery rate of existing methods. To solve this problem, we propose a novel search objective based on the minimum description length, which reflects the distance from the target and decreases monotonically as the search approaches the correct form of the target formula. To estimate the minimum description length of any input data, we design a neural network, MDLformer, which enables robust and scalable estimation through large-scale training. With the MDLformer's output as the search objective, we implement a symbolic regression method, SR4MDL, that can effectively recover the correct mathematical form of the formula. Extensive experiments illustrate its excellent performance in recovering formulas from data. Our method successfully recovers around 50 formulas across two benchmark datasets comprising 133 problems, outperforming state-of-the-art methods by 43.92%.
[LG-23] Graph Neural Networks with Coarse- and Fine-Grained Division for Mitigating Label Sparsity and Noise
链接: https://arxiv.org/abs/2411.03744
作者: Shuangjie Li,Baoming Zhang,Jianqing Song,Gaoli Ruan,Chongjun Wang,Junyuan Xie
关键词-EN: Graph Neural Networks, Neural Networks, processing graph-structured data, gained considerable prominence, Graph Neural
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have gained considerable prominence in semi-supervised learning tasks in processing graph-structured data, primarily owing to their message-passing mechanism, which largely relies on the availability of clean labels. However, in real-world scenarios, labels on nodes of graphs are inevitably noisy and sparsely labeled, significantly degrading the performance of GNNs. Exploring robust GNNs for semi-supervised node classification in the presence of noisy and sparse labels remains a critical challenge. Therefore, we propose a novel Graph Neural Network with Coarse- and Fine-Grained Division for mitigating label sparsity and noise, namely GNN-CFGD. The key idea of GNN-CFGD is reducing the negative impact of noisy labels via coarse- and fine-grained division, along with graph reconstruction. Specifically, we first investigate the effectiveness of linking unlabeled nodes to cleanly labeled nodes, demonstrating that this approach is more effective in combating labeling noise than linking to potentially noisy labeled nodes. Based on this observation, we introduce a Gaussian Mixture Model (GMM) based on the memory effect to perform a coarse-grained division of the given labels into clean and noisy labels. Next, we propose a clean labels oriented link that connects unlabeled nodes to cleanly labeled nodes, aimed at mitigating label sparsity and promoting supervision propagation. Furthermore, to provide refined supervision for noisy labeled nodes and additional supervision for unlabeled nodes, we fine-grain the noisy labeled and unlabeled nodes into two candidate sets based on confidence, respectively. Extensive experiments on various datasets demonstrate the superior effectiveness and robustness of GNN-CFGD.
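摘要中"基于记忆效应、用 GMM 对给定标签做干净/噪声粗粒度划分"的常见做法,是对逐样本(逐节点)训练损失拟合两分量 GMM,把均值较小的分量视为干净标签,示意如下(0.5 阈值为假设):

```python
# 示意:两分量 GMM 按逐样本损失划分干净/噪声标签(干净标签通常损失更小)
import numpy as np
from sklearn.mixture import GaussianMixture

def coarse_split(per_sample_loss):
    loss = np.asarray(per_sample_loss).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(loss)
    clean_comp = int(np.argmin(gmm.means_.ravel()))   # 均值小的分量 = 干净
    prob_clean = gmm.predict_proba(loss)[:, clean_comp]
    return prob_clean > 0.5                           # True: 视为干净标签
```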
[LG-24] Human-in-the-Loop Feature Selection Using Interpretable Kolmogorov-Arnold Network-based Double Deep Q-Network
链接: https://arxiv.org/abs/2411.03740
作者: Md Abrar Jahin,M. F. Mridha,Nilanjan Dey
关键词-EN: increase computational demands, complex feature interactions, Feature selection, computational demands, Feature
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Applications (stat.AP)
*备注: Submitted to a journal under IEEE Transactions series
点击查看摘要
Abstract:Feature selection is critical for improving the performance and interpretability of machine learning models, particularly in high-dimensional spaces where complex feature interactions can reduce accuracy and increase computational demands. Existing approaches often rely on static feature subsets or manual intervention, limiting adaptability and scalability. However, dynamic, per-instance feature selection methods and model-specific interpretability in reinforcement learning remain underexplored. This study proposes a human-in-the-loop (HITL) feature selection framework integrated into a Double Deep Q-Network (DDQN) using a Kolmogorov-Arnold Network (KAN). Our novel approach leverages simulated human feedback and stochastic distribution-based sampling, specifically Beta, to iteratively refine feature subsets per data instance, improving flexibility in feature selection. The KAN-DDQN achieved notable test accuracies of 93% on MNIST and 83% on FashionMNIST, outperforming conventional MLP-DDQN models by up to 9%. The KAN-based model provided high interpretability via symbolic representation while using 4 times fewer neurons in the hidden layer than MLPs did. Comparatively, the models without feature selection achieved test accuracies of only 58% on MNIST and 64% on FashionMNIST, highlighting significant gains with our framework. Pruning and visualization further enhanced model transparency by elucidating decision pathways. These findings present a scalable, interpretable solution for feature selection that is suitable for applications requiring real-time, adaptive decision-making with minimal human oversight.
[LG-25] Reducing Hyperparameter Tuning Costs in ML Vision and Language Model Training Pipelines via Memoization-Awareness
链接: https://arxiv.org/abs/2411.03731
作者: Abdelmajid Essofi,Ridwan Salahuddeen,Munachiso Nwadike,Elnura Zhalieva,Kun Zhang,Eric Xing,Willie Neiswanger,Qirong Ho
关键词-EN: encompassing data preparation, stages encompassing data, data preparation, sequence of stages, stages encompassing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The training or fine-tuning of machine learning, vision, and language models is often implemented as a pipeline: a sequence of stages encompassing data preparation, model training and evaluation. In this paper, we exploit pipeline structures to reduce the cost of hyperparameter tuning for model training/fine-tuning, which is particularly valuable for language models given their high costs in GPU-days. We propose a “memoization-aware” Bayesian Optimization (BO) algorithm, EEIPU, that works in tandem with a pipeline caching system, allowing it to evaluate significantly more hyperparameter candidates per GPU-day than other tuning algorithms. The result is better-quality hyperparameters in the same amount of search time, or equivalently, reduced search time to reach the same hyperparameter quality. In our benchmarks on machine learning (model ensembles), vision (convolutional architecture) and language (T5 architecture) pipelines, we compare EEIPU against recent BO algorithms: EEIPU produces an average of 103% more hyperparameter candidates (within the same budget), and increases the validation metric by an average of 108% more than other algorithms (where the increase is measured starting from the end of warm-up iterations).
[LG-26] Generalized Trusted Multi-view Classification Framework with Hierarchical Opinion Aggregation
链接: https://arxiv.org/abs/2411.03713
作者: Long Shi,Chuanqing Tang,Huangyi Deng,Cai Xu,Lei Xing,Badong Chen
关键词-EN: Trusted Multi-view Classification, Trusted Multi-view, Multi-view Classification, witnessed a considerable, considerable interest
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recently, multi-view learning has witnessed considerable interest in the research of trusted decision-making. Previous methods are mainly inspired by an important paper published by Han et al. in 2021, which formulates a Trusted Multi-view Classification (TMC) framework that aggregates evidence from different views based on Dempster's combination rule. All these methods only consider inter-view aggregation, lacking exploitation of intra-view information. In this paper, we propose a generalized trusted multi-view classification framework with hierarchical opinion aggregation. This hierarchical framework includes a two-phase aggregation process: the intra-view and inter-view aggregation hierarchies. In the intra-view aggregation, we assume that each view is comprised of common information shared with other views, as well as its specific information. We then aggregate both the common and specific information. This aggregation phase helps eliminate the feature noise inherent to the view itself, thereby improving the view quality. In the inter-view aggregation, we design an attention mechanism at the evidence level to facilitate opinion aggregation from different views. To the best of our knowledge, this is one of the pioneering efforts to formulate a hierarchical aggregation framework in the trusted multi-view learning domain. Extensive experiments show that our model outperforms several state-of-the-art trust-related baselines.
[LG-27] Multi-model Ensemble Conformal Prediction in Dynamic Environments
链接: https://arxiv.org/abs/2411.03678
作者: Erfan Hajihashemi,Yanning Shen
关键词-EN: previously unseen datum, predetermined coverage probability, Conformal prediction, Adaptive conformal prediction, uncertainty quantification method
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Conformal prediction is an uncertainty quantification method that constructs a prediction set for a previously unseen datum, ensuring the true label is included with a predetermined coverage probability. Adaptive conformal prediction has been developed to address data distribution shifts in dynamic environments. However, the efficiency of prediction sets varies depending on the learning model used. Employing a single fixed model may not consistently offer the best performance in dynamic environments with unknown data distribution shifts. To address this issue, we introduce a novel adaptive conformal prediction framework, where the model used for creating prediction sets is selected on the fly from multiple candidate models. The proposed algorithm is proven to achieve strongly adaptive regret over all intervals while maintaining valid coverage. Experiments on real and synthetic datasets corroborate that the proposed approach consistently yields more efficient prediction sets while maintaining valid coverage, outperforming alternative methods.
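作为背景,单模型的分裂共形预测集构造如下(示意):用校准集非一致性分数的经验分位数做阈值。论文的贡献在于随时间在多个候选模型间在线选择,并证明强自适应遗憾,这部分不在此草图范围内:

```python
# 示意:分裂共形预测——按校准分数分位数给出满足覆盖率的预测集(单模型版)
import numpy as np

def conformal_set(cal_scores, test_scores, alpha=0.1):
    """cal_scores: 校准集分数;test_scores: 测试点各候选标签的分数;
    返回分数不超过阈值的标签下标(需校准集足够大使分位点 <= 1)。"""
    n = len(cal_scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(cal_scores, level, method="higher")
    return np.where(np.asarray(test_scores) <= q)[0]
```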
[LG-28] Energy-based physics-informed neural network for frictionless contact problems under large deformation
链接: https://arxiv.org/abs/2411.03671
作者: Jinshuai Bai,Zhongya Lin,Yizheng Wang,Jiancong Wen,Yinghua Liu,Timon Rabczuk,YuanTong Gu,Xi-Qiao Feng
关键词-EN: proposed PINNs framework, proposed PINNs, PINNs framework, engineering applications, enabling the prediction
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 22 pages, 9 figures
点击查看摘要
Abstract:Numerical methods for contact mechanics are of great importance in engineering applications, enabling the prediction and analysis of complex surface interactions under various conditions. In this work, we propose an energy-based physics-informed neural network (PINN) framework for solving frictionless contact problems under large deformation. Inspired by the microscopic Lennard-Jones potential, a surface contact energy is used to describe the contact phenomena. To ensure the robustness of the proposed PINN framework, relaxation, gradual loading and output scaling techniques are introduced. In the numerical examples, the well-known Hertz contact benchmark problem is conducted, demonstrating the effectiveness and robustness of the proposed PINN framework. Moreover, challenging contact problems with the consideration of geometrical and material nonlinearities are tested. It has been shown that the proposed PINN framework provides a reliable and powerful tool for nonlinear contact mechanics. More importantly, the proposed PINN framework exhibits computational efficiency competitive with commercial FEM software when dealing with those complex contact problems. The codes used in this manuscript are available at this https URL (code will be available after acceptance).
[LG-29] Can Graph Neural Networks Expose Training Data Properties? An Efficient Risk Assessment Approach NEURIPS’24
链接: https://arxiv.org/abs/2411.03663
作者: Hanyang Yuan,Jiarong Xu,Renhong Huang,Mingli Song,Chunping Wang,Yang Yang
关键词-EN: attracted considerable attention, considerable attention due, diverse applications, attracted considerable, considerable attention
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: In NeurIPS’24
点击查看摘要
Abstract:Graph neural networks (GNNs) have attracted considerable attention due to their diverse applications. However, the scarcity and quality limitations of graph data present challenges to their training process in practical settings. To facilitate the development of effective GNNs, companies and researchers often seek external collaboration. Yet, directly sharing data raises privacy concerns, motivating data owners to train GNNs on their private graphs and share the trained models. Unfortunately, these models may still inadvertently disclose sensitive properties of their training graphs (e.g., average default rate in a transaction network), leading to severe consequences for data owners. In this work, we study graph property inference attacks to identify the risk of sensitive property information leakage from shared models. Existing approaches typically train numerous shadow models for developing such attacks, which is computationally intensive and impractical. To address this issue, we propose an efficient graph property inference attack by leveraging model approximation techniques. Our method only requires training a small set of models on graphs, while generating a sufficient number of approximated shadow models for attacks. To enhance diversity while reducing errors in the approximated models, we apply edit distance to quantify the diversity within a group of approximated models and introduce a theoretically guaranteed criterion to evaluate each model's error. Subsequently, we propose a novel selection mechanism to ensure that the retained approximated models achieve high diversity and low error. Extensive experiments across six real-world scenarios demonstrate our method's substantial improvement, with average increases of 2.7% in attack accuracy and 4.1% in ROC-AUC, while being 6.5× faster compared to the best baseline.
[LG-30] Constrained Multi-objective Bayesian Optimization through Optimistic Constraints Estimation
链接: https://arxiv.org/abs/2411.03641
作者: Diantong Li,Fengxue Zhang,Chong Liu,Yuxin Chen
关键词-EN: including drug discovery, scientific experiment design, Multi-objective Bayesian optimization, experiment design, including drug
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Multi-objective Bayesian optimization has been widely adopted in scientific experiment design, including drug discovery and hyperparameter optimization. In practice, regulatory or safety concerns often impose additional thresholds on certain attributes of the experimental outcomes. Previous work has primarily focused on constrained single-objective optimization tasks or active search under constraints. We propose CMOBO, a sample-efficient constrained multi-objective Bayesian optimization algorithm that balances learning of the feasible region (defined on multiple unknowns) with multi-objective optimization within the feasible region in a principled manner. We provide both theoretical justification and empirical evidence, demonstrating the efficacy of our approach on various synthetic benchmarks and real-world applications.
[LG-31] SEGMN: A Structure-Enhanced Graph Matching Network for Graph Similarity Learning
链接: https://arxiv.org/abs/2411.03624
作者: Wenjun Wang,Jiacheng Lu,Kejia Chen,Zheng Liu,Shilong Sang
关键词-EN: structure perception matching, Graph similarity computation, perception matching module, structure, GSC methods
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Graph similarity computation (GSC) aims to quantify the similarity score between two graphs. Although recent GSC methods based on graph neural networks (GNNs) take advantage of intra-graph structures in message passing, few of them fully utilize the structures presented by edges to boost the representation of their connected nodes. Moreover, previous cross-graph node embedding matching lacks the perception of the overall structure of the graph pair, due to the fact that the node representations from GNNs are confined to the intra-graph structure, causing the unreasonable similarity score. Intuitively, the cross-graph structure represented in the assignment graph is helpful to rectify the inappropriate matching. Therefore, we propose a structure-enhanced graph matching network (SEGMN). Equipped with a dual embedding learning module and a structure perception matching module, SEGMN achieves structure enhancement in both embedding learning and cross-graph matching. The dual embedding learning module incorporates adjacent edge representation into each node to achieve a structure-enhanced representation. The structure perception matching module achieves cross-graph structure enhancement through assignment graph convolution. The similarity score of each cross-graph node pair can be rectified by aggregating messages from structurally relevant node pairs. Experimental results on benchmark datasets demonstrate that SEGMN outperforms the state-of-the-art GSC methods in the GED regression task, and the structure perception matching module is plug-and-play, which can further improve the performance of the baselines by up to 25%.
[LG-32] Temporal-Difference Learning Using Distributed Error Signals NEURIPS2024
Link: https://arxiv.org/abs/2411.03604
Authors: Jonas Guan, Shon Eduard Verch, Claas Voelcker, Ethan C. Jackson, Nicolas Papernot, William A. Cunningham
Keywords-EN: nucleus accumbens, computational problem, problem in biological, biological reward-based learning, credit assignment
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Note: 10 pages, to be published at NeurIPS 2024
Click to view abstract
Abstract:A computational problem in biological reward-based learning is how credit assignment is performed in the nucleus accumbens (NAc). Much research suggests that NAc dopamine encodes temporal-difference (TD) errors for learning value predictions. However, dopamine is synchronously distributed in regionally homogeneous concentrations, which does not support explicit credit assignment (like used by backpropagation). It is unclear whether distributed errors alone are sufficient for synapses to make coordinated updates to learn complex, nonlinear reward-based learning tasks. We design a new deep Q-learning algorithm, Artificial Dopamine, to computationally demonstrate that synchronously distributed, per-layer TD errors may be sufficient to learn surprisingly complex RL tasks. We empirically evaluate our algorithm on MinAtar, the DeepMind Control Suite, and classic control tasks, and show it often achieves comparable performance to deep RL algorithms that use backpropagation.
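For readers who want the core quantity made concrete: the temporal-difference error the paper distributes across layers is the standard scalar delta = r + gamma * V(s') - V(s). A minimal tabular sketch on a made-up chain environment (the paper's actual setup is a deep Q-learning architecture with per-layer errors, which is considerably more elaborate):

```python
import numpy as np

# Tabular TD(0) on a 5-state chain: one scalar TD error drives each update,
# with no backpropagated, per-weight credit assignment.
n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)
rng = np.random.default_rng(1)

for _ in range(2000):
    s = int(rng.integers(0, n_states))
    s_next = min(s + 1, n_states - 1)        # deterministic chain dynamics
    r = 1.0 if s_next == n_states - 1 else 0.0
    delta = r + gamma * V[s_next] - V[s]     # the TD error
    V[s] += alpha * delta

print(V)  # values increase toward the rewarding end of the chain
```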
[LG-33] Open-Source High-Speed Flight Surrogate Modeling Framework
Link: https://arxiv.org/abs/2411.03598
Authors: Tyler E. Korenyi-Both, Nathan J. Falkiewicz, Matthew C. Jones
Keywords-EN: High-speed flight vehicles, High-speed flight, speed of sound, space exploration, travel much faster
Subjects: Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:High-speed flight vehicles, which travel much faster than the speed of sound, are crucial for national defense and space exploration. However, accurately predicting their behavior under numerous, varied flight conditions is a challenge and often prohibitively expensive. The proposed approach involves creating smarter, more efficient machine learning models (also known as surrogate models or meta models) that can fuse data generated from a variety of fidelity levels – to include engineering methods, simulation, wind tunnel, and flight test data – to make more accurate predictions. These models are able to move the bulk of the computation from high performance computing (HPC) to single user machines (laptop, desktop, etc.). The project builds upon previous work but introduces code improvements and an informed perspective on the direction of the field. The new surrogate modeling framework is now modular and, by design, broadly applicable to many modeling problems. The new framework also has a more robust automatic hyperparameter tuning capability and abstracts away most of the pre- and post-processing tasks. The Gaussian process regression and deep neural network-based models included in the presented framework were able to model two datasets with high accuracy (R^2 > 0.99). The primary conclusion is that the framework is effective and has been delivered to the Air Force for integration into real-world projects. For future work, significant and immediate investment in continued research is crucial. The author recommends further testing and refining modeling methods that explicitly incorporate physical laws and are robust enough to handle simulation and test data from varying resolutions and sources, including coarse meshes, fine meshes, unstructured meshes, and limited experimental test points.
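As a rough illustration of the surrogate idea, the sketch below fits a Gaussian process regressor to synthetic data standing in for expensive high-speed flight simulations and reports held-out accuracy; the input names and test function are assumptions for demonstration, not the framework's actual datasets:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 3))  # stand-ins for e.g. Mach, altitude, angle of attack
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * X[:, 2]  # stand-in for a CFD output

gp = GaussianProcessRegressor(ConstantKernel() * RBF([0.5] * 3), normalize_y=True)
gp.fit(X[:150], y[:150])
mean, std = gp.predict(X[150:], return_std=True)  # predictions with uncertainty
print("held-out R^2:", r2_score(y[150:], mean))
```

The `return_std` output is what makes surrogates of this kind useful for adaptive sampling: new expensive simulations can be requested where the predictive uncertainty is largest.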
[LG-34] Enhancing the Expressivity of Temporal Graph Networks through Source-Target Identification NEURIPS
Link: https://arxiv.org/abs/2411.03596
Authors: Benedict Aaron Tjandra, Federico Barbero, Michael Bronstein
Keywords-EN: Temporal Graph Networks, Graph Networks, dynamic node affinity, node affinity prediction, Temporal Graph
Subjects: Machine Learning (cs.LG)
*Note: Accepted to NeurIPS Symmetry and Geometry in Neural Representations Workshop 2024
Click to view abstract
Abstract:Despite the successful application of Temporal Graph Networks (TGNs) for tasks such as dynamic node classification and link prediction, they still perform poorly on the task of dynamic node affinity prediction – where the goal is to predict "how much" two nodes will interact in the future. In fact, simple heuristic approaches such as persistent forecasts and moving averages over ground-truth labels significantly and consistently outperform TGNs. Building on this observation, we find that computing heuristics over messages is an equally competitive approach, outperforming TGN and all current temporal graph (TG) models on dynamic node affinity prediction. In this paper, we prove that no formulation of TGN can represent persistent forecasting or moving averages over messages, and propose to enhance the expressivity of TGNs by adding source-target identification to each interaction event message. We show that this modification is required to represent persistent forecasting, moving averages, and the broader class of autoregressive models over messages. Our proposed method, TGNv2, significantly outperforms TGN and all current TG models on all Temporal Graph Benchmark (TGB) dynamic node affinity prediction datasets.
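The two heuristic baselines the paper singles out are simple enough to state exactly; a sketch over a made-up affinity history:

```python
import numpy as np

def persistent_forecast(history):
    # Predict that the next value equals the last observed one.
    return history[-1]

def moving_average(history, window=5):
    # Predict the mean of the most recent `window` observations.
    return float(np.mean(history[-window:]))

affinity = np.array([0.10, 0.30, 0.20, 0.40, 0.50, 0.45])
print(persistent_forecast(affinity))       # 0.45
print(moving_average(affinity, window=3))  # 0.45
```

The paper's observation is that vanilla TGN cannot express even these rules over messages, which motivates the source-target identification added in TGNv2.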
[LG-35] Learning Constant-Depth Circuits in Malicious Noise Models
Link: https://arxiv.org/abs/2411.03570
Authors: Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan
Keywords-EN: Nisan gave, gave a quasipolynomial-time, work of Linial, Mansour, Linial
Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:The seminal work of Linial, Mansour, and Nisan gave a quasipolynomial-time algorithm for learning constant-depth circuits ($\mathsf{AC}^0$) with respect to the uniform distribution on the hypercube. Extending their algorithm to the setting of malicious noise, where both covariates and labels can be adversarially corrupted, has remained open. Here we achieve such a result, inspired by recent work on learning with distribution shift. Our running time essentially matches their algorithm, which is known to be optimal assuming various cryptographic primitives. Our proof uses a simple outlier-removal method combined with Braverman's theorem for fooling constant-depth circuits. We attain the best possible dependence on the noise rate and succeed in the harshest possible noise model (i.e., contamination or so-called "nasty noise").
[LG-36] Do Mice Grok? Glimpses of Hidden Progress During Overtraining in Sensory Cortex
Link: https://arxiv.org/abs/2411.03541
Authors: Tanishq Kumar, Blake Bordelon, Cengiz Pehlevan, Venkatesh N. Murthy, Samuel J. Gershman
Keywords-EN: behavior stops changing, task-relevant representations stop, stops changing, learning, behavior stops
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*Note:
Click to view abstract
Abstract:Does learning of task-relevant representations stop when behavior stops changing? Motivated by recent theoretical advances in machine learning and the intuitive observation that human experts continue to learn from practice even after mastery, we hypothesize that task-specific representation learning can continue, even when behavior plateaus. In a novel reanalysis of recently published neural data, we find evidence for such learning in posterior piriform cortex of mice following continued training on a task, long after behavior saturates at near-ceiling performance (“overtraining”). This learning is marked by an increase in decoding accuracy from piriform neural populations and improved performance on held-out generalization tests. We demonstrate that class representations in cortex continue to separate during overtraining, so that examples that were incorrectly classified at the beginning of overtraining can abruptly be correctly classified later on, despite no changes in behavior during that time. We hypothesize this hidden yet rich learning takes the form of approximate margin maximization; we validate this and other predictions in the neural data, as well as build and interpret a simple synthetic model that recapitulates these phenomena. We conclude by showing how this model of late-time feature learning implies an explanation for the empirical puzzle of overtraining reversal in animal learning, where task-specific representations are more robust to particular task changes because the learned features can be reused.
[LG-37] PACE: Pacing Operator Learning to Accurate Optical Field Simulation for Complicated Photonic Devices NEURIPS2024
Link: https://arxiv.org/abs/2411.03527
Authors: Hanqing Zhu, Wenyan Cong, Guojin Chen, Shupeng Ning, Ray T. Chen, Jiaqi Gu, David Z. Pan
Keywords-EN: Electromagnetic field simulation, Electromagnetic field, validating photonic devices, central to designing, photonic devices
Subjects: Machine Learning (cs.LG); Optics (physics.optics)
*Note: Accepted by NeurIPS 2024, 21 pages
Click to view abstract
Abstract:Electromagnetic field simulation is central to designing, optimizing, and validating photonic devices and circuits. However, costly computation associated with numerical simulation poses a significant bottleneck, hindering scalability and turnaround time in the photonic circuit design process. Neural operators offer a promising alternative, but the existing SOTA approach, NeurOLight, struggles with predicting high-fidelity fields for real-world complicated photonic devices, with a best reported 0.38 normalized mean absolute error. The interplay of highly complex light-matter interactions, e.g., scattering and resonance, sensitivity to local structure details, non-uniform learning complexity for full-domain simulation, and rich frequency information contribute to the failure of existing neural PDE solvers. In this work, we boost the prediction fidelity to an unprecedented level for simulating complex photonic devices with a novel operator design driven by the above challenges. We propose a novel cross-axis factorized PACE operator with a strong long-distance modeling capacity to connect the full-domain complex field pattern with local device structures. Inspired by human learning, we further divide and conquer the simulation task for extremely hard cases into two progressively easier tasks, with a first-stage model learning an initial solution refined by a second model. On various complicated photonic device benchmarks, we demonstrate one sole PACE model is capable of achieving 73% lower error with 50% fewer parameters compared with various recent ML PDE solvers. The two-stage setup further advances high-fidelity simulation for even more intricate cases. In terms of runtime, PACE demonstrates 154-577x and 11.8-12x simulation speedup over a numerical solver using scipy or the highly-optimized pardiso solver, respectively. We open sourced the code and dataset.
[LG-38] Understanding Contrastive Learning via Gaussian Mixture Models
Link: https://arxiv.org/abs/2411.03517
Authors: Parikshit Bansal, Ali Kavis, Sujay Sanghavi
Keywords-EN: encourages the embedding, embeddings of random, Contrastive learning attempts, Gaussian Mixture Models, Contrastive learning
Subjects: Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:Contrastive learning attempts to learn representations from un-labeled data; it does so via a loss function that encourages the embedding of a point to be close to that of its augmentations, and far from the embeddings of random other points. This simple idea performs remarkably well, yet it is not precisely theoretically understood why this is the case. In this paper we analyze contrastive learning (specifically, the InfoNCE loss) in a natural context: dimensionality reduction in Gaussian Mixture Models. Crucially, we define an augmentation of a data point as being another independent draw from the same underlying mixture component. We show that vanilla InfoNCE is able to find the optimal lower-dimensional subspace even when the Gaussians are not isotropic – something that vanilla spectral techniques cannot do. We further extend our analyses to multi-modal contrastive learning algorithms (e.g., CLIP). In this setting we show that contrastive learning learns the subset of fisher-optimal subspace, effectively filtering out all the noise from the learnt representations.
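The loss being analyzed, InfoNCE, together with the paper's GMM-specific notion of augmentation (an independent draw from the same mixture component), fits in a few lines of NumPy; the dimensions and temperature below are arbitrary choices for illustration:

```python
import numpy as np

def info_nce(z_anchor, z_positive, temperature=0.1):
    """InfoNCE over a batch: row i of z_positive is the positive for row i
    of z_anchor; every other row in the batch serves as a negative."""
    za = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    zp = z_positive / np.linalg.norm(z_positive, axis=1, keepdims=True)
    logits = za @ zp.T / temperature              # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal

rng = np.random.default_rng(0)
means = rng.normal(size=(4, 16)) * 3              # 4 mixture components
comp = rng.integers(0, 4, size=32)
x1 = means[comp] + rng.normal(size=(32, 16))      # one draw per component
x2 = means[comp] + rng.normal(size=(32, 16))      # its "augmentation": a fresh draw
print(info_nce(x1, x2))
```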
[LG-39] An Open-source Sim2Real Approach for Sensor-independent Robot Navigation in a Grid ICRA
Link: https://arxiv.org/abs/2411.03494
Authors: Murad Mehrab Abrar, Souryadeep Mondal, Michelle Hickner
Keywords-EN: Gymnasium Frozen Lake, Frozen Lake simulation, Application Programming Interface, Frozen Lake environment, Frozen Lake
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Note: Accepted for publication at the 9th IEEE International Conference on Robotics and Automation Engineering (IEEE ICRAE 2024), Singapore
Click to view abstract
Abstract:This paper presents a Sim2Real (Simulation to Reality) approach to bridge the gap between a trained agent in a simulated environment and its real-world implementation in navigating a robot in a similar setting. Specifically, we focus on navigating a quadruped robot in a real-world grid-like environment inspired by the Gymnasium Frozen Lake – a highly user-friendly and free Application Programming Interface (API) to develop and test Reinforcement Learning (RL) algorithms. We detail the development of a pipeline to transfer motion policies learned in the Frozen Lake simulation to a physical quadruped robot, thus enabling autonomous navigation and obstacle avoidance in a grid without relying on expensive localization and mapping sensors. The work involves training an RL agent in the Frozen Lake environment and utilizing the resulting Q-table to control a 12 Degrees-of-Freedom (DOF) quadruped robot. In addition to detailing the RL implementation, inverse kinematics-based quadruped gaits, and the transfer policy pipeline, we open-source the project on GitHub and include a demonstration video of our Sim2Real transfer approach. This work provides an accessible, straightforward, and low-cost framework for researchers, students, and hobbyists to explore and implement RL-based robot navigation in real-world grid environments.
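A minimal sketch of the simulation half of the pipeline, tabular Q-learning in the Gymnasium Frozen Lake environment, whose resulting Q-table is what a downstream gait controller would consume (the hyperparameters here are assumptions; the paper's exact configuration lives in its open-sourced repository):

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s, done = s2, terminated or truncated

np.save("q_table.npy", Q)  # the artifact transferred to the physical robot
```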
[LG-40] Pathway-Guided Optimization of Deep Generative Molecular Design Models for Cancer Therapy
Link: https://arxiv.org/abs/2411.03460
Authors: Alif Bin Abdul Qayyum, Susan D. Mertins, Amanda K. Paulson, Nathan M. Urban, Byung-Jun Yoon
Keywords-EN: potentially expensive black-box, expensive black-box objective, black-box objective function, drug design problem, structured molecular space
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*Note:
Click to view abstract
Abstract:The data-driven drug design problem can be formulated as an optimization task of a potentially expensive black-box objective function over a huge high-dimensional and structured molecular space. The junction tree variational autoencoder (JTVAE) has been shown to be an efficient generative model that can be used for suggesting legitimate novel drug-like small molecules with improved properties. While the performance of the generative molecular design (GMD) scheme strongly depends on the initial training data, one can improve its sampling efficiency for suggesting better molecules with enhanced properties by optimizing the latent space. In this work, we propose how mechanistic models - such as pathway models described by differential equations - can be used for effective latent space optimization (LSO) of JTVAEs and other similar models for GMD. To demonstrate the potential of our proposed approach, we show how a pharmacodynamic model, assessing the therapeutic efficacy of a drug-like small molecule by predicting how it modulates a cancer pathway, can be incorporated for effective LSO of data-driven models for GMD.
[LG-41] Fourier Analysis of Variational Quantum Circuits for Supervised Learning
Link: https://arxiv.org/abs/2411.03450
Authors: Marco Wiedmann, Maniraman Periyasamy, Daniel D. Scherer
Keywords-EN: truncated Fourier sum, Fourier analysis, truncated Fourier, Fourier sum, Fourier
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*Note:
Click to view abstract
Abstract:Variational quantum circuits (VQCs) can be understood through the lens of Fourier analysis. It is already well-known that the function space represented by any circuit architecture can be described through a truncated Fourier sum. We show that the spectrum available to that truncated Fourier sum is not entirely determined by the encoding gates of the circuit, since the variational part of the circuit can constrain certain coefficients to zero, effectively removing that frequency from the spectrum. To the best of our knowledge, we give the first description of the functional dependence of the Fourier coefficients on the variational parameters as trigonometric polynomials. This allows us to provide an algorithm which computes the exact spectrum of any given circuit and the corresponding Fourier coefficients. Finally, we demonstrate that by comparing the Fourier transform of the dataset to the available spectra, it is possible to predict which VQC out of a given list of choices will be able to best fit the data.
[LG-42] Quantifying Aleatoric Uncertainty of the Treatment Effect: A Novel Orthogonal Learner
Link: https://arxiv.org/abs/2411.03387
Authors: Valentyn Melnychuk, Stefan Feuerriegel, Mihaela van der Schaar
Keywords-EN: treatment effect, Estimating causal quantities, treatment, aleatoric uncertainty, causal quantities
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Note:
Click to view abstract
Abstract:Estimating causal quantities from observational data is crucial for understanding the safety and effectiveness of medical treatments. However, to make reliable inferences, medical practitioners require not only estimating averaged causal quantities, such as the conditional average treatment effect, but also understanding the randomness of the treatment effect as a random variable. This randomness is referred to as aleatoric uncertainty and is necessary for understanding the probability of benefit from treatment or quantiles of the treatment effect. Yet, the aleatoric uncertainty of the treatment effect has received surprisingly little attention in the causal machine learning community. To fill this gap, we aim to quantify the aleatoric uncertainty of the treatment effect at the covariate-conditional level, namely, the conditional distribution of the treatment effect (CDTE). Unlike average causal quantities, the CDTE is not point identifiable without strong additional assumptions. As a remedy, we employ partial identification to obtain sharp bounds on the CDTE and thereby quantify the aleatoric uncertainty of the treatment effect. We then develop a novel, orthogonal learner for the bounds on the CDTE, which we call AU-learner. We further show that our AU-learner has several strengths in that it satisfies Neyman-orthogonality and is doubly robust. Finally, we propose a fully-parametric deep learning instantiation of our AU-learner.
[LG-43] Kernel Approximation using Analog In-Memory Computing
Link: https://arxiv.org/abs/2411.03375
Authors: Julian Büchel, Giacomo Camposampiero, Athanasios Vasilopoulos, Corey Lammie, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian
Keywords-EN: machine learning algorithms, incur significant memory, Analog In-Memory Computing, machine learning, computational costs
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Note:
Click to view abstract
Abstract:Kernel functions are vital ingredients of several machine learning algorithms, but often incur significant memory and computational costs. We introduce an approach to kernel approximation in machine learning algorithms suitable for mixed-signal Analog In-Memory Computing (AIMC) architectures. Analog In-Memory Kernel Approximation addresses the performance bottlenecks of conventional kernel-based methods by executing most operations in approximate kernel methods directly in memory. The IBM HERMES Project Chip, a state-of-the-art phase-change memory based AIMC chip, is utilized for the hardware demonstration of kernel approximation. Experimental results show that our method maintains high accuracy, with less than a 1% drop in kernel-based ridge classification benchmarks and within 1% accuracy on the Long Range Arena benchmark for kernelized attention in Transformer neural networks. Compared to traditional digital accelerators, our approach is estimated to deliver superior energy efficiency and lower power consumption. These findings highlight the potential of heterogeneous AIMC architectures to enhance the efficiency and scalability of machine learning applications.
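For context on what "approximate kernel methods" means computationally, one standard digital scheme is random Fourier features for the RBF kernel; the sketch below shows that math only and says nothing about the analog in-memory execution the paper demonstrates:

```python
import numpy as np

def random_fourier_features(X, n_features=256, gamma=1.0, seed=0):
    """Random features Z such that Z @ Z.T approximates the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2) (Rahimi & Recht style)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, n_features=4096)
exact = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # gamma = 1
print("max approximation error:", np.abs(Z @ Z.T - exact).max())
```

The appeal for in-memory hardware is that, once the features are computed, kernel evaluations reduce to the matrix-vector products such chips execute natively.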
[LG-44] Energy Price Modelling: A Comparative Evaluation of four Generations of Forecasting Methods
Link: https://arxiv.org/abs/2411.03372
Authors: Alexandru-Victor Andrei, Georg Velev, Filip-Mihai Toma, Daniel Traian Pele, Stefan Lessmann
Keywords-EN: modern economic systems, economic systems, energy price forecasting, driver of modern, modern economic
Subjects: Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:Energy is a critical driver of modern economic systems. Accurate energy price forecasting plays an important role in supporting decision-making at various levels, from operational purchasing decisions at individual business organizations to policy-making. A significant body of literature has looked into energy price forecasting, investigating a wide range of methods to improve accuracy and inform these critical decisions. Given the evolving landscape of forecasting techniques, the literature lacks a thorough empirical comparison that systematically contrasts these methods. This paper provides an in-depth review of the evolution of forecasting modeling frameworks, from well-established econometric models to machine learning methods, early sequence learners such as LSTMs, and more recent advancements in deep learning with transformer networks, which represent the cutting edge in forecasting. We offer a detailed review of the related literature and categorize forecasting methodologies into four model families. We also explore emerging concepts like pre-training and transfer learning, which have transformed the analysis of unstructured data and hold significant promise for time series forecasting. We address a gap in the literature by performing a comprehensive empirical analysis of these four model families: using data from the EU energy markets, we conduct a large-scale empirical study which contrasts the forecasting accuracy of different approaches, focusing especially on alternative propositions for time series transformers.
[LG-45] DM4Steal: Diffusion Model For Link Stealing Attack On Graph Neural Networks
Link: https://arxiv.org/abs/2411.03364
Authors: Jinyin Chen, Haonan Ma, Haibin Zheng
Keywords-EN: graph neural network, GNN, neural network, fast development, target GNN model
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:Graphs have become increasingly integral to the advancement of recommendation systems, particularly with the fast development of graph neural networks (GNNs). By exploiting rich node features and link information, GNNs are designed to provide personalized and accurate suggestions. Meanwhile, the privacy leakage of GNNs in such contexts has also captured special attention. Prior work has revealed that a malicious user can utilize auxiliary knowledge to extract sensitive link data of the target graph, integral to recommendation systems, via the decisions made by the target GNN model. This poses a significant risk to the integrity and confidentiality of data used in recommendation systems. Though important, previous works on GNN privacy leakage are still challenged in three aspects, i.e., limited stealing attack scenarios, sub-optimal attack performance, and adaptation against defense. To address these issues, we propose a diffusion model based link stealing attack, named DM4Steal. It differs from previous work in three critical aspects. (i) Generality: aiming at six attack scenarios with limited auxiliary knowledge, we propose a novel training strategy for diffusion models so that DM4Steal is transferable to diverse attack scenarios. (ii) Effectiveness: benefiting from the retention of semantic structure in the diffusion model during the training process, DM4Steal is capable of learning the precise topology of the target graph through the GNN decision process. (iii) Adaptation: when the GNN is defensive (e.g., DP, Dropout), DM4Steal relies on the stability that comes from sampling the score model multiple times to keep performance degradation to a minimum, thus implementing a successful adaptive attack on defensive GNNs.
[LG-46] TDDBench: A Benchmark for Training Data Detection
Link: https://arxiv.org/abs/2411.03363
Authors: Zhihao Zhu, Yi Yang, Defu Lian
Keywords-EN: Membership Inference Attack, machine learning model, TDD, Inference Attack, task aimed
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:Training Data Detection (TDD) is a task aimed at determining whether a specific data instance is used to train a machine learning model. In the computer security literature, TDD is also referred to as Membership Inference Attack (MIA). Given its potential to assess the risks of training data breaches, ensure copyright authentication, and verify model unlearning, TDD has garnered significant attention in recent years, leading to the development of numerous methods. Despite these advancements, there is no comprehensive benchmark to thoroughly evaluate the effectiveness of TDD methods. In this work, we introduce TDDBench, which consists of 13 datasets spanning three data modalities: image, tabular, and text. We benchmark 21 different TDD methods across four detection paradigms and evaluate their performance from five perspectives: average detection performance, best detection performance, memory consumption, and computational efficiency in both time and memory. With TDDBench, researchers can identify bottlenecks and areas for improvement in TDD algorithms, while practitioners can make informed trade-offs between effectiveness and efficiency when selecting TDD algorithms for specific use cases. Our large-scale benchmarking also reveals the generally unsatisfactory performance of TDD algorithms across different datasets. To enhance accessibility and reproducibility, we open-source TDDBench for the research community.
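As background on the detection paradigms TDDBench benchmarks, the simplest classical baseline is a loss threshold: points the model was trained on tend to incur lower loss. A hedged sketch of that generic attack idea (not any specific method among the 21 evaluated):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_in, y_in = X[:1000], y[:1000]    # members (training data)
X_out, y_out = X[1000:], y[1000:]  # non-members

model = LogisticRegression(max_iter=1000).fit(X_in, y_in)

def per_sample_loss(model, X, y):
    p = model.predict_proba(X)[np.arange(len(y)), y]  # prob. of the true class
    return -np.log(np.clip(p, 1e-12, 1.0))

tau = np.median(per_sample_loss(model, X_in, y_in))   # threshold from member losses
flagged = per_sample_loss(model, X_out, y_out) < tau  # low loss => guess "member"
print("false-positive rate on non-members:", flagged.mean())
```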
[LG-47] Pedestrian Volume Prediction Using a Diffusion Convolutional Gated Recurrent Unit Model
Link: https://arxiv.org/abs/2411.03360
Authors: Yiwei Dong, Tingjin Chu, Lele Zhang, Hadi Ghaderi, Hanfang Yang
Keywords-EN: Gated Recurrent Unit, predicting pedestrian flow, road users, Effective models, analysing and predicting
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
*Note:
Click to view abstract
Abstract:Effective models for analysing and predicting pedestrian flow are important to ensure the safety of both pedestrians and other road users. These tools also play a key role in optimising infrastructure design and geometry and supporting the economic utility of interconnected communities. The implementation of city-wide automatic pedestrian counting systems provides researchers with invaluable data, enabling the development and training of deep learning applications that offer better insights into traffic and crowd flows. Benefiting from real-world data provided by the City of Melbourne pedestrian counting system, this study presents a pedestrian flow prediction model, as an extension of the Diffusion Convolutional Gated Recurrent Unit (DCGRU) with dynamic time warping, named DCGRU-DTW. This model captures the spatial dependencies of pedestrian flow through the diffusion process and the temporal dependencies through the Gated Recurrent Unit (GRU). Through extensive numerical experiments, we demonstrate that the proposed model outperforms the classic vector autoregressive model and the original DCGRU across multiple model accuracy metrics.
[LG-48] SPINEX_SymbolicRegression: Similarity-based Symbolic Regression with Explainable Neighbors Exploration
Link: https://arxiv.org/abs/2411.03358
Authors: MZ Naser, Ahmed Z Naser
Keywords-EN: Explainable Neighbors Exploration, Neighbors Exploration, Predictions with Explainable, Explainable Neighbors, Similarity-based Predictions
Subjects: Machine Learning (cs.LG); Computation (stat.CO)
*Note:
Click to view abstract
Abstract:This article introduces a new symbolic regression algorithm based on the SPINEX (Similarity-based Predictions with Explainable Neighbors Exploration) family. This new algorithm (SPINEX_SymbolicRegression) adopts a similarity-based approach to identifying high-merit expressions that satisfy accuracy and structural similarity metrics. We conducted extensive benchmarking tests comparing SPINEX_SymbolicRegression to over 180 mathematical benchmarking functions from international problem sets that span randomly generated expressions and those based on real physical phenomena. Then, we evaluated the performance of the proposed algorithm in terms of accuracy, expression similarity in terms of the presence of operators and variables (as compared to the actual expressions), population size, and number of generations at convergence. The results indicate that SPINEX_SymbolicRegression consistently performs well and can, in some instances, outperform leading algorithms. In addition, the algorithm's explainability capabilities are highlighted through in-depth experiments.
[LG-49] Hypergraphs as Weighted Directed Self-Looped Graphs: Spectral Properties, Clustering, Cheeger Inequality
Link: https://arxiv.org/abs/2411.03331
Authors: Zihao Li, Dongqi Fu, Hengyu Liu, Jingrui He
Keywords-EN: studying group relations, Hypergraphs naturally arise, EDVW hypergraphs, machine learning, hypergraph Cheeger Inequality
Subjects: Social and Information Networks (cs.SI); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Note: Preprint, 31 pages
Click to view abstract
Abstract:Hypergraphs naturally arise when studying group relations and have been widely used in the field of machine learning. There has not been a unified formulation of hypergraphs, yet the recently proposed edge-dependent vertex weights (EDVW) modeling is one of the most generalized modeling methods of hypergraphs, i.e., most existing hypergraphs can be formulated as EDVW hypergraphs without any information loss to the best of our knowledge. However, the relevant algorithmic developments on EDVW hypergraphs remain nascent: compared to spectral graph theories, the formulations are incomplete, the spectral clustering algorithms are not well-developed, and one result regarding the hypergraph Cheeger Inequality is even incorrect. To this end, deriving a unified random walk-based formulation, we propose our definitions of hypergraph Rayleigh Quotient, NCut, boundary/cut, volume, and conductance, which are consistent with the corresponding definitions on graphs. Then, we prove that the normalized hypergraph Laplacian is associated with the NCut value, which inspires our HyperClus-G algorithm for spectral clustering on EDVW hypergraphs. Finally, we prove that HyperClus-G can always find an approximately linearly optimal partitioning in terms of both NCut and conductance. Additionally, we provide extensive experiments to validate our theoretical findings from an empirical perspective.
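For intuition, here is the ordinary-graph version of the NCut-motivated spectral clustering that HyperClus-G generalizes to EDVW hypergraphs; the toy adjacency matrix (two triangles joined by one bridge edge) is my own example:

```python
import numpy as np
from sklearn.cluster import KMeans

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_isqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(len(A)) - D_isqrt @ A @ D_isqrt   # normalized Laplacian

vals, vecs = np.linalg.eigh(L_sym)               # ascending eigenvalues
U = vecs[:, :2]                                  # bottom two eigenvectors
U /= np.linalg.norm(U, axis=1, keepdims=True)    # row-normalize
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(U))
# expected: the two triangles land in different clusters
```

The hypergraph case replaces A and d with quantities derived from the EDVW random walk, which is exactly the unified formulation the paper develops.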
[LG-50] Foundation Models for Rapid Autonomy Validation
Link: https://arxiv.org/abs/2411.03328
Authors: Alec Farid, Peter Schleede, Aaron Huang, Christoffer Heckman
Keywords-EN: vehicle performance validation, autonomous vehicle performance, autonomous vehicle, performance validation, autonomous vehicle requires
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:We are motivated by the problem of autonomous vehicle performance validation. A key challenge is that an autonomous vehicle requires testing in every kind of driving scenario it could encounter, including rare events, to provide a strong case for safety and show there is no edge-case pathological behavior. Autonomous vehicle companies rely on potentially millions of miles driven in realistic simulation to expose the driving stack to enough miles to estimate rates and severity of collisions. To address scalability and coverage, we propose the use of a behavior foundation model, specifically a masked autoencoder (MAE), trained to reconstruct driving scenarios. We leverage the foundation model in two complementary ways: we (i) use the learned embedding space to group qualitatively similar scenarios together and (ii) fine-tune the model to label scenario difficulty based on the likelihood of a collision upon re-simulation. We use the difficulty scoring as importance weighting for the groups of scenarios. The result is an approach which can more rapidly estimate the rates and severity of collisions by prioritizing hard scenarios while ensuring exposure to every kind of driving scenario.
[LG-51] A Surrogate Model for Quay Crane Scheduling Problem
Link: https://arxiv.org/abs/2411.03324
Authors: Kikun Park, Hyerim Bae
Keywords-EN: Quay Crane Scheduling, Crane Scheduling Problem, task scheduling problem, representative task scheduling, impact on productivity
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
*Note:
Click to view abstract
Abstract:In ports, a variety of tasks are carried out, and scheduling these tasks is crucial due to its significant impact on productivity, making the generation of precise plans essential. This study proposes a method to solve the Quay Crane Scheduling Problem (QCSP), a representative task scheduling problem in ports known to be NP-Hard, more quickly and accurately. First, the study suggests a method to create more accurate work plans for Quay Cranes (QCs) by learning from actual port data to accurately predict the working speed of QCs. Next, a Surrogate Model is proposed by combining a Machine Learning (ML) model with a Genetic Algorithm (GA), which is widely used to solve complex optimization problems, enabling faster and more precise exploration of solutions. Unlike methods that use fixed-dimensional chromosome encoding, the proposed methodology can provide solutions for encodings of various dimensions. To validate the performance of the newly proposed methodology, comparative experiments were conducted, demonstrating faster search speeds and improved fitness scores. The method proposed in this study can be applied not only to QCSP but also to various NP-Hard problems, and it opens up possibilities for the further development of advanced search algorithms by combining heuristic algorithms with ML models.
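The surrogate pattern itself, swapping the expensive fitness evaluation inside a GA for a cheap ML predictor, is generic and easy to sketch; everything below (the stand-in cost function, the real-valued chromosome encoding, the GA operators) is an illustrative assumption rather than the paper's QCSP formulation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
true_cost = lambda x: np.sum((x - 0.3) ** 2, axis=1)  # stand-in for a slow evaluator

# Fit the surrogate once on a modest budget of expensive evaluations.
X_hist = rng.uniform(0, 1, (200, 8))
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(X_hist, true_cost(X_hist))

pop = rng.uniform(0, 1, (50, 8))                      # chromosome = 8 real genes
for gen in range(30):
    fitness = -surrogate.predict(pop)                 # cheap surrogate fitness
    parents = pop[np.argsort(fitness)[-25:]]          # truncation selection
    cut = rng.integers(1, 8, 25)
    kids = np.where(np.arange(8) < cut[:, None],      # one-point crossover
                    parents, parents[rng.permutation(25)])
    kids += rng.normal(0, 0.05, kids.shape)           # Gaussian mutation
    pop = np.vstack([parents, np.clip(kids, 0, 1)])

best = pop[np.argmin(true_cost(pop))]                 # confirm with the true cost
print("best true cost:", true_cost(best[None])[0])
```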
[LG-52] Satellite monitoring uncovers progress but large disparities in doubling crop yields
Link: https://arxiv.org/abs/2411.03322
Authors: Katie Fankhauser, Evan Thomas, Zia Mehrabi
Keywords-EN: High-resolution satellite-based crop, satellite-based crop yield, crop yield mapping, yield mapping offers, mapping offers enormous
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
*Note: 5 pages, 3 figures/tables in main body; 20 pages, 13 figures/tables total including supplementary material and references; pre-print for submission undergoing review
Click to view abstract
Abstract:High-resolution satellite-based crop yield mapping offers enormous promise for monitoring progress towards the SDGs. Across 15,000 villages in Rwanda we uncover areas that are on and off track to double productivity by 2030. This machine learning enabled analysis is used to design spatially explicit productivity targets that, if met, would simultaneously ensure national goals without leaving anyone behind.
[LG-53] Learning Force Distribution Estimation for the GelSight Mini Optical Tactile Sensor Based on Finite Element Analysis
Link: https://arxiv.org/abs/2411.03315
Authors: Erik Helmut, Luca Dziarski, Niklas Funk, Boris Belousov, Jan Peters
Keywords-EN: Contact-rich manipulation remains, Contact-rich manipulation, challenge in robotics, manipulation remains, remains a major
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:Contact-rich manipulation remains a major challenge in robotics. Optical tactile sensors like GelSight Mini offer a low-cost solution for contact sensing by capturing soft-body deformations of the silicone gel. However, accurately inferring shear and normal force distributions from these gel deformations has yet to be fully addressed. In this work, we propose a machine learning approach using a U-net architecture to predict force distributions directly from the sensor’s raw images. Our model, trained on force distributions inferred from Finite Element Analysis (FEA), demonstrates promising accuracy in predicting normal and shear force distributions. It also shows potential for generalization across sensors of the same type and for enabling real-time application. The codebase, dataset and models are open-sourced and available at this https URL .
[LG-54] Partial Structure Discovery is Sufficient for No-regret Learning in Causal Bandits NEURIPS24
Link: https://arxiv.org/abs/2411.04054
Authors: Muhammad Qasim Elahi, Mahsa Ghasemi, Murat Kocaoglu
Keywords-EN: causal graph, decision variables, Causal, possibly optimal arms, latent confounders
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Note: To appear in Proceedings of NeurIPS 24
Click to view abstract
Abstract:Causal knowledge about the relationships among decision variables and a reward variable in a bandit setting can accelerate the learning of an optimal decision. Current works often assume the causal graph is known, which may not always be available a priori. Motivated by this challenge, we focus on the causal bandit problem in scenarios where the underlying causal graph is unknown and may include latent confounders. While intervention on the parents of the reward node is optimal in the absence of latent confounders, this is not necessarily the case in general. Instead, one must consider a set of possibly optimal arms/interventions, each being a special subset of the ancestors of the reward node, making causal discovery beyond the parents of the reward node essential. For regret minimization, we identify that discovering the full causal structure is unnecessary; however, no existing work provides the necessary and sufficient components of the causal graph. We formally characterize the set of necessary and sufficient latent confounders one needs to detect or learn to ensure that all possibly optimal arms are identified correctly. We also propose a randomized algorithm for learning the causal graph with a limited number of samples, providing a sample complexity guarantee for any desired confidence level. In the causal bandit setup, we propose a two-stage approach. In the first stage, we learn the induced subgraph on ancestors of the reward, along with a necessary and sufficient subset of latent confounders, to construct the set of possibly optimal arms. The regret incurred during this phase scales polynomially with respect to the number of nodes in the causal graph. The second phase involves the application of a standard bandit algorithm, such as the UCB algorithm. We also establish a regret bound for our two-phase approach, which is sublinear in the number of rounds.
[LG-55] Bayesian algorithmic perfumery: A Hierarchical Relevance Vector Machine for the Estimation of Personalized Fragrance Preferences based on Three Sensory Layers and Jungian Personality Archetypes
Link: https://arxiv.org/abs/2411.03965
Authors: Rolando Gonzales Martinez
Keywords-EN: Relevance Vector Machines, hierarchical Relevance Vector, Relevance Vector, integrating hierarchical Relevance, Jungian personality archetypes
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
*Note: 15 pages, 0 figures
Click to view abstract
Abstract:This study explores a Bayesian algorithmic approach to personalized fragrance recommendation by integrating hierarchical Relevance Vector Machines (RVM) and Jungian personality archetypes. The paper proposes a structured model that links individual scent preferences for top, middle, and base notes to personality traits derived from Jungian archetypes, such as the Hero, Caregiver, and Explorer, among others. The algorithm utilizes Bayesian updating to dynamically refine predictions as users interact with each fragrance note. This iterative process allows for the personalization of fragrance experiences based on prior data and personality assessments, leading to adaptive and interpretable recommendations. By combining psychological theory with Bayesian machine learning, this approach addresses the complexity of modeling individual preferences while capturing user-specific and population-level trends. The study highlights the potential of hierarchical Bayesian frameworks in creating customized olfactory experiences, informed by psychological and demographic factors, contributing to advancements in personalized product design and machine learning applications in sensory-based industries.
[LG-56] Improved Regret of Linear Ensemble Sampling
Link: https://arxiv.org/abs/2411.03932
Authors: Harin Lee, Min-hwan Oh
Keywords-EN: linear ensemble sampling, ensemble sampling, linear bandit algorithms, linear ensemble, improved regret bound
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:In this work, we close the fundamental gap between theory and practice by providing an improved regret bound for linear ensemble sampling. We prove that with an ensemble size logarithmic in $T$, linear ensemble sampling can achieve a frequentist regret bound of $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$, matching state-of-the-art results for randomized linear bandit algorithms, where $d$ and $T$ are the dimension of the parameter and the time horizon respectively. Our approach introduces a general regret analysis framework for linear bandit algorithms. Additionally, we reveal a significant relationship between linear ensemble sampling and Linear Perturbed-History Exploration (LinPHE), showing that LinPHE is a special case of linear ensemble sampling when the ensemble size equals $T$. This insight allows us to derive a new regret bound of $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ for LinPHE, independent of the number of arms. Our contributions advance the theoretical foundation of ensemble sampling, bringing its regret bounds in line with the best known bounds for other randomized exploration algorithms.
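A minimal sketch of linear ensemble sampling on a synthetic linear bandit: every ensemble member keeps a regularized least-squares estimate fit against independently perturbed rewards, and each round one member is drawn uniformly and followed greedily. The ensemble size, noise scales, and regularizer below are illustrative choices, not the constants from the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, M, lam, T = 5, 20, 10, 1.0, 2000   # dim, arms, ensemble size, ridge, horizon
arms = rng.normal(size=(K, d))
theta_star = rng.normal(size=d)

A = lam * np.eye(d)          # shared Gram matrix
b = np.zeros((M, d))         # one perturbed target vector per ensemble member
for t in range(T):
    m = rng.integers(M)                      # sample a member...
    theta_m = np.linalg.solve(A, b[m])
    x = arms[np.argmax(arms @ theta_m)]      # ...and act greedily under it
    r = x @ theta_star + rng.normal()
    A += np.outer(x, x)
    # every member records the same feature with an independently perturbed reward
    b += np.outer(r + rng.normal(size=M), x)

theta_hat = np.linalg.solve(A, b.mean(axis=0))
print("parameter error:", np.linalg.norm(theta_hat - theta_star))
```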
[LG-57] A Causal Framework for Precision Rehabilitation
Link: https://arxiv.org/abs/2411.03919
Authors: R. James Cotton, Bryant A. Seamon, Richard L. Segal, Randal D. Davis, Amrita Sahu, Michelle M. McLeod, Pablo Celnik, Sharon L. Ramey
Keywords-EN: optimizing individual rehabilitation, Precision rehabilitation offers, Precision rehabilitation, precision rehabilitation treatments, rehabilitation
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Note: keywords: rehabilitation; precision rehabilitation; causal inference; international classification of functioning; rehabilitation treatment specification system; computational neurorehabilitation
Click to view abstract
Abstract:Precision rehabilitation offers the promise of an evidence-based approach for optimizing individual rehabilitation to improve long-term functional outcomes. Emerging techniques, including those driven by artificial intelligence, are rapidly expanding our ability to quantify the different domains of function during rehabilitation, other encounters with healthcare, and in the community. While this seems poised to usher rehabilitation into the era of big data and should be a powerful driver of precision rehabilitation, our field lacks a coherent framework to utilize these data and deliver on this promise. We propose a framework that builds upon multiple existing pillars to fill this gap. Our framework aims to identify the Optimal Dynamic Treatment Regimens (ODTR), or the decision-making strategy that takes in the range of available measurements and biomarkers to identify interventions likely to maximize long-term function. This is achieved by designing and fitting causal models, which extend the Computational Neurorehabilitation framework using tools from causal inference. These causal models can learn from heterogeneous data from different silos, which must include detailed documentation of interventions, such as using the Rehabilitation Treatment Specification System. The models then serve as digital twins of patient recovery trajectories, which can be used to learn the ODTR. Our causal modeling framework also emphasizes quantitatively linking changes across levels of the functioning to ensure that interventions can be precisely selected based on careful measurement of impairments while also being selected to maximize outcomes that are meaningful to patients and stakeholders. We believe this approach can provide a unifying framework to leverage growing big rehabilitation data and AI-powered measurements to produce precision rehabilitation treatments that can improve clinical outcomes.
[LG-58] A Subsampling Based Neural Network for Spatial Data
Link: https://arxiv.org/abs/2411.03620
Authors: Debjoy Thakur
Keywords-EN: trending research problem, deep neural networks, neural networks, deep neural, neural
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:The application of deep neural networks to geospatial data has become a trending research problem in the present day. A significant amount of statistical research has already been introduced, such as generalized least squares optimization incorporating the spatial variance-covariance matrix, considering basis functions in the input nodes of the neural networks, and so on. However, for lattice data, there is no available literature about the utilization of asymptotic analysis of neural networks in regression for spatial data. This article proposes a consistent localized two-layer deep neural network-based regression for spatial data. We have proved the consistency of this deep neural network for bounded and unbounded spatial domains under a fixed sampling design of mixed-increasing spatial regions. We have proved that its asymptotic convergence rate is faster than that of the neural network of \cite{zhan2024neural} and an improved generalization of the neural network structure of \cite{shen2023asymptotic}. We empirically observe that the rate of convergence of discrepancy measures between the empirical probability distributions of observed and predicted data becomes faster for a less smooth spatial surface. We have applied our asymptotic analysis of deep neural networks to the estimation of the monthly average temperature of major cities in the USA from satellite images. This application is an effective showcase of non-linear spatial regression. We demonstrate our methodology with simulated lattice data in various scenarios.
[LG-59] Designing a Linearized Potential Function in Neural Network Optimization Using Csiszar Type of Tsallis Entropy
Link: https://arxiv.org/abs/2411.03611
Authors: Keito Akiyama
Keywords-EN: recent years, learning for neural, probability measures, neural networks, viewed as optimization
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*Note:
Click to view abstract
Abstract:In recent years, learning for neural networks has come to be viewed as optimization in the space of probability measures. To obtain exponential convergence to the optimizer, a regularizing term based on Shannon entropy plays an important role. Even though the entropy function heavily affects convergence results, there is almost no result on its generalization, because of the following two technical difficulties: one is the lack of a sufficient condition for the generalized logarithmic Sobolev inequality, and the other is the distributional dependence of the potential function within the gradient flow equation. In this paper, we establish a framework that utilizes a linearized potential function via the Csiszár type of Tsallis entropy, which is one of the generalized entropies. We also show that our new framework enables us to derive an exponential convergence result.
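For reference, the two standard definitions the abstract builds on, stated in their usual form (the paper's linearized potential is constructed on top of these but is more involved):

```latex
% Tsallis entropy of index q; Shannon entropy is recovered in the limit q -> 1.
S_q(\rho) = \frac{1}{q-1}\left(1 - \int \rho(x)^q \, dx\right)

% Csiszar f-divergence; the choice f(t) = (t^q - t)/(q-1) yields a
% Tsallis-type relative entropy.
D_f(\rho \,\|\, \mu) = \int f\!\left(\frac{d\rho}{d\mu}\right) d\mu
```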
[LG-60] The Differentiable Feasibility Pump
Link: https://arxiv.org/abs/2411.03535
Authors: Matteo Cacciola, Alexandre Forel, Antonio Frangioni, Andrea Lodi
Keywords-EN: feasible primal solutions, find feasible primal, mixed-integer linear problems, years have passed, widely used heuristic
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:Although nearly 20 years have passed since its conception, the feasibility pump algorithm remains a widely used heuristic to find feasible primal solutions to mixed-integer linear problems. Many extensions of the initial algorithm have been proposed. Yet, its core algorithm remains centered around two key steps: solving the linear relaxation of the original problem to obtain a solution that respects the constraints, and rounding it to obtain an integer solution. This paper shows that the traditional feasibility pump and many of its follow-ups can be seen as gradient-descent algorithms with specific parameters. A central aspect of this reinterpretation is observing that the traditional algorithm differentiates the solution of the linear relaxation with respect to its cost. This reinterpretation opens many opportunities for improving the performance of the original algorithm. We study how to modify the gradient-update step as well as extending its loss function. We perform extensive experiments on MIPLIB instances and show that these modifications can substantially reduce the number of iterations needed to find a solution.
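A compact sketch of the pump loop under the gradient-descent reading: each iteration rounds the current point, then steps back onto the LP relaxation by minimizing an L1 distance to the rounding. The toy instance and the cycle-breaking flip are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: x1 + x2 <= 3.5, 2*x1 - x2 <= 2, 0 <= x <= 3, x integer.
A_ub = np.array([[1.0, 1.0], [2.0, -1.0]])
b_ub = np.array([3.5, 2.0])
bounds = [(0, 3), (0, 3)]

def project(target):
    # LP step: minimize ||x - target||_1 over the relaxation,
    # using auxiliary variables t_i >= |x_i - target_i|.
    n = len(target)
    c = np.concatenate([np.zeros(n), np.ones(n)])
    A = np.block([[A_ub, np.zeros((len(b_ub), n))],
                  [np.eye(n), -np.eye(n)],
                  [-np.eye(n), -np.eye(n)]])
    b = np.concatenate([b_ub, target, -target])
    return linprog(c, A_ub=A, b_ub=b, bounds=bounds + [(0, None)] * n).x[:n]

x = linprog([-1.0, -1.0], A_ub=A_ub, b_ub=b_ub, bounds=bounds).x  # fractional start
prev = None
for _ in range(50):
    x_round = np.round(x)
    if prev is not None and np.array_equal(x_round, prev):
        i = int(np.argmax(np.abs(x - x_round)))      # cycle detected: flip the
        x_round[i] = np.floor(x[i]) if x_round[i] > x[i] else np.ceil(x[i])  # most fractional coordinate
    x = project(x_round)                             # one rounding + projection pump
    if np.allclose(x, np.round(x)):                  # integral and LP-feasible: done
        break
    prev = x_round
print("integer-feasible point:", x)
```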
[LG-61] Forecasting Outside the Box: Application-Driven Optimal Pointwise Forecasts for Stochastic Optimization
Link: https://arxiv.org/abs/2411.03520
Authors: Tito Homem-de-Mello, Juan Valencia, Felipe Lagos, Guido Lagos
Keywords-EN: data-driven optimization problems, exponential growth, availability in recent, recent years, years has led
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Note: Submitted for publication
Click to view abstract
Abstract:The exponential growth in data availability in recent years has led to new formulations of data-driven optimization problems. One such formulation is that of stochastic optimization problems with contextual information, where the goal is to optimize the expected value of a certain function given some contextual information (also called features) that accompanies the main data of interest. The contextual information then allows for a better estimation of the quantity of interest via machine learning methods, thereby leading to better solutions. Oftentimes, however, machine learning methods yield just a pointwise estimate instead of an entire distribution. In this paper we show that, when the problem to be solved is a class of two-stage stochastic programs (namely, those with fixed recourse matrix and fixed costs), under mild assumptions the problem can be solved with just one scenario. While such a scenario - which does not have to be unique - is usually unknown, we present an integrated learning and optimization procedure that yields the best approximation of that scenario within the modeler's pre-specified set of parameterized forecast functions. Numerical results conducted with inventory problems from the literature (with synthetic data) as well as a bike-sharing problem with real data demonstrate that the proposed approach performs well when compared to benchmark methods from the literature.
[LG-62] Climate AI for Corporate Decarbonization Metrics Extraction
Link: https://arxiv.org/abs/2411.03402
Authors: Aditya Dave, Mengchen Zhu, Dapeng Hu, Sachin Tiwari
Keywords-EN: Corporate Greenhouse Gas, Greenhouse Gas, Corporate Greenhouse, sustainable investing, targets are important
Subjects: Portfolio Management (q-fin.PM); Computers and Society (cs.CY); Machine Learning (cs.LG)
*Note:
Click to view abstract
Abstract:Corporate Greenhouse Gas (GHG) emission targets are important metrics in sustainable investing [12, 16]. To provide a comprehensive view of company emission objectives, we propose an approach to source these metrics from company public disclosures. Without automation, curating these metrics manually is a labor-intensive process that requires combing through lengthy corporate sustainability disclosures that often do not follow a standard format. Furthermore, the resulting dataset needs to be validated thoroughly by Subject Matter Experts (SMEs), further lengthening the time-to-market. We introduce the Climate Artificial Intelligence for Corporate Decarbonization Metrics Extraction (CAI) model and pipeline, a novel approach utilizing Large Language Models (LLMs) to extract and validate linked metrics from corporate disclosures. We demonstrate that the process improves data collection efficiency and accuracy by automating data curation, validation, and metric scoring from public corporate disclosures. We further show that our results are agnostic to the choice of LLMs. This framework can be applied broadly to information extraction from textual data.
[LG-63] Solving stochastic partial differential equations using neural networks in the Wiener chaos expansion
Link: https://arxiv.org/abs/2411.03384
Authors: Ariel Neufeld, Philipp Schmocker
Keywords-EN: truncated Wiener chaos, Wiener chaos expansion, partial differential equations, solve stochastic partial, stochastic partial differential
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*Note:
Click to view abstract
Abstract:In this paper, we solve stochastic partial differential equations (SPDEs) numerically by using (possibly random) neural networks in the truncated Wiener chaos expansion of their corresponding solution. Moreover, we provide some approximation rates for learning the solution of SPDEs with additive and/or multiplicative noise. Finally, we apply our results in numerical examples to approximate the solution of three SPDEs: the stochastic heat equation, the Heath-Jarrow-Morton equation, and the Zakai equation.
Information Retrieval
[IR-0] Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora
Link: https://arxiv.org/abs/2411.04051
Authors: Moritz Staudinger, Florina Piroi, Andreas Rauber
Keywords-EN: high reproducibility expectations, medical systematic reviews, downstream research tasks, reproducibility expectations, systematic reviews
Subjects: Information Retrieval (cs.IR)
*Note:
Click to view abstract
Abstract:There are settings in which reproducibility of ranked lists is desirable, such as when extracting a subset of an evolving document corpus for downstream research tasks or in domains such as patent retrieval or in medical systematic reviews, with high reproducibility expectations. However, as global term statistics change when documents change or are added to a corpus, queries using typical ranked retrieval models are not even reproducible for the parts of the document corpus that have not changed. Thus, Boolean retrieval frequently remains the mechanism of choice in such settings. We present a hybrid retrieval system combining Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index. The latter component allows re-execution of previously posed queries resulting in the same ranked list and further allows for time-travel queries over evolving collections, such as web archives, while maintaining the original ranking. Thus, retrieval results in evolving document collections are fully reproducible even when document collections and thus term statistics change.
[IR-1] Data Fusion of Synthetic Query Variants With Generative Large Language Models SIGIR
Link: https://arxiv.org/abs/2411.03881
Authors: Timo Breuer
Keywords-EN: retrieval effectiveness, Large Language Models, retrieval, beneficial for retrieval, query
Subjects: Information Retrieval (cs.IR)
*Note: The definitive version of record was published in SIGIR-AP '24
Click to view abstract
Abstract:Considering query variance in information retrieval (IR) experiments is beneficial for retrieval effectiveness. Especially ranking ensembles based on different topically related queries retrieve better results than rankings based on a single query alone. Recently, generative instruction-tuned Large Language Models (LLMs) improved on a variety of different tasks in capturing human language. To this end, this work explores the feasibility of using synthetic query variants generated by instruction-tuned LLMs in data fusion experiments. More specifically, we introduce a lightweight, unsupervised, and cost-efficient approach that exploits principled prompting and data fusion techniques. In our experiments, LLMs produce more effective queries when provided with additional context information on the topic. Furthermore, our analysis based on four TREC newswire benchmarks shows that data fusion based on synthetic query variants is significantly better than baselines with single queries and also outperforms pseudo-relevance feedback methods. We publicly share the code and query datasets with the community as resources for follow-up studies.
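Data fusion over query variants comes down to merging several ranked lists retrieved for the same topic. A minimal sketch using reciprocal rank fusion, one standard fusion technique (the paper's exact fusion scheme may differ; `k=60` is the commonly used constant):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (best first): each document scores sum of 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Each list: the results retrieved for one LLM-generated variant of a topic.
runs = [["d3", "d1", "d7", "d2"],
        ["d1", "d3", "d5"],
        ["d1", "d2", "d3", "d9"]]
print(reciprocal_rank_fusion(runs))  # documents found by several variants rise
```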
[IR-2] The Essence of the Essence from the Web: The Metasearch Engine
Link: https://arxiv.org/abs/2411.03701
Authors: Rajender Nath, Satinder Bal
Keywords-EN: turn continuing technological, continuing technological progress, Search Engines, Engines, multiple search engines
Subjects: Information Retrieval (cs.IR)
*Comments: 6 pages
Click to view abstract
Abstract: The exponential growth of information sources on the web, together with the continuing technological progress of search tools such as search engines, makes it hard for users to know which tool is best suited to their query and which is not. Metasearch engines reduce this burden by dispatching a query to multiple search engines in parallel and refining their results to return the best of the best. These engines do not own a database of web pages; rather, they send the search terms to the databases maintained by the search engine companies, collect the results from all queried engines, and then compile them for presentation to the user. In this paper, we describe the working of a typical metasearch engine and then present a comparative study of traditional search engines and metasearch engines across different parameters, showing how metasearch engines improve on the other search engines.
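A toy sketch of the dispatch-and-merge pattern described above; the engine callables and the round-robin merge are our illustrative choices, not the paper's design.

```python
# Minimal metasearch sketch: fan a query out to several engines in
# parallel, then interleave and deduplicate their result lists.
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines, per_engine=10):
    """engines: mapping of engine name -> callable(query) -> list of URLs."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        results = {name: f.result()[:per_engine] for name, f in futures.items()}

    # Round-robin interleave and deduplicate, since the same page often
    # appears in several engines' result lists.
    merged, seen = [], set()
    for rank in range(per_engine):
        for name, urls in results.items():
            if rank < len(urls) and urls[rank] not in seen:
                seen.add(urls[rank])
                merged.append(urls[rank])
    return merged
```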
[IR-3] Advanced RAG Models with Graph Structures: Optimizing Complex Knowledge Reasoning and Text Generation
Link: https://arxiv.org/abs/2411.03572
Authors: Yuxin Dong, Shuo Wang, Hongye Zheng, Jiajing Chen, Zhenhong Zhang, Chihang Wang
Keywords-EN: traditional RAG model, RAG model, complex graph structure, retrieval-augmented generation model, graph structure
Subjects: Information Retrieval (cs.IR)
*Comments:
Click to view abstract
Abstract: This study optimizes the existing retrieval-augmented generation (RAG) model by introducing a graph structure, improving its performance on complex knowledge reasoning tasks. Traditional RAG models process complex graph-structured information (such as knowledge graphs and hierarchical relationships) inefficiently, which degrades the quality and consistency of the generated results. This study proposes a scheme that processes graph-structured data with a graph neural network (GNN), so that the model can capture the complex relationships between entities and thereby improve the knowledge consistency and reasoning ability of the generated text. Experiments on the Natural Questions (NQ) dataset compare the proposed model with multiple existing generation models. The results show that the graph-based RAG model proposed in this paper is superior to traditional generation models in quality, knowledge consistency, and reasoning ability, especially on tasks that require multi-dimensional reasoning. By combining an enhanced retrieval module with a graph neural network, the model can better handle complex background knowledge and has broad potential value in many practical application scenarios.
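To make the idea concrete, here is a hedged sketch of one way graph structure can feed back into retrieval scores: retrieved passages are nodes, an adjacency matrix encodes their relations, and a couple of rounds of GNN-style neighbourhood averaging let connected passages reinforce each other before re-ranking. The architecture and names are our assumptions, not the paper's implementation.

```python
import numpy as np

def message_passing(node_feats, adj, n_layers=2):
    """GNN-style propagation: each node mixes its own embedding with the
    mean of its neighbours', so relational context flows between passages."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    h = node_feats
    for _ in range(n_layers):
        h = 0.5 * h + 0.5 * (adj @ h) / deg
    return h

def graph_rerank(query_vec, passage_vecs, adj, top_k=5):
    """Re-rank retrieved passages after propagation: passages well connected
    to other relevant passages gain score relative to isolated ones."""
    h = message_passing(passage_vecs, adj)
    scores = h @ query_vec
    return np.argsort(-scores)[:top_k]
```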
[IR-4] Automated LLM enabled extraction of synthesis details for reticular materials from scientific literature
Link: https://arxiv.org/abs/2411.03484
Authors: Viviane Torres da Silva, Alexandre Rademaker, Krystelle Lionti, Ronaldo Giro, Geisa Lima, Sandro Fiorini, Marcelo Archanjo, Breno W. Carvalho, Rodrigo Neumann, Anaximandro Souza, João Pedro Souza, Gabriela de Valnisio, Carmen Nilda Paz, Renato Cerqueira, Mathias Steiner
Keywords-EN: accelerate materials discovery, potentially accelerate materials, potentially accelerate, Knowledge Extraction Pipeline, scientific literature
Subjects: Materials Science (cond-mat.mtrl-sci); Information Retrieval (cs.IR)
*Comments: 16 pages
Click to view abstract
Abstract: Automated knowledge extraction from scientific literature can potentially accelerate materials discovery. We have investigated an approach for extracting synthesis protocols for reticular materials from scientific literature using large language models (LLMs). To that end, we introduce a Knowledge Extraction Pipeline (KEP) that automates LLM-assisted paragraph classification and information extraction. By applying prompt engineering with in-context learning (ICL) to a set of open-source LLMs, we demonstrate that LLMs can retrieve chemical information from PDF documents without the need for fine-tuning or training, and at a reduced risk of hallucination. By comparing the performance of five open-source families of LLMs on both paragraph classification and information extraction tasks, we observe excellent model performance even when only a few example paragraphs are included in the ICL prompts. The results show the potential of the KEP approach for reducing human annotations and data curation efforts in automated scientific knowledge extraction.
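For flavour, a minimal sketch of the ICL-based paragraph classification step: a few labelled example paragraphs are packed into the prompt and the model labels a new one without any fine-tuning. The prompt wording, the labels, and the `ask_llm` callable are our assumptions, not the KEP code.

```python
# In-context learning sketch for paragraph classification. `ask_llm` is an
# assumed callable: prompt string -> completion string (any LLM backend).
FEW_SHOT = [
    ("The MOF was synthesized by mixing zinc nitrate and terephthalic acid "
     "in DMF at 120 °C for 24 h.", "synthesis"),
    ("Figure 3 shows the powder XRD pattern of the activated sample.",
     "characterization"),
]

def classify_paragraph(paragraph, ask_llm):
    examples = "\n\n".join(
        f"Paragraph: {text}\nLabel: {label}" for text, label in FEW_SHOT
    )
    prompt = (
        "Classify each paragraph from a reticular-materials paper as "
        "'synthesis' or 'characterization'.\n\n"
        f"{examples}\n\nParagraph: {paragraph}\nLabel:"
    )
    return ask_llm(prompt).strip()
```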
Attachment Download
Click to download today's full paper list